I am not surprised that the cost of context switching due to I/O readiness can often be roughly equal between async tasks and kernel threads. Normal blocking I/O can be surprisingly efficient because of various factors, such as a reduced need for system calls.
Think about it this way—if you have a user-space thread which wakes up due to I/O readiness, then this means that the relevant kernel thread woke up from epoll_wait() or something similar. With blocking I/O, you call read(), and the kernel wakes up your thread when the read() completes. With non-blocking I/O, you call read(), get EAGAIN, call epoll_wait(), the kernel wakes up your thread when data is ready, and then you call read() a second time.
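For concreteness, the two sequences look roughly like this (a minimal sketch using the libc crate on Linux; fd is assumed to be an already-open socket, epfd an already-created epoll instance, and error handling is omitted):

    use libc::{epoll_ctl, epoll_event, epoll_wait, read, EAGAIN, EPOLLIN, EPOLL_CTL_ADD};

    // Blocking I/O: one syscall; the kernel parks the thread until data arrives.
    unsafe fn blocking_read(fd: i32, buf: &mut [u8]) -> isize {
        read(fd, buf.as_mut_ptr().cast(), buf.len())
    }

    // Non-blocking I/O: read() -> EAGAIN -> epoll_wait() -> read() again.
    unsafe fn nonblocking_read(fd: i32, epfd: i32, buf: &mut [u8]) -> isize {
        loop {
            let n = read(fd, buf.as_mut_ptr().cast(), buf.len());
            if n >= 0 || *libc::__errno_location() != EAGAIN {
                return n; // got data, or a real error
            }
            // Not ready yet: register interest and park the thread in
            // epoll_wait() instead of read(). (A real event loop registers
            // the fd once and reuses the epoll instance for many fds.)
            let mut ev = epoll_event { events: EPOLLIN as u32, u64: fd as u64 };
            epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &mut ev);
            epoll_wait(epfd, &mut ev, 1, -1);
        }
    }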
In both scenarios, you’re calling a blocking system call and waking up the thread later.
Of course, there are scenarios when epoll_wait() returns multiple events, which reduces the number of context switches. But the general result is that it’s not always easy to beat blocking I/O and kernel threads.
Google found that the main cost of context switching isn't really in the syscall boundary but in the task scheduling. That's why their Linux fork has optional userspace scheduling of kernel threads via the switchto syscalls[0]. Essentially, if your thread already knows which thread should run next, it can context switch to it without having to go through the kernel scheduler, which is exactly the situation in these bucket brigade benchmarks.
This benchmark as written is probably underestimating kernel task scheduling cost since only 1 task is runnable at any 1 time, while a realistic multi-threaded system will have more runnable threads to juggle.
Yes, the point of "async" isn't to save CPU cycles, it's to customize the scheduler so that you can prioritize resource use properly.
(E.g., don't switch to the systemd or sshd thread if a customer's web request is timing out.)
That said, doing this right is out of reach of the average programmer, and it's doubtful that the compiler has enough domain-specific knowledge to do this automatically.
"Async" of the Python and node.js fame is yet another thing, a hack to get around their interpreters' inability to use kernel multitasking features because of global locks.
Linux is likely many years from having anything approaching a fully asynchronous system call interface, if anyone were willing to work on it (io_uring makes a huge dent but I don't think it's intending to reimplement everything). Even where async kernel interfaces exist, without a rework of the kernel-internal implementation there is often still a need for a thread for the kernel side to execute on. For example, IIRC this is true for swathes of the vfs implementation at present.
So the whole thing is a bit of a false equivalence. Better interfaces that reduce context switches are desirable, but even where they exist you are often just substituting a user thread for a kernel one, and in the general case there are likely always going to be system interfaces that never make it into the brave new world. Take SysV IPC, for example (a 1975-era API): it seems doubtful anyone would put the effort into making it async, but there will probably still be times when you want to consume those interfaces for compatibility or some other obscure reason.
Also consider the case where a user program needs some substantial thread pools of its own; in some scenarios, reusing resources that must already exist in user space and live in warmed caches makes more sense. Neither async nor Linux threads is "better"; it will always depend on the particular use case, and even then the right answer might well be some combination of both.
> io_uring makes a huge dent but I don't think it's intending to reimplement everything
At this point its proponents are being pretty unapologetic that it will, in fact, reimplement every part of the syscall interface that is actively used.
Then why is it that IO-heavy benchmarks such as the Techempower web benchmark are dominated by async frameworks? The fastest results there are all from async frameworks [1].
And among Rust frameworks the same pattern holds: the fastest Rust frameworks are async, while a synchronous framework such as Rocket is about 20x slower.
Those benchmarks measure one very specific scenario: serving lots of small requests concurrently. Async handles that well because that's exactly the scenario where a single epoll_wait() call will return lots of events.
Presumably the difference would be smaller, or for some frameworks even negative, if each request did some actual and not entirely predictable amount of CPU work (e.g. executing some HTML templating with varying levels of output and perhaps compression), did much more work in general and used more memory (so the memory overhead is proportionally less relevant), and if the benchmark implementations were not permitted to tune exactly for the workload and system (i.e. so that generalized scheduler defaults are used on both the kernel and userspace side). In other words, a more real-world scenario with all the normal complexities, inefficiencies, and development-time constraints.
But yeah, it'd be super interesting to actually see that demonstrated - that'd be quite a lot of work, however.
I doubt you'll find anything as comprehensive and well-presented as the TechEmpower benchmarks, because their particular scenario is one that a lot of frameworks care about competing on (partly because it's difficult enough to be interesting). But I'd expect any benchmark for batch-style processing of large volumes of data would show that.
If your requests are huge. For example, imagine you need to read many huge files into memory.
Whether you read one file after the other sequentially or try to read all of them concurrently won't make a difference, because your disk/RAM bandwidth is going to be the bottleneck anyway.
Trying to do this concurrently requires more work that won't pay off, so it might actually be slower.
Benchmarks are not everything, and the difference between asynchronous and synchronous operation is not the only thing the benchmark is testing (each of these frameworks appears to have its own system for parsing and representing HTTP requests). You should know what usage patterns YOUR application sees, understand the relative cost of engineering time and CPU time for YOUR application, and do tests in YOUR environment.
I'd argue that it's because even though blocking IO is cheaper, it's very difficult to maximise performance in a multithreaded/concurrent context.
You could make faster code with it, but I wouldn't want to maintain it, and you'd have to throw an obscene number of man-hours at it to get that performance.
I mostly agree with you (not the least of which is that blocking I/O is a damn fine API), but the reason that people use async I/O is to have lots of outstanding requests. Typically you would use select (or similar) to service whichever one responds first. That way you can multiplex many I/O streams onto a small number of threads. If threads are memory-intensive, you almost certainly have to do this.
Well, it can, but not always. Remember that if you’re waiting for an event to arrive, that generally involves a syscall, the thread being put to sleep, and then the thread being woken up. Any time you’re doing that, think, “Could I just replace this polling system with a call to read()?”
What io_uring does do is provide a way to poll without needing to wait, but if you haven’t received new events when you poll, you’re not on the fast path any more. Whether you are often on the fast path for io_uring will depend on the particulars of your application and its I/O patterns.
> What io_uring does do is provide a way to poll without needing to wait, but if you haven’t received new events when you poll, you’re not on the fast path any more.
Isn't "not on the fast path any more" a bit absolutist? io_uring's "slow" path is roughly one syscall per iteration, right? That's still many fewer syscalls than one syscall per IO operation (or more if any return EAGAIN/EWOULDBLOCK) as you'd be doing without it. I'm not sure I really care about eliminating that last syscall per iteration; it seems minor in comparison.
Where one iteration and one io_uring_enter() syscall can be submitting hundreds of I/O operations (and you can even run the ring buffers with the kernel set to poll, so you can do zero syscalls if that's not already enough).
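For illustration, that batching looks roughly like this with the io-uring crate (a sketch: fd is assumed to be a regular file read in consecutive chunks, the buffer count must fit in the ring, and exact builder methods may vary slightly between crate versions):

    use io_uring::{opcode, types, IoUring};
    use std::os::unix::io::RawFd;

    // Queue one read per buffer, then submit the whole batch with a single
    // io_uring_enter() syscall.
    fn batch_read(fd: RawFd, bufs: &mut [Vec<u8>]) -> std::io::Result<()> {
        let mut ring = IoUring::new(256)?;
        let chunk = bufs[0].len();

        for (i, buf) in bufs.iter_mut().enumerate() {
            let sqe = opcode::Read::new(types::Fd(fd), buf.as_mut_ptr(), buf.len() as u32)
                .offset((i * chunk) as _)
                .build()
                .user_data(i as u64);
            unsafe { ring.submission().push(&sqe).expect("submission queue full") };
        }

        ring.submit_and_wait(bufs.len())?; // one syscall for the entire batch

        for cqe in ring.completion() {
            println!("read #{} returned {}", cqe.user_data(), cqe.result());
        }
        Ok(())
    }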
> Where one iteration and one io_uring_enter() syscall can be submitting hundreds of I/O operations (and you can even run the ring buffers with the kernel set to poll, so you can do zero syscalls if that's not already enough).
Same for the blocking case. If I do a syscall to read a whole file, it's just one syscall creating millions of I/O operations.
Sure, but that's all you'll ever do with the blocking case: one syscall at a time, while your program sits and does nothing with the CPU, whereas with io_uring you can at least do CPU work while you wait on your IO. So even ignoring the IORING_SETUP_SQPOLL option that requires no io_uring_enter() syscall, a basic usage of io_uring is still going to be faster.
io_uring is a bicycle for IO, and you can ride it as fast as you want to. But it's apples and oranges to blocking IO, which is always stuck in first gear.
> while your program sits and does nothing with the CPU
The CPU can run other threads while the hardware does DMA transfers. The thread just yields when the transfer is started, and a hardware interrupt wakes it up when the DMA transfer finishes.
Sure, but we're comparing the efficiency of one of your program's single threads, because otherwise you could take that same argument you just used and turn it around and say fine, just run another thread then with another io_uring... and you're still ahead. You have to compare at the smallest unit of control plane.
At the same time, multiple threads for a single program introduce context switches which are becoming horrendously expensive compared to the sheer number of IOPS that modern NVMe SSDs can do.
Thread-per-core designs built around io_uring are the future of IO on Linux.
io_uring’s slow path is making one blocking syscall every time you would ordinarily make a blocking syscall.
I am a bit baffled how this could possibly be considered an “absolutist” viewpoint—I am just saying that there exist scenarios where io_uring is not helpful. This should be uncontroversial.
> io_uring’s slow path is making one blocking syscall every time you would ordinarily make a blocking syscall.
That's not correct, io_uring was "absolutely" designed, at least in the technical sense, for zero syscalls in the slow path (if you want to):
IORING_SETUP_SQPOLL
    When this flag is specified, a kernel thread is created to perform submission queue polling. An io_uring instance configured in this way enables an application to issue I/O without ever context switching into the kernel. By using the submission queue to fill in new submission queue entries and watching for completions on the completion queue, the application can submit and reap I/Os without doing a single system call.
You mean when there's only one thing to do per iteration? I'd describe that as when mostly idle. As the system gets more loaded, the one syscall per iteration matters less and less.
io_uring is specifically designed so that zero system calls are necessary while the system is busy. Userspace and the kernel both update ring buffers, and ring buffers can be checked and drained without entering the kernel at all.
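Setting that mode up is only a few lines with the io-uring crate (a sketch; note that older kernels require elevated privileges for SQPOLL):

    use io_uring::IoUring;

    // With SQPOLL, a kernel thread polls the submission ring, so the
    // application can submit and reap I/O purely through the shared ring
    // buffers, with no io_uring_enter() syscall while the poller is awake.
    fn sqpoll_ring() -> std::io::Result<IoUring> {
        IoUring::builder()
            .setup_sqpoll(2_000) // poller thread idles out after ~2s without work
            .build(256)
    }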
io_uring doesn't mean async though. You can also use it for blocking batch execution of syscalls. E.g. when you need to stat hundreds of files or wait for several child processes at once. So with some batch-oriented convenience wrappers it can help threaded code too.
> People often see that there's some theoretical benefit of async and then they accept far less ergonomic coding styles and the additional bug classes that only happen on async due to accidental blocking etc... despite the fact that when you consider a real-world deployed application, those "benefits" become indistinguishable from noise. However, due to the additional bug classes and worse ergonomics, there is now less energy for actually optimizing the business logic, which is where all of the cycles and resource use are anyway, so in-practice async implementations tend to be buggier and slower.
I disagree with this. I feel that async programming is actually much more powerful and expressive than threaded programming, especially with Rust combinators on streams of futures (for example: futures_unordered), which allow you to trivially express complex concurrency patterns (such as: wait for the first two requests to return something and discard the third request's response, and by the way also cancel that request). Async programming also allows for structured programming, where each task is an owned resource of a parent task, which means that lifetimes of tasks can be controlled and runaway threads can't exist (if one is avoiding tokio::spawn). I've been developing [Garage](https://git.deuxfleurs.fr/Deuxfleurs/garage) for some time now (a simple distributed object store that implements a subset of S3, not ready for production!), and I've been in awe at how easy it was to write these complex patterns using async Rust.
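As an illustration of one of those patterns, "keep the first two responses, cancel the rest" is only a few lines with FuturesUnordered (a sketch over generic request futures, not Garage's actual code):

    use futures::stream::{FuturesUnordered, StreamExt};

    // Run the given request futures concurrently, keep the first two
    // responses that arrive, and cancel the rest by dropping them.
    async fn first_two<F: std::future::Future>(requests: Vec<F>) -> Vec<F::Output> {
        let mut in_flight: FuturesUnordered<F> = requests.into_iter().collect();
        let mut responses = Vec::new();
        while let Some(resp) = in_flight.next().await {
            responses.push(resp);
            if responses.len() == 2 {
                break; // dropping `in_flight` here cancels the remaining request(s)
            }
        }
        responses
    }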
The way you have quoted this suggests it is being said by the author, whom you are disagreeing with. It's not: it's being said by someone else (in an issue filed against the repo), and the author also mostly disagrees.
Sorry, yes, this is not from the mouth of the author; however, he seemed to agree with the premise that async/await is unergonomic and that performance is its only reason to exist, which I am trying to dispute (at least in the context of how Rust does it, which is much better than the JS version, for instance).
Indeed. I read it as something written by the author. Double-checking revealed it was written by spacejam, who has posted the same argument over and over here on HN.
It seems to me that these are two orthogonal topics. One thing is how you represent tasks, either using OS threads or async tasks. And the other is how you structure concurrency. Maybe I'm missing something, but I think there is nothing preventing the use of those structured concurrency patterns using OS threads as the base for tasks. Then you get some nice benefits of doing this such as proper stack-traces and easier debugging.
The killer use case for async tasks is when you need hyper-concurrency, e.g. hundreds of thousands of concurrent tasks. In that case, as the article mentions, you can't use OS threads anymore. Of course there are some use cases requiring this level of concurrency (messaging servers come to mind), but there are also many, many use cases where you need a lower level of concurrency, like a few hundred concurrent tasks max. In those cases I think using OS threads can work pretty well, with less complexity.
The advantage of async tasks for structured concurrency lies in task cancellation, which is intrinsically linked to the notion of "task ownership". If you are using an OS thread to offload some task, and then realize that you don't need that task's result anymore, your safest bet is to let the thread run until the end and then discard the results it produces. Other options include adding custom cancellation logic to the thread and remembering to call it at the appropriate time. Nobody checks that you are doing this correctly, which means you may leak resources such as the thread's memory or a TCP connection. On the other hand when using async/await in Rust, the fact of owning a future (i.e. owning the promise that will return you the value when it's done) implies ownership of the task's resources, such as memory, file descriptors, or TCP connections. Dropping the future before it completes means that the task will stop and all resources will be freed/closed immediately, and this is checked statically by the compiler.
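A small sketch of that cancellation-by-drop, using Tokio's timeout as the thing that drops the future (slow_query is a made-up stand-in for work that owns resources):

    use tokio::time::{sleep, timeout, Duration};

    // Stand-in for a task that owns resources (sockets, buffers, ...).
    async fn slow_query() -> &'static str {
        sleep(Duration::from_secs(10)).await;
        "result"
    }

    #[tokio::main]
    async fn main() {
        // If the future has not completed after 100ms, `timeout` drops it.
        // Dropping the future is the cancellation: its resources are freed
        // immediately, and ownership of them was checked by the compiler.
        match timeout(Duration::from_millis(100), slow_query()).await {
            Ok(result) => println!("finished: {result}"),
            Err(_) => println!("cancelled by dropping the future"),
        }
    }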
POSIX thread cancellation has existed with defined (though complex) semantics for ages. It's a ginormous ugly mess, but it is an alternative to run-to-completion or custom logic.
Anywhere you have an .await in async code you could have a checkpoint in a thread that allows for cancellation. That's the main cancellation advantage - that the author is forced to write those to consume other async functions.
So one of the things I realized writing Erlang is that when concurrency is 'free' (or so close as to be indistinguishable in most use cases), more things end up being easy to write concurrently than we traditionally think.
An instance I ran into personally was, effectively, task scheduling. Sure, I could have done the 'normal' thing, of a priority queue being populated from the database on some interval, having some thread reading from that queue, sleeping until the first item needs work, pulling it off, throwing it onto a threadpool. Have to take care to ensure the threadpool is large enough for the maximum amount of concurrency I need, have to make sure that I'm careful in what data structure I use for the priority queue (I need to make sure I'm not adding the same task multiple times to it, and that when adding items to it I'm not locking it), make sure the polling thread can't throw (or at least, when it does, it restarts or kills the program and that then restarts), a few other niggles here and there too. And a whole 'nother level of complexity if tasks lead to follow up tasks (i.e., a task represents a state machine through a series of transitions, which themselves take a sizable amount of time, to where just leaving them on the thread is a bad idea, since it uses up the threadpool).
In a 'free concurrency' world, I just spin up a new concurrent process per task for some window (same as how many items I added to the priority queue). And that's basically it. Each process can step through its state machine, sleeping in between tasks for however long, without issue.
I think the more important aspect of that quote is just about performance vs. code. I see many cases where people are hyperoptimizing on whether or not their framework consumes 2 or 15 microseconds per request when the work they are going to do takes 100 milliseconds.
If you like the async style better, then fine, use it. Sometimes you win like that, where the thing you like better is also faster. But don't worry so much about the performance.
Web frameworks is another place I see this a lot. Crossing the streams, if you've got an incoming web request, unless your framework somehow consumes and discards the web headers, a real web request is already many kilobytes just to represent the incoming headers by the time it gets to your handling code. Using async because it has ~200 bytes per task vs a thread allocating 10K out of the box at that point doesn't make much difference because the HTTP request itself is blowing out the difference.
The spread in orders of magnitude between what is expensive and what is not has gotten so significant on modern systems that you can easily get developers sitting there optimizing nanoseconds while throwing away seconds. The old-school assembly-style premature optimization where we're trying to save every bit and cycle has mostly passed away, but its replacement seems to be this: frantically benchmarking how many millions of requests per second some framework or feature can handle, as if it matters when your code is going to take 500ms.
A lot of web requests take tens of milliseconds due to the latency of speaking to the database. The ability to fire several requests at the database instead of serialising them is one of those optimisations that you really can't do in a threaded model without introducing other asynchronous components.
You have a good point about HTTP request size.
But any high-performance framework would not buffer the whole request to parse it later; it will do incremental parsing.
Meaning you don't need to read all the bytes from the TCP socket before deciding which route to take.
And the handler for that route is given a stream object and will just read as many bytes as it needs.
Speaking of futures_unordered and similar patterns, I think a part of the "async promise" that has failed is the lack of concurrency for a single user request by default in most languages.
That is, the 'easy' path is to write code such as the following (in vaguely C# pseudocode):
var p = await GetUserPermission( username );
var c = await GetServerConfig();
var m = await GetMessageOfTheDay();
Assume each await call is potentially an expensive SQL query or REST API call.
The problem with that is that this is strictly sequential, synchronous code that is merely "dehydrated" and "rehydrated" to reduce overheads during the waiting periods. It is strictly slower when executed on a server that is not very busy! It must be, because it does the exact same work in the exact same order as the ordinary synchronous version, except now with extra state machinery and complex error handling woven throughout by the compiler.
Scalability is not everyone's concern. Scalability is for the FAANG sized companies. I care about the individual user experience, and async does nothing for that by default.
I mean, sure, you can write much more verbose code along the lines of:
var p_t = GetUserPermission(username);
var c_t = GetServerConfig();
var m_t = GetMessageOfTheDay();
await Task.WhenAll(new Task[] { p_t, c_t, m_t });
var p = p_t.Result;
var c = c_t.Result;
var m = m_t.Result;
But no one does this, for some values of no one. I've never seen code like this in the field.
In fact, let's test this. I'm reviewing an asynchronous ASP.NET application developed in 2020 right now. It's a large app, with literally thousands of uses of the "await" keyword, at least 3500 files use it.
The only uses of "Task" static methods are seven calls to FromResult(). That's it. Zero uses of WaitAll(), WaitAny(), or ContinueWith()!
This is typical.
It's not that asynchronous programming is hard, it's that it is unergonomic to gain a latency benefit out of it. Most applications need lower latency, not higher throughput. Hence, for most programmers, most of the time, asynchronous programming is next to useless. It's just extra noise and more failure modes.
That .NET syntax using Task.WhenAll seems quite bad, which might be part of the reason why not many people bother (disclaimer: I don't do C# or ASP.NET). In Rust it would be:
let (p, c, m) = join!(
    GetUserPermission(username),
    GetServerConfig(),
    GetMessageOfTheDay()
);
(you don't even have to write await when using the join macro)
With such simple syntax available it seems obvious to me that one would want to use it as often as possible, and it's also much simpler (and probably cheaper) than dispatching those three tasks to a thread pool.
The original example by me did not assume that asynchronous functions all return the same result type.
Most opportunities for concurrency are between unrelated tasks (because related tasks often have dependencies between them). Unrelated tasks tend to have unrelated return types.
When tasks are unrelated, you also likely don't need them all at the same time for the next stage of the pipeline. You can simply await each one when you need it.
var p_t = GetUserPermission(username);
var c_t = GetServerConfig();
var m_t = GetMessageOfTheDay();
function_to_call1(await p_t, await c_t);
function_to_call2(await m_t);
This does not look any more complicated than a non-async function. Not sure how this example justifies your claims.
Besides, even if your example is valid, the usage of Task.WhenAll has nothing to do with your claims either. The use of async/await is primarily about scalability; being able to make several network calls concurrently is not the major concern. Even if you await each async call, you still achieve better scalability, because threads won't be blocked on async calls and can work on something else.
I guess I am that no one. I come at all this from writing queues from scratch and using threads or processes for concurrency. I also had a lot of fun writing my own networking hot loops with select/poll/epoll/kqueue when my work needed it, so I guess I am extra sensitive to making concurrent things actually concurrent. But I would not dream of making three independent requests like that sequentially. There are other patterns you can use besides waiting for all tasks to finish, especially if you can do some processing after the first ones are done, but all in all, why wouldn't you make them concurrent, aside from liking seeing await/async all over the place?
All you have to do is wrap multiple futures into a single one and then await on the combined one. There is no programming language on earth that can prevent this.
My teams/company uses it all over, so maybe depends on the context you work in?
And FWIW, this explicit form is often unnecessary - if you kick off each task, they will run in parallel, and you then await each task only when its result is needed; it can look a lot cleaner:
var p_t = GetUserPermission(username);
var c_t = GetServerConfig();
var m_t = GetMessageOfTheDay();
var foo = isAuthorized(await p_t);
// more code here
var msg = (await c_t).ServerName + await m_t;
True, this doesn't work in Rust though, because nothing at all happens before the first time you poll a future, so you need an explicit task (but as others pointed out, it's pretty straightforward thanks to the `join!` macro).
I write this kind of stuff all the time because parallelizing long-running tasks without dependencies is one of the easiest wins when it comes to wall-time.
But this kind of optimization is somewhat orthogonal to async/await. You don't need fine-grained async to optimize long-running tasks, you could just throw a bunch of closures into a threadpool for that purpose. Async only makes sense when you're interleaving thousands of tasks with readiness/completion based IO.
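For instance, the three-call example from upthread needs no async machinery at all if you're happy to park three OS threads for the duration (a sketch with made-up blocking functions):

    use std::thread;

    // Hypothetical blocking versions of the earlier calls.
    fn get_user_permission(user: &str) -> String { format!("perms for {user}") }
    fn get_server_config() -> String { "config".into() }
    fn get_message_of_the_day() -> String { "motd".into() }

    fn main() {
        // Scoped threads: the three independent calls run in parallel and are
        // joined before the scope ends. No async runtime involved.
        let (p, c, m) = thread::scope(|s| {
            let p = s.spawn(|| get_user_permission("alice"));
            let c = s.spawn(get_server_config);
            let m = s.spawn(get_message_of_the_day);
            (p.join().unwrap(), c.join().unwrap(), m.join().unwrap())
        });
        println!("{p} / {c} / {m}");
    }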
It's rare that I personally write this kind of optimisation in web apps, but I quite often do for backend processing services (and then, only for "embarrassingly async" operations such as hitting a database or HTTP API).
You don't need to create a Task[] because WhenAll is set up for varargs. This is fine:
await Task.WhenAll(p_t, c_t, m_t);
Or you can just await the threads before you need them. They're already started and running at this point.
You also probably want to avoid using Result and just await the completed task for the nicer unwrap syntax. Plus, you don't want to get into the habit of using Result, as it's a blocking call. Same with WaitAll and WaitAny - ideally you would never use those. ContinueWith is also not really needed if your style is to use the plainer await syntax. Those methods are more to bridge blocking and async code, so an async-from-the-start app might use async extensively and never use those methods.
I have written a few programs like this - but not in languages which have async/await! In languages with manual async, getting here by refactoring is fairly easy.
I've been using C# for around 20 years, basically since it was first released.
I never personally had any issue with working with threads and locks, finding it simple enough to reason about them, though I understand lots of people felt differently. When async/await first came to C# around 10 years ago, I grumbled because I didn't see the point; I found it much harder to reason about the flow of code, and initially at least, stack traces were a shitshow (things are much improved, but there is still a lot of cruft in async stack traces).
But async/await was heavily pushed, and "real" threading is almost relegated to the sidelines for most developers. Although having said that, I find that junior devs in particular really struggle to really grok async/await.
Anyway, several more years on, and I have mixed feelings about async. Because Microsoft has gone all-in on async/await, I think it's really easy to work with when building web apps and APIs with ASP.NET Core/MVC - there is barely any "developer overhead" at all, really. Web apps very often hit things like HTTP APIs and databases, and with how easy it now is, there is little reason not to use async/await. Yes, for small loads there is a tiny performance loss due to the runtime setting up async state machines, but it really is almost always completely insignificant - even moreso with the advent of ValueTask, and again more recently with pooled ValueTasks. Yet the gains can be tremendous.
But for non-web apps/APIs, I feel differently. I spend a lot of time writing server-side processing services, and things like Windows services for desktops (in the infosec space), and I've gone all-in on async/await because Microsoft has gone async-first. Hell, a lot of stuff is async only now, so unless you want `.GetAwaiter().GetResult()` everywhere, you have little choice. Anyway, these systems are more complex than web apps, because with web apps, most of the real complexity is hidden away in the framework. But here you have to deal with work queues, caching, pooling, serialisation etc all by yourself. And with async/await, it can be hard to reason about the flow of code, and it's really easy to break things in ways that are really painful to diagnose. And it means that every.single.stacktrace contains async cruft that you need to sift through. Which is not fun.
Anyway, this is much longer than I meant, but my conclusion is that I'll continue to use async/await for web apps and REST APIs (because, why not), but for services, I'm going back to the threadpool, green threads and synchronization primitives, and only using async/await in a limited way where it provides clear value - not async all the way down from the entrypoint.
> Anyway, this is much longer than I meant, but my conclusion is that I'll continue to use async/await for web apps and REST APIs (because, why not), but for services, I'm going back to the threadpool, green threads and synchronization primitives, and only using async/await in a limited way where it provides clear value - not async all the way down from the entrypoint.
AFAIK, .NET doesn't support "green threads" and they have repeatedly confirmed that there are no plans to do so. Additionally, the M:N threading model has serious interop issues, as is evident in Go, which is a no-go for systems languages. Personally, I don't see a need for green threads, since kernel threads are fast enough and don't use as much RAM as people tend to believe. And when they're not enough, sure, go async/await.
I find your comment about stack traces a bit weird: of course, when all your work is sequential and you only use threads, you get a nice stack trace for free, whereas async stack traces need a lot of support from the tooling.
But most of the time you use not only threads but also several synchronization primitives (locks, channels, etc.), and when doing so, as far as stack traces go you are in an even worse situation than what async stack traces give you ("some thread changed this shared-memory value and now it's not what you expected, but you have no easy way to know which one did it and when, good luck").
Maybe if you spray threads around at random :), but in real-world use I find it much easier to pinpoint where the problem occurred, and the path taken to get there. Also, at least with threads you can get the thread ID and/or name.
Regarding shared, mutable state - if multiple async "threads" can access that state, then you still need to guard it, but usually with an async-capable means.
> Regarding shared, mutable state - if multiple async "threads" can access that state, then you still need to guard it, but usually with an async-capable means.
Sometimes, but not as often, because the scope of your async function is often the only “shared state” you need.
async/await is for the concurrent stuff and threads are for the parallel stuff. Two different things. If your code is I/O-bound, use async/await. If your code is processor-bound, use threads.
Async/await paradigms exist in several languages, but with C#, async/await is generally considered the "modern" and unified way to handle both IO bound and CPU bound tasks.
The runtime will generally schedule IO bound tasks to run on the threadpool.
> The runtime will generally schedule IO bound tasks to run on the threadpool.
Well, that's not correct. Unless you explicitly call Task.Run or Task.Start (or other similar methods) no new thread is created. The compiler generated state machines don't require the threading mechanism to work. In fact the overhead for async/await is mostly the extra code generated for the state machine and error handling. At runtime, there's no thread switching overhead.
Yes, I meant using Task.Run; I was simplifying, as I'd assumed (wrongly) you were familiar with async/await from another language.
Otherwise, from memory, the runtime spec doesn't actually guarantee that await won't run on a threadpool thread - it will under certain circumstances.
And then there are further nuances if there is a synchronisation context and ConfigureAwait(false) is used, as the continuation will be scheduled on a threadpool thread.
> Async programming also allows for structured programming, where each task is an owned resource of a parent task, which means that lifetimes of tasks can be controlled and runaway threads can't exist
Arguably, structured concurrency as described above is easier to obtain when using threads as your underlying mechanism, because the vast majority of code is serial[1]. That there are a handful of critical regions where you want to express concurrency relationships doesn't mean we have to discard threads. That's throwing the baby out with the bath water.
Self-promotion: I had stumbled on the idea of "nurseries", independently and many years before the above were published. See https://github.com/wahern/cqueues It's nominally a non-blocking "threading" API for Lua. (In Lua coroutines are also called threads.) But note the plural, continuation queues. It's trivial to instantiate a queue, which is similar to a nursery. This was by design. Many cqueues projects naturally end up with a tree of thread controllers/schedulers. It doesn't work on Windows (yet) because it relies on the fact that kqueue, epoll, and Solaris Ports descriptors can be recursively polled.
> Async programming also allows for structured programming, where each task is an owned resource of a parent task, which means that lifetimes of tasks can be controlled and runaway threads can't exist (if one is avoiding tokio::spawn).
That's unfortunately far less reliable in practice than it seems on the first glance: You might never know whether any async function you call spawns something else, or makes use of `spawn_blocking`, `block_in_place` or any other function which isn't a pure state machine.
If you try to cancel any of those, you will get either excessive blocking or end up with runaway tasks.
A better solution for this is real support for structured concurrency, as available in Kotlin, Python Trio and now coming to Swift async functions. This doesn't really require immediate cancellation - as favored by Rust futures. It works better with cooperative cancellation, where cancellation is requested asynchronously and ongoing tasks are supposed (but not forced) to listen and follow the cancellation recommendation.
That's interesting to hear - but how much of an investment is it to climb that mountain (or hill) to the point where you're comfortable working with the async model?
I'm not the best qualified to answer this question as I do spend a lot of time reading about programming languages in general, and even though I was able to grasp Rust's async/await very fast, I probably owe it to previous knowledge of a relatively large variety of programming paradigms. I'll thus rely on other commenters that seem to agree that it's really not as hard as you would expect. In particular Rust helps a lot in making sure you don't make too many mistakes so I'd wager that learning to do correct async/await in Rust is probably easier than in, say, Javascript.
Rust's async model takes very little time to grasp. It's very explicit. Nothing runs in the background (contrast that with NodeJS). You have strong static types to help you to know when you got a Future, you can decide where/when to await it.
It's programming with threads, where you have a thread pool, pipes to put tasks onto it, and a helper function/macro. Await does this under the hood, of course.
You can do exactly the same thing, after all async is just a big auto-generated state-machine that puts jobs onto a thread pool and waits for them.
Nginx is hand crafted (artisanal!) event-driven C, which is exactly how async runtimes also work. A big loop (the event loop, usually an infinite `while` blocked on epoll()). For example NodeJS uses libuv for this.
The big advantage of first class async support is that we don't have to do this by hand. Plus it makes some optimizations easier (eg. putting things on the stack instead of assigning each event handler a slice of some global [heap allocated] structure).
Or maybe I'm simply misunderstanding what you meant. In that case could you clarify, please?
Not OP but: I found it surprisingly manageable. I think the disconnect for many people is they think they'll understand it simply by using it [0]. For me, investing a short time reading some of the design articles/documents really helped it click.
[0]: Which is fair, I wouldn't be surprised if this was the best way for some.
Using threads (or, my preference, stackful coroutines) does not prevent you from using futures for pipelining and composing computations. But it avoids having to explicitly [1] chain continuations to wait on them.
[1] I count await as explicit as it forces the awkward top level only suspend model.
The throughput increase in I/O scenarios with many tasks is due to the number of supported concurrent processes and Little's law; it has little to do with context switching time, which has a negligible impact on the throughput in these use-cases: https://inside.java/2020/08/07/loom-performance/
Low context switch latency only matters when the number of tasks is very small (their data all fits in the cache), and the workload is entirely computational. Otherwise, even the fastest implementation is ~60 ns, which is the cost of a cache-miss, and the compiler can't optimise things into a simple goto because the dispatch goes through a scheduler that has a megamorphic call-site.
So memory is much more important for I/O use-case throughput, and while it is true that the kernel doesn't commit the full stack memory on thread creation, it's misleading to think that you get good memory usage. For one, once the memory is committed, it's never uncommitted (although it can be paged out). For another, the granularity is that of a page, i.e. at least 4K, which can often be much higher than what a task requires.
> It is hard to pin down exactly how the alleged advantages would arise.
For I/O use-cases the answer is here:
> the async version uses about 1/20th as much memory as the threaded version.
This could translate to 20x throughput -- due to Little's law -- although usually less because there are other limits, like network saturation.
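As a rough, made-up illustration of the Little's law arithmetic (all numbers invented for the example): with L = λ × W (concurrency = throughput × latency) and latency W fixed by the backend at, say, 10 ms, a 1 GiB memory budget gives roughly L ≈ 1,000 requests in flight at ~1 MiB per thread, so λ ≈ 100k req/s, versus L ≈ 20,000 in flight at ~50 KiB per async task, so λ ≈ 2M req/s. That's the 20x, and it holds only until CPU or the network saturates first.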
I've had very good experience with using buffer variables to copy-"prefetch" unpredictable costly fetches, e.g. from cache lines that get touched by several cores for communication.
And I only actually use them after one iteration of whatever I'm doing, so the core can fetch the memory content without having to stall, because I don't use it until later.
I'm not sure how realistic that is inside a kernel thread scheduler, but it sure is useful in user space for task based libraries.
That's a huge help. I only need about 20 threads in Rust, some of which are compute-bound. So involving "async" is totally the wrong tool for the job. Goodbye, Tokio.
> So involving "async" is totally the wrong tool for the job.
Sadly, with so many things having gone async-first (or async-only) it's become difficult not to end up with an async runtime anyway, or not to be forced to use an async system. I wanted to build a small web-based tool for local use and didn't really find anything that was not async.
I’d happily take an async-by-default world over a world where some APIs only exist through blocking calls. A classically threaded program can easily block on a future, but wrapping a blocking call in an otherwise asynchronous program is complicated, expensive and error prone work.
> This blog post describes a proposed scheduler for async-std that did not end up being merged for several reasons.
I don't think it's a particularly good idea in the first place - it's basically an automatic watchdog-driven block_in_place(). It doesn't remove the problem of blocking in futures, it just limits the damage to the local task rather than blocking the entire executor.
That's fine in the simple case of future-per-task, but it's pretty common to be polling multiple futures concurrently within one, so it's not a general solution.
Each worker thread runs in a loop executing a queue of jobs. On every iteration it sets an atomic progress flag to true.
The runtime in which it's contained polls its workers every 1-10ms, atomically swapping in false and checking to see if the previous value was also false - if so, it steals its task queue and spins up another worker to execute it.
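In sketch form (names invented; the real async-std scheduler code differs):

    use std::sync::atomic::{AtomicBool, Ordering};
    use std::sync::Arc;
    use std::thread;
    use std::time::Duration;

    struct Worker {
        made_progress: Arc<AtomicBool>, // set to true at the top of the worker's loop
    }

    // Watchdog: every few milliseconds, swap `false` into each worker's flag.
    // If the previous value was already `false`, the worker hasn't finished an
    // iteration since the last check, so treat it as blocked.
    fn watchdog(workers: Vec<Worker>) {
        loop {
            thread::sleep(Duration::from_millis(10));
            for w in &workers {
                if !w.made_progress.swap(false, Ordering::AcqRel) {
                    // steal_queue_and_spawn_replacement(w); // hypothetical
                }
            }
        }
    }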
> though obviously only works when you can « afford » a multithreaded scheduler.
Yeah, for example in comparison actix-web only uses single threaded workers - one per core. Future in actix-web doesn’t have to be Send or Sync, and I think it’s incompatible with what async-std is doing here. That design is almost certainly one of the reasons actix-web tops phoronix
It also really doesn't scale. It'll do fine on your average <10 core laptop, but once you get on a multi-package system you're going to find you're constantly thrashing memory because it is making disruptive scheduling decisions and your pooled tasks have poor context locality.
> A classically threaded program can easily block on a future, but wrapping a blocking call in an otherwise asynchronous program is complicated, expensive and error prone work.
It’s really not though, at least as long as the parameters and results are Send. For instance Tokio has a spawn_blocking which runs the function on one of the blocking threads it spawns on-demand specifically for that use.
Meanwhile « blocking on a future » requires adding and managing an entire async runtime and its interactions with the rest of the program, and locking up the runtime is a very real possibility.
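For comparison, both directions look roughly like this with Tokio (a sketch; checksum_file is a made-up example of blocking work):

    use std::path::PathBuf;

    // Async code wrapping a blocking call: Tokio runs the closure on its
    // dedicated blocking thread pool so the async workers aren't stalled.
    async fn checksum_file(path: PathBuf) -> std::io::Result<u64> {
        tokio::task::spawn_blocking(move || {
            let data = std::fs::read(&path)?; // ordinary blocking std I/O
            let sum: u64 = data.iter().map(|&b| u64::from(b)).sum();
            Ok(sum)
        })
        .await
        .expect("blocking task panicked")
    }

    // A classically threaded program blocking on a future: this does mean
    // owning a runtime, but it's a couple of lines.
    fn checksum_sync(path: PathBuf) -> std::io::Result<u64> {
        tokio::runtime::Runtime::new()?.block_on(checksum_file(path))
    }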
I understand the desire to stave off dependencies but managing an async runtime should only be a simple function call or two. How do you end up locking up the runtime with something like that?
These days it is possible to eliminate almost all blocking calls in Linux apps. File opening was a long persisting one but io_uring fixes that. Async sockets and file i/o to already-open fd's have been around forever.
You might like Zig's attitude towards this question. Async/sync decision is a single compile-time decision there. The jury's still out whether that's a good idea though.
Other sync functions can use this asynchronous IO completion code in a synchronous style (as this snippet shows) and still get all the zero-syscall and asynchronous performance of io_uring. What this is actually doing under the hood is filling SQEs into io_uring's submission queue ring buffer and then later reading completion events off io_uring's completion queue ring buffer, so it's fully asynchronous in the I/O sense but this hasn't spilled out and leaked over into the control flow. The control flow is as it should be, nice and simple and synchronous.
Beyond this, Zig still allows you to explicitly indicate concurrency with the `async` keyword, for example if you wanted to run multiple async code paths concurrently.
But the crucial part is that Zig's async/await does not force function coloring on you to do all of this: https://youtu.be/zeLToGnjIUM
Pretty incredible on Zig's part to be able to pull this off. Huge kudos to Andrew Kelley. Also, thanks to Jens Axboe and io_uring, what you saw above was first-class single-threaded or thread-per-core, there's no threadpool doing that for you, it's pure ring buffer communication to the kernel and back, no context switches, no expensive coordination. Pure performance. There's never been a better time for Zig's colorless async/await. The combination with io_uring in the kernel is going to be explosive. It's a perfect storm.
Ease of understanding multithreaded code and wait on results or perform standard control flow constructs in a multithreaded environment?
This is a great example in Node on useful combinators that with async await make it easy to express parallel programming concepts with familiar tools. No manual IPC, no fork/join child PID/thread ID handling, etc.
The same abstractions (or many of them) exist in Rust, but I think the above is illustrative of the ways we can combine async object returning functions and then use await to hide the complexity of the state machines needed to drive them.
That this abstraction that makes code easy to read and write also performs better is the icing on the cake. The former prevents bugs and keeps code quality high, and that is worth much more.
I don’t know about Rust but in every other language I’ve used threads were easy to use and understand, except when it came to some bits like signals, which at least on Linux are no longer a big problem. Main thread runs a hot loop to look for data to process, then hands it off on a queue to a worker thread out of a pool. That thread is then solely responsible for processing the event and passing the result either back to the main thread or to the next thread in the pipeline via the same queue mechanism. Last thread to handle the result or the exception frees the resources. It might not be ergonomic for all types of code but it certainly isn’t hard to understand what everything is doing and easy enough to debug since each thread can be tested individually to check its functionality.
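In Rust that shape is essentially a channel plus a handful of threads (a sketch; the shared receiver behind a mutex is the same pattern the Rust book uses for its thread pool example):

    use std::sync::{mpsc, Arc, Mutex};
    use std::thread;

    fn main() {
        let (tx, rx) = mpsc::channel::<String>();
        let rx = Arc::new(Mutex::new(rx)); // workers take turns pulling jobs

        let workers: Vec<_> = (0..4)
            .map(|id| {
                let rx = Arc::clone(&rx);
                thread::spawn(move || loop {
                    // Lock only long enough to pull one job off the queue.
                    let job = match rx.lock().unwrap().recv() {
                        Ok(job) => job,
                        Err(_) => break, // sender dropped: clean shutdown
                    };
                    // Each worker owns the job it received and is solely
                    // responsible for processing it.
                    println!("worker {id} handled {job}");
                })
            })
            .collect();

        for i in 0..10 {
            tx.send(format!("event #{i}")).unwrap();
        }
        drop(tx); // close the queue
        for w in workers {
            w.join().unwrap();
        }
    }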
You just described having the main thread have to wait on multiple threads to complete processing of data, worker threads handling signals and IPC and moving data between threads, and then some sort of shared signaling to ensure resources are freed.
So, code potentially laden with use after frees, double frees, shared and mutable data, and so on.
No offense to you, but I would be leery of trusting that code in any languages except a handful. Certainly not C/C++, and if it were written in Rust, I would hope it would use a thread combinator library and channels.
It certainly is easy to make a mess of it with C/C++. It can be done well and safely, but there are no guard rails. I have written this code in C and trusted it to run as intended, and it did. I wouldn't stake human lives on it, but that wasn't my requirement at the time. Valgrind and other code analysis tools certainly didn't complain, and I had no memory leaks. Rust didn't exist at the time. One specific project wrangled about 1000 worker threads, a logging thread, a network server thread, a signal processing thread, and a main control thread, to the tune of a very large number of requests per second on commodity hardware. In running it for, I think, 4 years, I had one memory leak initially that Valgrind quickly found. Could probably write that service with a lot fewer LOCs today with a language like Rust, of course, and with all kinds of memory safety. But at the time it worked well. Oh, and it had to do all kinds of fun low-level networking stuff with elevated privileges, so double danger :)
Sure, threads are easy to understand. The difficulty is when you get a concurrency bug, but that can happen with single-threaded async/await code anyway.
Also, threads are definitely not easy to use in all languages. E.g. C++ gives you very little help (no channels, for example), and JavaScript makes starting threads difficult, while moving/sharing memory is limited to primitive arrays.
A queue implementation in C is easy to create and understand if you don’t have a library for it handy. Combined with a mutex and/or a spin lock and once you’ve grokked pthreads’ mental model you should have the primitives. But those are all guns that shoot both ways if you aren’t careful.
This is a confused notion. A useful way to think of Go and Erlang is that they automatically and transparently insert async/await each time you call a function that performs I/O. Messaging between different application tasks is completely orthogonal and can have use cases in languages with async/await as well.
A. Implicit messaging using the language's function syntax (async/await).
B. Direct messaging using a message passing feature of the runtime (Erlang, Golang)
Note: I mean "messaging" in the context of a single OS process, that possibly has many threads (so within a single language runtime).
Async/await is still implicit messaging, but it appears like a regular function call - which in my opinion is easier to understand. Using function args/return for input/output is something every developer already knows.
In contrast, Erlang and Golang require you to use some type of messaging feature in addition to functions.
> A useful way to think of Go and Erlang is that they automatically and transparently insert async/await each time you call a function that performs I/O
The part they are missing from async/await is the ability to easily get return values without messaging, and do this recursively for a large tree of functions.
E.g. getting a return value from `go x()` requires messaging, but with async/await you could do `const p = x(); const ret = (await p); // return value received at a later time with no messaging.`
Both of them will require you to create some type of messaging topology to return the values (which makes your program a mixture of (regular functions + messaging features) vs async/awaits "everything looks like a function").
> The part they are missing from async/await is the ability to easily get return values without messaging, and do this recursively for a large tree of functions.
No, they do not. In Elixir for example if I call:
bytes = File.read!("filename.txt")
`bytes` will have the data returned from the function call immediately, with no need for message passing or awaiting the result. Under the hood, it is still asynchronous evented I/O. If I want to explicitly await for flow-control reasons (await all of, or one of, multiple events), that is available in the stdlib in the `Task` module (e.g. `Task.async/1` and `Task.await/1`).
Although this emulates async/await (AA), underneath the "async"-emulated functions is message passing that must keep track of the connection between requests and responses at runtime (e.g. with state mapping request IDs to response IDs).
I think the key issue is that the inputs and outputs are disconnected in the static program text (and only connected dynamically at runtime).
Two contexts that matter for understanding how a system transitions between states are:
1. Program editing/reading.
2. Runtime.
I think AA is superior for understanding the system as a whole in both of these contexts, because at edit time the IDE's jump-to-definition/show-all-usages lets you understand every function that will be called, and at runtime you can get a stack trace to understand where the current function came from and where it is going.
With message passing runtimes, both 1 and 2 require extra mental models on the part of the programmer, because they also need to understand the network topology (which either is not possible statically, or requires extra tooling on top of functions).
Message passing breaks down your system into CSP's, which makes it easy to understand each sync process, but hard to understand the whole system, as the same program-writing-process that allowed you to break down your components is working against you when you need to put them together again to understand the whole system.
I could be wrong as I have not used modern IDE's or debugging tools with message passing runtimes lately.
It's a runtime for running lightweight tasks (`Future`s, async functions) on top of it. What is not async about it? And of course it still needs posix threads. The executor needs to run somewhere, and the only somewhere that an OS offers is a thread.
Sure, but I didn't think anything about async functions implied running tasks. Isn't it just syntactic sugar over futures? You certainly don't need to use the tokio runtime in order to use async functions.
So, it's not clear why you'd abandon the async syntax just because you're compute bound.
> A context switch takes around 0.2µs between async tasks, versus 1.7µs between kernel threads. But this advantage goes away if the context switch is due to I/O readiness: both converge to 1.7µs.
This is a big surprise.
If you look at the Techempower web benchmark [1], the performance of actix-web is about 20x higher than that of Rocket.
The common explanation is that actix-web is async and hence much faster than Rocket which relies on kernel context switching.
But if Rust async and kernel threads have the same switch time, as shown by this benchmark, then why is actix-web so much faster than Rocket?
I work in ultra-low latency space and agree with GP.
This comparison makes no sense as OS-level context switch is completely different from a task-switch within the same native thread. The Rust ones from that benchmark are essentially fibers, not threads. You will see similar performance for switching fibers if well implemented in Java, C++ or other natively compiled language. This has nothing to do with Rust.
"Linux thread context switch time" is a meaningless metric, since Linux will switch thread context regardless of what you choose to run on your computer.
Any "async" switches are additional overhead; you don't get to not have kernel preemption just because your Rust thread is now switching contexts "asyncly".
There are benefits to having an additional user-mode scheduling mechanism inside your kernel thread, but saving CPU cycles isn't one of them.
> switches thread contexts regardless of what you're running
my point was that thread context switches caused by preemption happen at an entirely different time scale than the rate of context switches caused by syscalls (if the system is doing any meaningful level of IO)
How does Rust async compare to Goroutine, Erlang threads, Javascript async, Java async in performance and memory usage? Is there any benchmarks for that?
There were benchmarks and a discussion on this on reddit recently comparing goroutines to tokio.
If I recall correctly tokio was slower than goroutines but if you set the right settings it could be almost as fast.
https://www.reddit.com/r/rust/comments/lg0a7b/benchmarking_t...
I'd think mostly similar. Goroutines are "stackful" coroutines, though, so their memory use will be higher. They have an interesting stack copying model, so I'm not sure if they require as many pages as POSIX threads do. (Having a "denser" memory space and no guard page requirement would mean you could use huge pages and thus have much less TLB pressure.)
The discussion should feature prominently somewhere on top that the comparison is between the Tokio async runtime and Linux threads. Reason being, people not familiar with Rust may assume that the discussion applies to Rust async in general, when it doesn't necessarily. Sure, in practice Tokio is pretty close to being the de-facto async runtime in Rust. But it's not the only one, as Rust's async language constructs allow for different runtime implementations that may be optimized for different use cases.
I feel like this analysis is missing some more nuanced points about stack memory.
Yes, pages will only be allocated for a thread's stack when the thread actually uses them. However, the thread does not release said memory afterwards. The memory can only be reused by the same thread. If a thread ever once does something that temporarily allocates a bunch of stack space, then it forever consumes that space going forwards even when no longer needing it. If you have 10,000 threads and each one of them happens to, at some point in its lifetime, use 1MB of stack space and frees it, then you are now using 10GB of RAM on mostly-unused pages.
Now you might say "what on Earth would ever use 1MB of stack???", but the problem is, in normal programs with few threads, there's no problem with a function temporarily using a ton of stack, and so random things feel free to do so. Maybe some library call you make likes to allocate a temporary buffer on the stack and you don't even know it. There's also normally no problem with doing some deep recursion every now and then, so it happens. Often, stack allocation is data-dependent (e.g. recursive descend parsing). So if you try to strictly limit your stack space then you risk running into random stack overflows or maybe even security issues. And if you do find a limit that works, it's still probably much larger than the average usage, so you're still wasting a bunch of memory.
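And if you do decide to cap per-thread stacks, the knob exists, but the tradeoff is exactly as described (a sketch; 64 KiB is an arbitrary guess at a "limit that works"):

    use std::thread;

    fn main() {
        let handle = thread::Builder::new()
            .stack_size(64 * 1024) // arbitrary cap: too small, and deep recursion
            // or a big stack buffer in some library call will overflow
            .spawn(|| {
                // worker body
            })
            .expect("failed to spawn thread");
        handle.join().unwrap();
    }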
IIRC, Go avoids this problem by growing and shrinking goroutine stacks (segmented stacks originally, copied contiguous stacks nowadays), but C/C++/Rust do not. (I think Rust tried segmented stacks at one point, but later gave up on them because of the complexity?)
In contrast, async tasks only hold onto the memory they actually need for live data at any particular moment. If an async task invokes some deeply-nested function and uses a bunch of stack space, it doesn't really matter, because all the tasks run on the same thread, so the next task to call that function reuses the same pages rather than allocating new ones.
(There's actually a similar issue with heap space. Memory allocators that perform reasonably with multiple threads typically maintain per-thread freelists, so if you have lots and lots of threads, you end up with a bunch of freed memory stuck in those freelists. Some allocators, like the new tcmalloc, are starting to use per-core freelists instead, which may avoid this problem.)
Well, here's what it looked like on my MacBook Pro with M1/16GB:
    M1-MBP async-brigade % time cargo run --release
    500 tasks, 10000 iterations:
    mean 761.403µs per iteration, stddev 8.929µs (1.522µs per task per iter)
    cargo run --release 3.21s user 4.60s system 99% cpu 7.818 total

    M1-MBP thread-brigade % time cargo run --release
    500 tasks, 10000 iterations:
    mean 787.149µs per iteration, stddev 67.289µs (1.574µs per task per iter)
    cargo run --release 0.94s user 7.19s system 100% cpu 8.081 total
I ran it a few times and the numbers came up rather similar each time: async-brigade finished in 760.273µs-764.928µs while thread-brigade took 784.510µs-796.323µs.
As macOS doesn't have taskset, I can't easily set affinity. I tried the workaround documented elsewhere of using Xcode's Instruments to reduce the number of CPU cores, but the setting would always re-enable itself at 8 cores, so that didn't work.
In the past couple of years I started to use a heavier functional style for my code.
What I noticed is that the syntactic benefits of async/await matter less when most of your application logic lives in pure functions, since you greatly reduce the amount of code inside async functions.
When I started using async/await in JS 4-5 years ago I thought: "How could we have lived without this for so long?". These days I don't care much about it.
To me it looks like the main advantage of async is memory usage, which is kind of expected because of the overhead of a thread. But if you do not need lots of threads, it doesn't look like there is a huge benefit to going async. Or am I missing something here?
That was my conclusion as well. I have only found async useful in situations where a service has to deal with a large number of incoming requests, e.g. a web server.
I think the async benchmark could be faster still when pinned to a single core if a single-threaded runtime were used, and possibly if single-threaded channel implementations were used, but then it's becoming a bit academic. Really, what async gives you is a programming style that's very similar to using blocking sockets but achieves select()-like performance when doing I/O. That, and it means you don't need a special thread for timers (or, even worse, a thread per timer), as that's hidden away by the async runtime implementation and just works.
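For what it's worth, a single-threaded runtime is just a builder option in Tokio; a sketch (assuming Tokio 1.x with the relevant features enabled) looks like this:

    use std::time::Duration;

    fn main() {
        // Build a current-thread runtime: every task is scheduled cooperatively
        // on this one OS thread, so no cross-core task migration happens.
        let rt = tokio::runtime::Builder::new_current_thread()
            .enable_all() // enable the I/O and timer drivers
            .build()
            .expect("failed to build runtime");

        rt.block_on(async {
            let handle = tokio::spawn(async {
                tokio::time::sleep(Duration::from_millis(10)).await;
                42
            });
            assert_eq!(handle.await.unwrap(), 42);
        });
    }

Whether this actually beats the multi-threaded scheduler for the bucket-brigade workload is exactly the kind of thing you'd want to measure rather than assume.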
It could be very interesting to see similar comparisons with other operating systems like FreeBSD with kqueue or DragonflyBSD with Light Weight Kernel Threads.
what kind of problem were you trying to solve that you found sizing a thread pool to be difficult? generally when I've worked on high performance server code I've been coding with a target machine in mind, so it's more a matter of mapping the thread pool size to the resources available on that machine. but I'm interested to hear about circumstances where it wouldn't be easy.
If you have downstream nodes which may have large amounts of latency in some scenarios, then you may need a huge thread pool.
If you add a huge thread pool, and then those downstreams don't have a large latency, then you end up accepting a huge amount of work and then are CPU starved.
So in order to correctly size your thread pool, you need to understand all your downstream latency, and adapt to it.
Compared to an async runtime, which just handles this scenario, it's very painful.
Even if you get this roughly right, the scheduler is very unhappy when you have lots of threads - it tends to make incorrect scheduling decisions.
You have a threadpool with X threads. You dispatch Y tasks. X of them run for 5 minutes. That means the remaining Y-X tasks are delayed by 5 minutes despite low CPU utilization.
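To make that concrete, here's a toy sketch in Rust (my own illustration; seconds stand in for the 5 minutes): X workers pull jobs off a queue, the first X jobs are long-running, and the remaining Y - X jobs sit behind them even though the machine is otherwise idle.

    use std::sync::{mpsc, Arc, Mutex};
    use std::thread;
    use std::time::{Duration, Instant};

    fn main() {
        let x = 4; // pool size
        let y = 8; // tasks dispatched
        let (tx, rx) = mpsc::channel::<Box<dyn FnOnce() + Send>>();
        let rx = Arc::new(Mutex::new(rx));

        // A deliberately naive fixed-size pool: each worker pulls jobs off the
        // shared channel until the sender is dropped.
        let workers: Vec<_> = (0..x)
            .map(|_| {
                let rx = Arc::clone(&rx);
                thread::spawn(move || loop {
                    let job = match rx.lock().unwrap().recv() {
                        Ok(job) => job,
                        Err(_) => break,
                    };
                    job();
                })
            })
            .collect();

        let start = Instant::now();
        for i in 0..y {
            let long = i < x; // the first X tasks occupy every worker
            let job: Box<dyn FnOnce() + Send> = Box::new(move || {
                if long {
                    thread::sleep(Duration::from_secs(1)); // stand-in for "5 minutes"
                }
                println!("task {} finished at {:?}", i, start.elapsed());
            });
            tx.send(job).unwrap();
        }
        drop(tx);
        for w in workers {
            w.join().unwrap();
        }
    }

If those long tasks were actually blocked on slow downstream I/O, an async runtime would park them without tying up worker threads, which is the point being made above.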
TLDR: Async code will have much lower CPU utilization compared to threaded code. An async version of a program might run just as fast as a threaded one, but it will use fewer system resources overall. The threaded version will be easier to write.
You can also have lower RAM overhead per thread if you choose a smaller stack space. Many programs will run fine with a smaller stack space, BTW.
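For example, in Rust the standard library lets you request a smaller stack per thread (a sketch; 64 KiB is an arbitrary illustrative number, and you'd have to verify your workload never exceeds it):

    use std::thread;

    fn main() {
        let handles: Vec<_> = (0..1_000)
            .map(|i| {
                thread::Builder::new()
                    .stack_size(64 * 1024) // the default reservation is usually several MiB
                    .spawn(move || {
                        // Keep per-thread stack usage shallow: no deep recursion,
                        // no large on-stack buffers.
                        i * 2
                    })
                    .expect("failed to spawn thread")
            })
            .collect();

        let sum: i32 = handles.into_iter().map(|h| h.join().unwrap()).sum();
        println!("sum = {}", sum);
    }

Note this mainly caps the virtual reservation and the worst case; physical memory is still allocated page by page as the stack is touched, as the replies below point out.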
----
Years ago I had to build a load simulator in C#. The CTO looked at me and told me that it had to simulate 100,000 clients; thus it had to be async.
He arranged for me to have a very powerful computer to run the load simulator.
I originally wrote non-blocking code. The non-blocking code had very low load at 100,000 clients, but I hit a problem with a difficult-to-understand edge case.
Because we only had a weekend to do load testing, I refactored the load simulator to be threaded. It only took me 20 minutes or so. The problem with the difficult-to-understand edge case went away, but CPU usage went up dramatically.
We had to tune the .Net framework to use a much smaller stack space.
In the end, I was able to have 100,000 threads to run the load simulator. CPU usage and RAM usage were very high, but the load simulator ran fine.
If I had more time, I would have taken the time to understand the edge case and kept the non-blocking code. The program would then have used far fewer system resources while running just as fast.
> You can also have lower RAM overhead per thread if you choose a smaller stack space.
No, this is about as low as it gets. As the author explained, "the kernel only allocates physical memory to a stack as the thread touches its pages, so the initial memory consumption of a thread in user space is actually only around 8kiB."
The smallest possible page size (on x86-64) is 4 KiB, and you can't share pages between thread stacks, [1] so you can't go below 4 KiB of physical memory usage per thread. I'm not exactly sure how the author got to 8 KiB; maybe they meant "for each userspace thread" rather than "memory used in userspace" and are counting kernel memory too. I'm pretty sure the kernel uses at least 4 KiB per userspace thread (for a stack of its own, among other overhead).
Green threads won't take you below 4 KiB either, for the same reason.
[1] Without some custom ABI that guards against stack overflow in a different way. Golang has a custom ABI (I'm not sure exactly if this is why), and interoperability with C suffers, so this isn't an approach I'd love for Rust.
Are your users going to be running your application on laptops? Will they have the same "conserve power by limiting performance" going on? If so, that is _exactly_ the environment you want to do performance work in, generally speaking.
It's about having a consistent measurement baseline. Say you run your benchmark once, then thermal throttling kicks in, then you run it again, and it takes twice as long. Is your code actually slower now? Should I wait until the fan turns off before I run it again? That data is noisy and useless. Take your measurements on a server or desktop with sane thermals and a full-size fan.
If you speed things up by 10% on your server, they'll get 10% faster on your laptop as well.
Yes, you have to be very careful with measurements, I agree.
> If you speed things up by 10% on your server, they'll get 10% faster on your laptop as well.
Depends on the speedup and techniques to achieve it. For example, speeding things up via more parallelism can lead to wall-clock improvements on servers but not laptops, precisely because the latter just end up doing more thermal throttling....
Ideally, you want to measure both ideal hardware and actual-user-hardware; often speedups on one will not be visible on the other and vice versa.
generally speaking, the advantage of async io is strongest for high performance server applications, especially in regards to the cpu usage required relative to the amount of io stuff you can do. with that in mind, "users running your application on laptops" would not be the most common case.
Yes, if your app is a high performance server app, measure in that environment.
But user-facing apps (the sort people run on laptops, say) have async I/O as table stakes, really. It's not even about throughput or CPU cycles: it's about the fact that if you have I/O latency on any thread the user interacts with the user experience will be terrible.
Now in practice maybe that means "just make the I/O async, but the performance details of that don't really matter too much".
Anyway, the overall comment was about performance profiling in general, not just async I/O.
You have to do so much more to be able to reliably measure events on the scale of nanoseconds. You need to lock C-states, disable the P-state driver, isolate CPUs, offload RCU callbacks, affinitize your tasks, enable tickless (low-tick) mode, skew the hrtimer ticks, make sure you use the TSC clocksource, set the CPU governor, get rid of vmstat updates, set the correct idle driver, disable audits and watchdogs, and much, much more.
If you only want to instrument a handful of events, yes. But for microbenchmarks that you can run for many iterations to get min/max/stddev (such as the benchmarks in the article), it's much easier. Disabling turbo is often sufficient to lower the variance far enough that old and new code are clearly distinguishable.
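As a rough sketch of what that looks like in practice (nothing from the article, just a std-only harness in the spirit of its per-iteration numbers):

    use std::time::Instant;

    fn mean_and_stddev(samples: &[f64]) -> (f64, f64) {
        let n = samples.len() as f64;
        let mean = samples.iter().sum::<f64>() / n;
        let var = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
        (mean, var.sqrt())
    }

    fn main() {
        let iterations = 10_000;
        let mut samples = Vec::with_capacity(iterations);

        for _ in 0..iterations {
            let start = Instant::now();
            // ... the operation under test goes here ...
            std::hint::black_box(42u64.wrapping_mul(7));
            samples.push(start.elapsed().as_nanos() as f64);
        }

        let (mean, stddev) = mean_and_stddev(&samples);
        println!("mean {:.1} ns per iteration, stddev {:.1} ns", mean, stddev);
    }

With turbo disabled and enough iterations, the stddev column usually tells you quickly whether a difference is noise or a real change.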
Keep in mind that a new async task doesn't create a new thread. So yes, "not creating a new thread" is 3x faster than "creating a thread". If the app layer can context-switch using language-level constructs and do cooperative switching, then yes, one gets the 3x benefit. IMHO, whether the async executor and scheduler are performant enough to manage the tasks is what one should actually worry about.
You're only the second commenter on this thread to notice this.
The benchmark compares fibers to threads and has little to do with Rust. You will see the same numbers for a well-implemented fiber library in C++, Java, or most other compiled languages.
The title is completely misleading, especially for most people who are not aware of this important distinction.
I'm confused. If many async tasks are run on a single thread, what does the thread do when it is blocked waiting for things to happen? Does it sleep? If so, a context switch takes place anyway. If not, what is the impact on GUI applications? If I have a main thread managing my GUI, should I spin up a new thread to run my async tasks?
A modern microcontroller/microprocessor is inherently event driven (for example, on ARM, at the very bottom of the call stack there is a wait-for-event (WFE) or wait-for-interrupt (WFI) instruction).
If async needs to be polled to run ("Futures are inert in Rust and make progress only when polled"[1]), does that mean my processor has to stay busy running these async tasks instead of waiting (WFE or WFI) as it would after a native call to one of the operating system functions (i.e. recv() on a socket)? What is the impact on embedded battery-powered systems?
Polling is only the logical description. In reality, the given task is just marked to be woken up later: at some later point, while the same OS thread is executing something else, the executor determines that the idling task can be woken up. "Waking" is nothing but that same OS thread switching to execute whatever it is waking up.
The main idea is that a scheduler/executor at the runtime/language level, one that knows about the state of the program, (a) can save and restore far less state than an OS context switch does, and (b) being cooperative, doesn't pay the cost of lots of unnecessary preemptions.
But there is the poll() function, which returns either the result of the operation or Pending. So it's more than just logical, correct? I mean, if I (or the executor) don't call poll(), nothing happens...
> OS thread now switching to execute whatever it is that it is waking up.
This is what confuses me. As I see it (and what I understand from reading), async/await splits a routine into a (very smart) state machine.
I assume that there is no magic underneath. I mean, I could build the same state machine by hand if I wanted to, within the constraints of what the OS makes available for context switching (APIs for waiting and synchronizing).
For an (OS/native) thread that has to wait for data on a socket, you basically have two options: block in recv(), or poll recv() without a timeout.
Waiting on recv() would block (so no other code in my thread can run while waiting), so I guess the state machine needs to poll recv() instead (I believe this is what this[1] example does).
In order not to block my thread, the executor either spins up its own thread or has to wait for my thread to poll() it.
in rust, there is no built-in runtime, so it depends on which one you are using. the runtime (e.g. tokio) is responsible for polling the future.
for network io, behind the scenes this is most likely using epoll system calls. epoll mitigates the context-switch problem in a few ways, mostly because there is only one stack context to notify about new io events, instead of many.
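To make the poll/wake contract from the last few comments concrete, here's a minimal hand-written future (using the futures crate's block_on purely as an illustrative executor; Tokio drives futures the same way):

    use std::future::Future;
    use std::pin::Pin;
    use std::task::{Context, Poll};

    struct YieldOnce {
        polled: bool,
    }

    impl Future for YieldOnce {
        type Output = ();

        fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
            if self.polled {
                // Second poll: the operation is "ready", so return the result.
                Poll::Ready(())
            } else {
                self.polled = true;
                // Not ready yet: register the waker so the executor knows to
                // poll us again, then return Pending. The future does nothing
                // in between; it's just inert state.
                cx.waker().wake_by_ref();
                Poll::Pending
            }
        }
    }

    fn main() {
        // The executor drives the future by calling poll() until it's Ready.
        futures::executor::block_on(YieldOnce { polled: false });
        println!("future completed after being polled twice");
    }

Nothing runs between the two polls; the waker is how "I/O is ready" (e.g. an epoll event inside the runtime) gets translated into "poll this task again".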
Better or equal in all the ways measured. But some things aren't measured, maybe because they're obvious to the author or because they're harder to quantify.
* Rust's async ecosystem [1] adds a lot of complexity over simple threaded code.
* Rust's async ecosystem doesn't interoperate as easily with C libraries written in a simple threaded way. (And it's debatable which interoperates more easily with C libraries written with a different event loop.)
* async tasks can't be preempted, so concurrency will fall off a cliff if they run on O(cpus) threads and involve long-running computations or accidental blocking (the usual workaround is sketched after this comment).
I think it's reasonable to ask whether these numbers are enough of an improvement to justify all that, particularly given the disappointing "this advantage goes away if the context switch is due to I/O readiness".
And to go back and argue pro-async for a moment, io_uring might eliminate that disappointing caveat.
Then again, on the pro-thread side, there's Google's interesting fibers model that might solve some of these performance issues. [2] Also, "~17µs for a new kernel thread" is the wrong number, since you can avoid that cost with a simple thread pool.
Personally I think some things are better written as async, but it's a mistake to impose it on everything. For example, if you're writing a web app in Rust, I think you're usually better off writing threaded request handlers and having a mechanism for them to interact with the async hyper code. The hyper code is better off as async because an Internet-facing server might have an enormous number of connections in keepalive state.
[1] or maybe I should say ecosystems, plural, given the current tokio vs async-std divide.
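On the third point above, the usual escape hatch (sketched here, assuming Tokio 1.x with the full feature set) is to push long-running or blocking work onto the runtime's dedicated blocking pool:

    #[tokio::main]
    async fn main() {
        let digest = tokio::task::spawn_blocking(|| {
            // CPU-heavy or blocking work runs on Tokio's blocking pool, so it
            // can't starve the async tasks sharing the worker threads.
            (0u64..10_000_000).fold(0u64, |acc, x| acc.wrapping_add(x * x))
        })
        .await
        .expect("blocking task panicked");

        println!("digest = {digest}");
    }

That keeps the O(cpus) worker threads free for cooperative tasks, at the cost of having to remember to do it, which is exactly the kind of added complexity the list above is pointing at.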
If you elide a bounds check from a function but still spend a billion cycles in a loop, you've made your code run ever so slightly faster but gained nothing in the big picture.
It sounds to me like comparing apples and oranges, though. Parallelism (threads) and concurrency (async in Rust) are not the same thing and can actually be used in combination.