I am not surprised that the cost of context switching due to I/O readiness can often be roughly equal between async tasks and kernel threads. Normal blocking I/O can be surprisingly efficient because of various factors, such as a reduced need for system calls.
Think about it this way—if you have a user-space thread which wakes up due to I/O readiness, then this means that the relevant kernel thread woke up from epoll_wait() or something similar. With blocking I/O, you call read(), and the kernel wakes up your thread when the read() completes. With non-blocking I/O, you call read(), get EAGAIN, call epoll_wait(), the kernel wakes up your thread when data is ready, and then you call read() a second time.
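For concreteness, the two sequences look roughly like this (a minimal sketch using the libc crate on Linux; fd is assumed to be an already-open socket, epfd an already-created epoll instance, and error handling is omitted):

    use libc::{epoll_ctl, epoll_event, epoll_wait, read, EAGAIN, EPOLLIN, EPOLL_CTL_ADD};

    // Blocking I/O: one syscall; the kernel parks the thread until data arrives.
    unsafe fn blocking_read(fd: i32, buf: &mut [u8]) -> isize {
        read(fd, buf.as_mut_ptr().cast(), buf.len())
    }

    // Non-blocking I/O: read() -> EAGAIN -> epoll_wait() -> read() again.
    unsafe fn nonblocking_read(fd: i32, epfd: i32, buf: &mut [u8]) -> isize {
        loop {
            let n = read(fd, buf.as_mut_ptr().cast(), buf.len());
            if n >= 0 || *libc::__errno_location() != EAGAIN {
                return n; // got data, or a real error
            }
            // Not ready yet: register interest and park the thread in
            // epoll_wait() instead of read(). (A real event loop registers
            // the fd once and reuses the epoll instance for many fds.)
            let mut ev = epoll_event { events: EPOLLIN as u32, u64: fd as u64 };
            epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &mut ev);
            epoll_wait(epfd, &mut ev, 1, -1);
        }
    }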
In both scenarios, you’re calling a blocking system call and waking up the thread later.
Of course, there are scenarios when epoll_wait() returns multiple events, which reduces the number of context switches. But the general result is that it’s not always easy to beat blocking I/O and kernel threads.
Google found that the main cost of context switching isn't really in the syscall boundary but in the task scheduling. That's why their Linux fork has optional userspace scheduling of kernel threads via the switchto syscalls[0]. Essentially, if your thread already knows which thread should run next, it can context switch to it without having to go through the kernel scheduler, which is exactly the situation in these bucket brigade benchmarks.
This benchmark as written is probably underestimating kernel task scheduling cost since only 1 task is runnable at any 1 time, while a realistic multi-threaded system will have more runnable threads to juggle.
Yes, the point of "async" isn't to save CPU cycles, it's to customize the scheduler so that you can prioritize resource use properly.
(E.g., don't switch to the systemd or sshd thread if a customer's web request is timing out.)
That said, doing this right is out of reach of the average programmer, and it's doubtful that the compiler has enough domain-specific knowledge to do this automatically.
"Async" of the Python and node.js fame is yet another thing, a hack to get around their interpreters' inability to use kernel multitasking features because of global locks.
Linux is likely many years from having anything approaching a fully asynchronous system call interface, if anyone were willing to work on it (io_uring makes a huge dent but I don't think it's intending to reimplement everything). Even where async kernel interfaces exist, without a rework of the kernel-internal implementation there is often still a need for a thread for the kernel side to execute on. For example, IIRC this is true for swathes of the vfs implementation at present.
So the whole thing is a bit of a false equivalence. Better interfaces that reduce context switches are desirable, but even where they exist you are often just substituting a user thread for a kernel one, and in the general case there are likely always going to be system interfaces that never make it into the brave new world. Take SysV IPC, for example (a 1975-era API): it seems doubtful anyone would put the effort into making it async, but there will probably still be times when you want to consume those interfaces for compatibility or some other obscure reason.
Also consider the case where a user program needs some substantial thread pools of its own; in some scenarios, reusing resources that must already exist in user space and live in warmed caches makes more sense. Neither async nor Linux threads is "better"; it will always depend on the particular use case, and even then the right answer might well be some combination of both.
> io_uring makes a huge dent but I don't think it's intending to reimplement everything
At this point its proponents are being pretty unapologetic that it will, in fact, reimplement every part of the syscall interface that is actively used.
Then why is it that IO-heavy benchmarks such as the Techempower web benchmark are dominated by async frameworks? The fastest results there are all from async frameworks [1].
And among Rust frameworks the same pattern holds: the fastest Rust frameworks are async, while a synchronous framework such as Rocket is about 20x slower.
Those benchmarks measure one very specific scenario: serving lots of small requests concurrently. Async handles that well because that's exactly the scenario where a single epoll_wait() call will return lots of events.
Presumably the difference would be smaller, or for some frameworks even negative, if each request did some actual and not entirely predictable amount of CPU work (e.g. executing some HTML templating with varying levels of output and perhaps compression), did much more work in general and used more memory (so the memory overhead is proportionally less relevant), and if the benchmark implementations were not permitted to tune exactly for the workload and system (i.e. so that generalized scheduler defaults are used on both the kernel and userspace side). In other words, a more real-world scenario with all the normal complexities, inefficiencies, and development-time constraints.
But yeah, it'd be super interesting to actually see that demonstrated - that'd be quite a lot of work, however.
I doubt you'll find anything as comprehensive and well-presented as the TechEmpower benchmarks, because their particular scenario is one that a lot of frameworks care about competing on (partly because it's difficult enough to be interesting). But I'd expect any benchmark for batch-style processing of large volumes of data would show that.
If your requests are huge. For example, imagine you need to read many huge files into memory.
Whether you read one file after the other sequentially or try to read all of them concurrently won't make a difference, because your disk/RAM bandwidth is going to be the bottleneck anyway.
Trying to do this concurrently requires more work that won't pay off, so it might actually be slower.
Benchmarks are not everything, and the difference between asynchronous and synchronous operation is not the only thing the benchmark is testing (each of these frameworks appears to have its own system for parsing and representing HTTP requests). You should know what usage patterns YOUR application sees, understand the relative cost of engineering time and CPU time for YOUR application, and do tests in YOUR environment.
I'd argue that it's because even though blocking IO is cheaper, it's very difficult to maximise performance in a multithreaded/concurrent context.
You could make faster code with it, but I wouldn't want to maintain it, and you'd have to throw an obscene number of man-hours at it to get that performance.
I mostly agree with you (not the least of which is that blocking I/O is a damn fine API), but the reason that people use async I/O is to have lots of outstanding requests. Typically you would use select (or similar) to service whichever one responds first. That way you can multiplex many I/O streams onto a small number of threads. If threads are memory-intensive, you almost certainly have to do this.
Well, it can, but not always. Remember that if you’re waiting for an event to arrive, that generally involves a syscall, the thread being put to sleep, and then the thread being woken up. Any time you’re doing that, think, “Could I just replace this polling system with a call to read()?”
What io_uring does do is provide a way to poll without needing to wait, but if you haven’t received new events when you poll, you’re not on the fast path any more. Whether you are often on the fast path for io_uring will depend on the particulars of your application and its I/O patterns.
> What io_uring does do is provide a way to poll without needing to wait, but if you haven’t received new events when you poll, you’re not on the fast path any more.
Isn't "not on the fast path any more" a bit absolutist? io_uring's "slow" path is roughly one syscall per iteration, right? That's still many fewer syscalls than one syscall per IO operation (or more if any return EAGAIN/EWOULDBLOCK) as you'd be doing without it. I'm not sure I really care about eliminating that last syscall per iteration; it seems minor in comparison.
Where one iteration and one io_uring_enter() syscall can be submitting hundreds of I/O operations (and you can even run the ring buffers with the kernel set to poll, so you can do zero syscalls if that's not already enough).
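For illustration, that batching looks roughly like this with the io-uring crate (a sketch: fd is assumed to be a regular file read in consecutive chunks, the buffer count must fit in the ring, and exact builder methods may vary slightly between crate versions):

    use io_uring::{opcode, types, IoUring};
    use std::os::unix::io::RawFd;

    // Queue one read per buffer, then submit the whole batch with a single
    // io_uring_enter() syscall.
    fn batch_read(fd: RawFd, bufs: &mut [Vec<u8>]) -> std::io::Result<()> {
        let mut ring = IoUring::new(256)?;
        let chunk = bufs[0].len();

        for (i, buf) in bufs.iter_mut().enumerate() {
            let sqe = opcode::Read::new(types::Fd(fd), buf.as_mut_ptr(), buf.len() as u32)
                .offset((i * chunk) as _)
                .build()
                .user_data(i as u64);
            unsafe { ring.submission().push(&sqe).expect("submission queue full") };
        }

        ring.submit_and_wait(bufs.len())?; // one syscall for the entire batch

        for cqe in ring.completion() {
            println!("read #{} returned {}", cqe.user_data(), cqe.result());
        }
        Ok(())
    }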
> Where one iteration and one io_uring_enter() syscall can be submitting hundreds of I/O operations (and you can even run the ring buffers with the kernel set to poll, so you can do zero syscalls if that's not already enough).
Same for the blocking case. If I do a syscall to read a whole file, it's just one syscall creating millions of I/O operations.
Sure, but that's all you'll ever do with the blocking case: one syscall at a time, while your program sits and does nothing with the CPU, whereas with io_uring you can at least do CPU work while you wait on your IO. So even ignoring the IORING_SETUP_SQPOLL option that requires no io_uring_enter() syscall, a basic usage of io_uring is still going to be faster.
io_uring is a bicycle for IO, and you can ride it as fast as you want to. But it's apples and oranges to blocking IO, which is always stuck in first gear.
> while your program sits and does nothing with the CPU
The CPU can run other threads while the hardware does DMA transfers. The thread just yields when the transfer is started, and a hardware interrupt wakes it up when the DMA transfer finishes.
Sure, but we're comparing the efficiency of one of your program's single threads, because otherwise you could take that same argument you just used and turn it around and say fine, just run another thread then with another io_uring... and you're still ahead. You have to compare at the smallest unit of control plane.
At the same time, multiple threads for a single program introduce context switches which are becoming horrendously expensive compared to the sheer number of IOPS that modern NVMe SSDs can do.
Thread-per-core designs built around io_uring are the future of IO on Linux.
io_uring’s slow path is making one blocking syscall every time you would ordinarily make a blocking syscall.
I am a bit baffled how this could possibly be considered an “absolutist” viewpoint—I am just saying that there exist scenarios where io_uring is not helpful. This should be uncontroversial.
> io_uring’s slow path is making one blocking syscall every time you would ordinarily make a blocking syscall.
That's not correct, io_uring was "absolutely" designed, at least in the technical sense, for zero syscalls in the slow path (if you want to):
IORING_SETUP_SQPOLL
    When this flag is specified, a kernel thread is created to perform submission queue polling. An io_uring instance configured in this way enables an application to issue I/O without ever context switching into the kernel. By using the submission queue to fill in new submission queue entries and watching for completions on the completion queue, the application can submit and reap I/Os without doing a single system call.
You mean when there's only one thing to do per iteration? I'd describe that as when mostly idle. As the system gets more loaded, the one syscall per iteration matters less and less.
io_uring is specifically designed so that zero system calls are necessary while the system is busy. Userspace and the kernel both update ring buffers, and ring buffers can be checked and drained without entering the kernel at all.
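Setting that mode up is only a few lines with the io-uring crate (a sketch; note that older kernels require elevated privileges for SQPOLL):

    use io_uring::IoUring;

    // With SQPOLL, a kernel thread polls the submission ring, so the
    // application can submit and reap I/O purely through the shared ring
    // buffers, with no io_uring_enter() syscall while the poller is awake.
    fn sqpoll_ring() -> std::io::Result<IoUring> {
        IoUring::builder()
            .setup_sqpoll(2_000) // poller thread idles out after ~2s without work
            .build(256)
    }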
io_uring doesn't mean async though. You can also use it for blocking batch execution of syscalls. E.g. when you need to stat hundreds of files or wait for several child processes at once. So with some batch-oriented convenience wrappers it can help threaded code too.
> People often see that there's some theoretical benefit of async and then they accept far less ergonomic coding styles and the additional bug classes that only happen on async due to accidental blocking etc... despite the fact that when you consider a real-world deployed application, those "benefits" become indistinguishable from noise. However, due to the additional bug classes and worse ergonomics, there is now less energy for actually optimizing the business logic, which is where all of the cycles and resource use are anyway, so in-practice async implementations tend to be buggier and slower.
I disagree with this. I feel that async programming is actually much more powerful and expressive than threaded programming, especially with Rust combinators on streams of futures (for example: futures_unordered), which allow you to trivially express complex concurrency patterns (such as: wait for the first two requests to return something and discard the third request's response, and by the way also cancel that request). Async programming also allows for structured programming, where each task is an owned resource of a parent task, which means that lifetimes of tasks can be controlled and runaway threads can't exist (if one is avoiding tokio::spawn). I've been developing [Garage](https://git.deuxfleurs.fr/Deuxfleurs/garage) for some time now (a simple distributed object store that implements a subset of S3, not ready for production!), and I've been in awe at how easy it was to write these complex patterns using async Rust.
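As an illustration of one of those patterns, "keep the first two responses, cancel the rest" is only a few lines with FuturesUnordered (a sketch over generic request futures, not Garage's actual code):

    use futures::stream::{FuturesUnordered, StreamExt};

    // Run the given request futures concurrently, keep the first two
    // responses that arrive, and cancel the rest by dropping them.
    async fn first_two<F: std::future::Future>(requests: Vec<F>) -> Vec<F::Output> {
        let mut in_flight: FuturesUnordered<F> = requests.into_iter().collect();
        let mut responses = Vec::new();
        while let Some(resp) = in_flight.next().await {
            responses.push(resp);
            if responses.len() == 2 {
                break; // dropping `in_flight` here cancels the remaining request(s)
            }
        }
        responses
    }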
The way you have quoted this suggests it is being said by the author, whom you are disagreeing with. It's not: it's being said by someone else (in an issue filed against the repo), and the author also mostly disagrees.
Sorry, yes, this is not from the mouth of the author; however, he seemed to agree with the premise that async/await is unergonomic and that performance is its only reason to exist, which I am trying to dispute (at least in the context of how Rust does it, which is much better than the JS version, for instance).
Indeed. I read it as something written by the author. Double-checking revealed it was written by spacejam, who has posted the same argument over and over here on HN.
It seems to me that these are two orthogonal topics. One thing is how you represent tasks, either using OS threads or async tasks. And the other is how you structure concurrency. Maybe I'm missing something, but I think there is nothing preventing the use of those structured concurrency patterns using OS threads as the base for tasks. Then you get some nice benefits of doing this such as proper stack-traces and easier debugging.
The killer use case for async tasks is when you need hyper-concurrency, e.g. hundreds of thousands of concurrent tasks. In that case, as the article mentions, you can't use OS threads anymore. Of course there are some use cases requiring this level of concurrency (messaging servers come to mind), but there are also many, many use cases where you need a lower level of concurrency, like a few hundred concurrent tasks max. In those cases I think using OS threads can work pretty well, with less complexity.
The advantage of async tasks for structured concurrency lies in task cancellation, which is intrinsically linked to the notion of "task ownership". If you are using an OS thread to offload some task, and then realize that you don't need that task's result anymore, your safest bet is to let the thread run until the end and then discard the results it produces. Other options include adding custom cancellation logic to the thread and remembering to call it at the appropriate time. Nobody checks that you are doing this correctly, which means you may leak resources such as the thread's memory or a TCP connection. On the other hand when using async/await in Rust, the fact of owning a future (i.e. owning the promise that will return you the value when it's done) implies ownership of the task's resources, such as memory, file descriptors, or TCP connections. Dropping the future before it completes means that the task will stop and all resources will be freed/closed immediately, and this is checked statically by the compiler.
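A small sketch of that cancellation-by-drop, using Tokio's timeout as the thing that drops the future (slow_query is a made-up stand-in for work that owns resources):

    use tokio::time::{sleep, timeout, Duration};

    // Stand-in for a task that owns resources (sockets, buffers, ...).
    async fn slow_query() -> &'static str {
        sleep(Duration::from_secs(10)).await;
        "result"
    }

    #[tokio::main]
    async fn main() {
        // If the future has not completed after 100ms, `timeout` drops it.
        // Dropping the future is the cancellation: its resources are freed
        // immediately, and ownership of them was checked by the compiler.
        match timeout(Duration::from_millis(100), slow_query()).await {
            Ok(result) => println!("finished: {result}"),
            Err(_) => println!("cancelled by dropping the future"),
        }
    }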
POSIX thread cancellation has existed with defined (though complex) semantics for ages. It's a ginormous ugly mess, but it is an alternative to run-to-completion or custom logic.
Anywhere you have an .await in async code you could have a checkpoint in a thread that allows for cancellation. That's the main cancellation advantage - that the author is forced to write those to consume other async functions.
So one of the things I realized writing Erlang is that when concurrency is 'free' (or so close as to be indistinguishable in most use cases), more things end up being easy to write concurrently than we traditionally think.
An instance I ran into personally was, effectively, task scheduling. Sure, I could have done the 'normal' thing, of a priority queue being populated from the database on some interval, having some thread reading from that queue, sleeping until the first item needs work, pulling it off, throwing it onto a threadpool. Have to take care to ensure the threadpool is large enough for the maximum amount of concurrency I need, have to make sure that I'm careful in what data structure I use for the priority queue (I need to make sure I'm not adding the same task multiple times to it, and that when adding items to it I'm not locking it), make sure the polling thread can't throw (or at least, when it does, it restarts or kills the program and that then restarts), a few other niggles here and there too. And a whole 'nother level of complexity if tasks lead to follow up tasks (i.e., a task represents a state machine through a series of transitions, which themselves take a sizable amount of time, to where just leaving them on the thread is a bad idea, since it uses up the threadpool).
In a 'free concurrency' world, I just spin up a new concurrent process per task for some window (same as how many items I added to the priority queue). And that's basically it. Each process can step through its state machine, sleeping in between tasks for however long, without issue.
I think the more important aspect of that quote is just about performance vs. code. I see many cases where people are hyperoptimizing on whether or not their framework consumes 2 or 15 microseconds per request when the work they are going to do takes 100 milliseconds.
If you like the async style better, then fine, use it. Sometimes you win like that, where the thing you like better is also faster. But don't worry so much about the performance.
Web frameworks is another place I see this a lot. Crossing the streams, if you've got an incoming web request, unless your framework somehow consumes and discards the web headers, a real web request is already many kilobytes just to represent the incoming headers by the time it gets to your handling code. Using async because it has ~200 bytes per task vs a thread allocating 10K out of the box at that point doesn't make much difference because the HTTP request itself is blowing out the difference.
The spread in orders of magnitude between what is expensive and what is not has gotten so significant on modern systems that you can easily get developers sitting there optimizing nanoseconds while throwing away seconds. The old-school assembly-style premature optimization where we're trying to save every bit and cycle has mostly passed away, but its replacement seems to be this: frantically benchmarking how many millions of requests per second some framework or feature can handle, as if it matters when your code is going to take 500ms.
A lot of web requests take tens of milliseconds due to the latency of speaking to the database. The ability to fire several requests at the database instead of serialising them is one of those optimisations that you really can't do in a threaded model without introducing other asynchronous components.
You have a good point about HTTP request size.
But any high-performance framework would not buffer the whole request to parse it later; it will do incremental parsing.
Meaning you don't need to read all the bytes from the TCP socket before deciding which route to take.
And the handler for that route is given a stream object and will just read as many bytes as it needs.
Speaking of futures_unordered and similar patterns, I think a part of the "async promise" that has failed is the lack of concurrency for a single user request by default in most languages.
That is, the 'easy' path is to write code such as the following (in vaguely C# pseudocode):
var p = await GetUserPermission( username );
var c = await GetServerConfig();
var m = await GetMessageOfTheDay();
Assume each await call is potentially an expensive SQL query or REST API call.
The problem with that is that this is strictly sequential, synchronous code that is merely "dehydrated" and "rehydrated" to reduce overheads during the waiting periods. It is strictly slower when executed on a server that is not very busy! It must be, because it does the exact same work in the exact same order as the ordinary synchronous version, except now with extra state machinery and complex error handling woven throughout by the compiler.
Scalability is not everyone's concern. Scalability is for the FAANG sized companies. I care about the individual user experience, and async does nothing for that by default.
I mean, sure, you can write much more verbose code along the lines of:
var p_t = GetUserPermission(username);
var c_t = GetServerConfig();
var m_t = GetMessageOfTheDay();
await Task.WhenAll(new Task[] { p_t, c_t, m_t });
var p = p_t.Result;
var c = c_t.Result;
var m = m_t.Result;
But no one does this, for some values of no one. I've never seen code like this in the field.
In fact, let's test this. I'm reviewing an asynchronous ASP.NET application developed in 2020 right now. It's a large app, with literally thousands of uses of the "await" keyword, at least 3500 files use it.
The only uses of "Task" static methods are seven calls to FromResult(). That's it. Zero uses of WaitAll(), WaitAny(), or ContinueWith()!
This is typical.
It's not that asynchronous programming is hard, it's that it is unergonomic to gain a latency benefit out of it. Most applications need lower latency, not higher throughput. Hence, for most programmers, most of the time, asynchronous programming is next to useless. It's just extra noise and more failure modes.
That .NET syntax using Task.WhenAll seems quite bad, which might be part of the reason why not many people bother (disclaimer: I don't do C# or ASP.NET). In Rust it would be:
let (p, c, m) = join!(
    GetUserPermission(username),
    GetServerConfig(),
    GetMessageOfTheDay()
);
(you don't even have to write await when using the join macro)
With such simple syntax available it seems obvious to me that one would want to use it as often as possible, and it's also much simpler (and probably cheaper) than dispatching those three tasks to a thread pool.
The original example by me did not assume that asynchronous functions all return the same result type.
Most opportunities for concurrency are between unrelated tasks (because related tasks often have dependencies between them). Unrelated tasks tend to have unrelated return types.
When tasks are unrelated, you also likely don't need them all at the same time for the next stage of the pipeline. You can simply await each one when you need it.
var p_t = GetUserPermission(username);
var c_t = GetServerConfig();
var m_t = GetMessageOfTheDay();
function_to_call1(await p_t, await c_t);
function_to_call2(await m_t);
This does not look any more complicated than a non-async function. Not sure how this example justifies your claims.
Besides, even if your example is valid, the usage of Task.WhenAll has nothing to do with your claims either. The use of async/await is primarily about scalability; being able to make several network calls concurrently is not the major concern. Even if you await each async call, you still achieve better scalability, because threads won't be blocked on async calls and can work on something else.
I guess I am that no one. I come at all this from writing queues from scratch and using threads or processes for concurrency. I also had a lot of fun writing my own networking hot loops with select/poll/epoll/kqueue when my work needed it, so I guess I am extra sensitive to making concurrent things actually concurrent. But I would not dream of making three independent requests like that sequentially. There are other patterns you can use besides waiting for all tasks to finish, especially if you can do some processing after the first ones are done, but all in all, why wouldn't you make them concurrent, aside from liking seeing await/async all over the place?
All you have to do is wrap multiple futures into a single one and then await on the combined one. There is no programming language on earth that can prevent this.
My teams/company uses it all over, so maybe depends on the context you work in?
And FWIW, this explicit form is often unnecessary - if you kick off each task, they will run in parallel, and you then await each task only when its result is needed; it can look a lot cleaner:
var p_t = GetUserPermission(username);
var c_t = GetServerConfig();
var m_t = GetMessageOfTheDay();
var foo = isAuthorized(await p_t);
// more code here
var msg = (await c_t).ServerName + await m_t;
True, this doesn't work in Rust though, because nothing at all happens before the first time you poll a future, so you need an explicit task (but as others pointed out, it's pretty straightforward thanks to the `join!` macro).
I write this kind of stuff all the time because parallelizing long-running tasks without dependencies is one of the easiest wins when it comes to wall-time.
But this kind of optimization is somewhat orthogonal to async/await. You don't need fine-grained async to optimize long-running tasks, you could just throw a bunch of closures into a threadpool for that purpose. Async only makes sense when you're interleaving thousands of tasks with readiness/completion based IO.
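For instance, the three-call example from upthread needs no async machinery at all if you're happy to park three OS threads for the duration (a sketch with made-up blocking functions):

    use std::thread;

    // Hypothetical blocking versions of the earlier calls.
    fn get_user_permission(user: &str) -> String { format!("perms for {user}") }
    fn get_server_config() -> String { "config".into() }
    fn get_message_of_the_day() -> String { "motd".into() }

    fn main() {
        // Scoped threads: the three independent calls run in parallel and are
        // joined before the scope ends. No async runtime involved.
        let (p, c, m) = thread::scope(|s| {
            let p = s.spawn(|| get_user_permission("alice"));
            let c = s.spawn(get_server_config);
            let m = s.spawn(get_message_of_the_day);
            (p.join().unwrap(), c.join().unwrap(), m.join().unwrap())
        });
        println!("{p} / {c} / {m}");
    }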
It's rare that I personally write this kind of optimisation in web apps, but I quite often do for backend processing services (and then, only for "embarrassingly async" operations such as hitting a database or HTTP API).
You don't need to create a Task[] because WhenAll is set up for varargs. This is fine:
await Task.WhenAll(p_t, c_t, m_t);
Or you can just await the threads before you need them. They're already started and running at this point.
You also probably want to avoid using Result and just await the completed task for the nicer unwrap syntax. Plus, you don't want to get into the habit of using Result, as it's a blocking call. Same with WaitAll and WaitAny - ideally you would never use those. ContinueWith is also not really needed if your style is to use the plainer await syntax. Those methods are more to bridge blocking and async code, so an async-from-the-start app might use async extensively and never use those methods.
I have written a few programs like this - but not in languages which have async/await! In languages with manual async, getting here by refactoring is fairly easy.
I've been using C# for around 20 years, basically since it was first released.
I never personally had any issue with working with threads and locks, finding it simple enough to reason about them, though I understand lots of people felt differently. When async/await first came to C# around 10 years ago, I grumbled because I didn't see the point; I found it much harder to reason about the flow of code, and initially at least, stack traces were a shitshow (things are much improved, but there is still a lot of cruft in async stack traces).
But async/await was heavily pushed, and "real" threading is almost relegated to the sidelines for most developers. Although having said that, I find that junior devs in particular really struggle to really grok async/await.
Anyway, several more years on, and I have mixed feelings about async. Because Microsoft has gone all-in on async/await, I think it's really easy to work with when building web apps and APIs with ASP.NET Core/MVC - there is barely any "developer overhead" at all, really. Web apps very often hit things like HTTP APIs and databases, and with how easy it now is, there is little reason not to use async/await. Yes, for small loads there is a tiny performance loss due to the runtime setting up async state machines, but it really is almost always completely insignificant - even moreso with the advent of ValueTask, and again more recently with pooled ValueTasks. Yet the gains can be tremendous.
But for non-web apps/APIs, I feel differently. I spend a lot of time writing server-side processing services, and things like Windows services for desktops (in the infosec space), and I've gone all-in on async/await because Microsoft has gone async-first. Hell, a lot of stuff is async only now, so unless you want `.GetAwaiter().GetResult()` everywhere, you have little choice. Anyway, these systems are more complex than web apps, because with web apps, most of the real complexity is hidden away in the framework. But here you have to deal with work queues, caching, pooling, serialisation etc all by yourself. And with async/await, it can be hard to reason about the flow of code, and it's really easy to break things in ways that are really painful to diagnose. And it means that every.single.stacktrace contains async cruft that you need to sift through. Which is not fun.
Anyway, this is much longer than I meant, but my conclusion is that I'll continue to use async/await for web apps and REST APIs (because, why not), but for services, I'm going back to the threadpool, green threads and synchronization primitives, and only using async/await in a limited way where it provides clear value - not async all the way down from the entrypoint.
> Anyway, this is much longer than I meant, but my conclusion is that I'll continue to use async/await for web apps and REST APIs (because, why not), but for services, I'm going back to the threadpool, green threads and synchronization primitives, and only using async/await in a limited way where it provides clear value - not async all the way down from the entrypoint.
AFAIK, .NET doesn't support "green threads" and they have repeatedly confirmed that there are no plans to do so. Additionally, the M:N threading model has serious interop issues, as is evident in Go, which is a no-go for systems languages. Personally, I don't see a need for green threads, since kernel threads are fast enough and don't use as much RAM as people tend to believe. And when they're not enough, sure, go async/await.
I find your comment about stack traces a bit weird: of course, when all your work is sequential and you only use threads, you get a nice stack trace for free, whereas async stack traces need a lot of support from the tooling.
But most of the time you use not only threads but also several synchronization primitives (locks, channels, etc.), and when doing so, as far as stack traces go you are in an even worse situation than what async stack traces give you ("some thread changed this shared-memory value and now it's not what you expected, but you have no easy way to know which one did it and when, good luck").
Maybe if you spray threads around at random :), but in real-world use I find it much easier to pinpoint where the problem occurred, and the path taken to get there. Also, at least with threads you can get the thread ID and/or name.
Regarding shared, mutable state - if multiple async "threads" can access that state, then you still need to guard it, but usually with an async-capable means.
> Regarding shared, mutable state - if multiple async "threads" can access that state, then you still need to guard it, but usually with an async-capable means.
Sometimes, but not as often, because the scope of your async function is often the only “shared state” you need.
async/await is for the concurrent stuff and threads are for the parallel stuff. Two different things. If your code is I/O-bound, use async/await. If your code is processor-bound, use threads.
Async/await paradigms exist in several languages, but with C#, async/await is generally considered the "modern" and unified way to handle both IO bound and CPU bound tasks.
The runtime will generally schedule IO bound tasks to run on the threadpool.
> The runtime will generally schedule IO bound tasks to run on the threadpool.
Well, that's not correct. Unless you explicitly call Task.Run or Task.Start (or other similar methods) no new thread is created. The compiler generated state machines don't require the threading mechanism to work. In fact the overhead for async/await is mostly the extra code generated for the state machine and error handling. At runtime, there's no thread switching overhead.
Yes, I meant using Task.Run; I was simplifying, as I'd assumed (wrongly) you were familiar with async/await from another language.
Otherwise, from memory, the runtime spec doesn't actually guarantee that await won't run on a threadpool thread - it will under certain circumstances.
And then there are further nuances if there is a synchronisation context and ConfigureAwait(false) is used, as the continuation will be scheduled on a threadpool thread.
> Async programming also allows for structured programming, where each task is an owned resource of a parent task, which means that lifetimes of tasks can be controlled and runaway threads can't exist
Arguably, structured concurrency as described above is easier to obtain when using threads as your underlying mechanism, because the vast majority of code is serial[1]. That there are a handful of critical regions where you want to express concurrency relationships doesn't mean we have to discard threads. That's throwing the baby out with the bath water.
Self-promotion: I had stumbled on the idea of "nurseries", independently and many years before the above were published. See https://github.com/wahern/cqueues It's nominally a non-blocking "threading" API for Lua. (In Lua coroutines are also called threads.) But note the plural, continuation queues. It's trivial to instantiate a queue, which is similar to a nursery. This was by design. Many cqueues projects naturally end up with a tree of thread controllers/schedulers. It doesn't work on Windows (yet) because it relies on the fact that kqueue, epoll, and Solaris Ports descriptors can be recursively polled.
> Async programming also allows for structured programming, where each task is an owned resource of a parent task, which means that lifetimes of tasks can be controlled and runaway threads can't exist (if one is avoiding tokio::spawn).
That's unfortunately far less reliable in practice than it seems on the first glance: You might never know whether any async function you call spawns something else, or makes use of `spawn_blocking`, `block_in_place` or any other function which isn't a pure state machine.
If you try to cancel any of those, you will get either excessive blocking or end up with runaway tasks.
A better solution for this is real support for structured concurrency, as available in Kotlin, Python Trio and now coming to Swift async functions. This doesn't really require immediate cancellation - as favored by Rust futures. It works better with cooperative cancellation, where cancellation is requested asynchronously and ongoing tasks are supposed (but not forced) to listen and follow the cancellation recommendation.
That's interesting to hear - but how much of an investment is it to climb that mountain (or hill) to the point where you're comfortable working with the async model?
I'm not the best qualified to answer this question as I do spend a lot of time reading about programming languages in general, and even though I was able to grasp Rust's async/await very fast, I probably owe it to previous knowledge of a relatively large variety of programming paradigms. I'll thus rely on other commenters that seem to agree that it's really not as hard as you would expect. In particular Rust helps a lot in making sure you don't make too many mistakes so I'd wager that learning to do correct async/await in Rust is probably easier than in, say, Javascript.
Rust's async model takes very little time to grasp. It's very explicit. Nothing runs in the background (contrast that with NodeJS). You have strong static types to help you to know when you got a Future, you can decide where/when to await it.
It's programming with threads, where you have a thread pool, pipes to put tasks onto it, and a helper function/macro. Await does this under the hood, of course.
You can do exactly the same thing, after all async is just a big auto-generated state-machine that puts jobs onto a thread pool and waits for them.
Nginx is hand crafted (artisanal!) event-driven C, which is exactly how async runtimes also work. A big loop (the event loop, usually an infinite `while` blocked on epoll()). For example NodeJS uses libuv for this.
The big advantage of first class async support is that we don't have to do this by hand. Plus it makes some optimizations easier (eg. putting things on the stack instead of assigning each event handler a slice of some global [heap allocated] structure).
Or maybe I'm simply misunderstanding what you meant. In that case could you clarify, please?
Not OP but: I found it surprisingly manageable. I think the disconnect for many people is they think they'll understand it simply by using it [0]. For me, investing a short time reading some of the design articles/documents really helped it click.
[0]: Which is fair, I wouldn't be surprised if this was the best way for some.
Using threads (or, my preference, stackful coroutines) does not prevent you from using futures for pipelining and composing computations. But it avoids having to explicitly [1] chain continuations to wait on them.
[1] I count await as explicit as it forces the awkward top level only suspend model.
The throughput increase in I/O scenarios with many tasks is due to the number of supported concurrent processes and Little's law; it has little to do with context switching time, which has a negligible impact on the throughput in these use-cases: https://inside.java/2020/08/07/loom-performance/
Low context switch latency only matters when the number of tasks is very small (their data all fits in the cache), and the workload is entirely computational. Otherwise, even the fastest implementation is ~60 ns, which is the cost of a cache-miss, and the compiler can't optimise things into a simple goto because the dispatch goes through a scheduler that has a megamorphic call-site.
So memory is much more important for I/O use-case throughput, and while it is true that the kernel doesn't commit the full stack memory on thread creation, it's misleading to think that you get good memory usage. For one, once the memory is committed, it's never uncommitted (although it can be paged out). For another, the granularity is that of a page, i.e. at least 4K, which can often be much higher than what a task requires.
> It is hard to pin down exactly how the alleged advantages would arise.
For I/O use-cases the answer is here:
> the async version uses about 1/20th as much memory as the threaded version.
This could translate to 20x throughput -- due to Little's law -- although usually less because there are other limits, like network saturation.
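As a rough, made-up illustration of the Little's law arithmetic (all numbers invented for the example): with L = λ × W (concurrency = throughput × latency) and latency W fixed by the backend at, say, 10 ms, a 1 GiB memory budget gives roughly L ≈ 1,000 requests in flight at ~1 MiB per thread, so λ ≈ 100k req/s, versus L ≈ 20,000 in flight at ~50 KiB per async task, so λ ≈ 2M req/s. That's the 20x, and it holds only until CPU or the network saturates first.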
I've had very good experience with using buffer variables to copy-"prefetch" unpredictable costly fetches, e.g. from cache lines that get touched by several cores for communication.
And I only actually use them after one iteration of whatever I'm doing, so the core can fetch the memory content without having to stall, because I don't use it until later.
I'm not sure how realistic that is inside a kernel thread scheduler, but it sure is useful in user space for task based libraries.
That's a huge help. I only need about 20 threads in Rust, some of which are compute-bound. So involving "async" is totally the wrong tool for the job. Goodbye, Tokio.
> So involving "async" is totally the wrong tool for the job.
Sadly, with so many things having gone async-first (or async-only) it's become difficult not to end up with an async runtime anyway, or not to be forced to use an async system. I wanted to build a small web-based tool for local use and didn't really find anything that was not async.
I’d happily take an async-by-default world over a world where some APIs only exist through blocking calls. A classically threaded program can easily block on a future, but wrapping a blocking call in an otherwise asynchronous program is complicated, expensive and error prone work.
> This blog post describes a proposed scheduler for async-std that did not end up being merged for several reasons.
I don't think it's a particularly good idea in the first place - it's basically an automatic watchdog-driven block_in_place(). It doesn't remove the problem of blocking in futures, it just limits the damage to the local task rather than blocking the entire executor.
That's fine in the simple case of future-per-task, but it's pretty common to be polling multiple futures concurrently within one, so it's not a general solution.
Each worker thread runs in a loop executing a queue of jobs. On every iteration it sets an atomic progress flag to true.
The runtime in which it's contained polls its workers every 1-10ms, atomically swapping in false and checking to see if the previous value was also false - if so, it steals its task queue and spins up another worker to execute it.
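In sketch form (names invented; the real async-std scheduler code differs):

    use std::sync::atomic::{AtomicBool, Ordering};
    use std::sync::Arc;
    use std::thread;
    use std::time::Duration;

    struct Worker {
        made_progress: Arc<AtomicBool>, // set to true at the top of the worker's loop
    }

    // Watchdog: every few milliseconds, swap `false` into each worker's flag.
    // If the previous value was already `false`, the worker hasn't finished an
    // iteration since the last check, so treat it as blocked.
    fn watchdog(workers: Vec<Worker>) {
        loop {
            thread::sleep(Duration::from_millis(10));
            for w in &workers {
                if !w.made_progress.swap(false, Ordering::AcqRel) {
                    // steal_queue_and_spawn_replacement(w); // hypothetical
                }
            }
        }
    }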
> though obviously only works when you can « afford » a multithreaded scheduler.
Yeah, for example in comparison actix-web only uses single threaded workers - one per core. Future in actix-web doesn’t have to be Send or Sync, and I think it’s incompatible with what async-std is doing here. That design is almost certainly one of the reasons actix-web tops phoronix
It also really doesn't scale. It'll do fine on your average <10 core laptop, but once you get on a multi-package system you're going to find you're constantly thrashing memory because it is making disruptive scheduling decisions and your pooled tasks have poor context locality.
> A classically threaded program can easily block on a future, but wrapping a blocking call in an otherwise asynchronous program is complicated, expensive and error prone work.
It’s really not though, at least as long as the parameters and results are Send. For instance Tokio has a spawn_blocking which runs the function on one of the blocking threads it spawns on-demand specifically for that use.
Meanwhile « blocking on a future » requires adding and managing an entire async runtime and its interactions with the rest of the program, and locking up the runtime is a very real possibility.
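For comparison, both directions look roughly like this with Tokio (a sketch; checksum_file is a made-up example of blocking work):

    use std::path::PathBuf;

    // Async code wrapping a blocking call: Tokio runs the closure on its
    // dedicated blocking thread pool so the async workers aren't stalled.
    async fn checksum_file(path: PathBuf) -> std::io::Result<u64> {
        tokio::task::spawn_blocking(move || {
            let data = std::fs::read(&path)?; // ordinary blocking std I/O
            let sum: u64 = data.iter().map(|&b| u64::from(b)).sum();
            Ok(sum)
        })
        .await
        .expect("blocking task panicked")
    }

    // A classically threaded program blocking on a future: this does mean
    // owning a runtime, but it's a couple of lines.
    fn checksum_sync(path: PathBuf) -> std::io::Result<u64> {
        tokio::runtime::Runtime::new()?.block_on(checksum_file(path))
    }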
I understand the desire to stave off dependencies but managing an async runtime should only be a simple function call or two. How do you end up locking up the runtime with something like that?
These days it is possible to eliminate almost all blocking calls in Linux apps. File opening was a long persisting one but io_uring fixes that. Async sockets and file i/o to already-open fd's have been around forever.
You might like Zig's attitude towards this question. Async/sync decision is a single compile-time decision there. The jury's still out whether that's a good idea though.
Other sync functions can use this asynchronous IO completion code in a synchronous style (as this snippet shows) and still get all the zero-syscall and asynchronous performance of io_uring. What this is actually doing under the hood is filling SQEs into io_uring's submission queue ring buffer and then later reading completion events off io_uring's completion queue ring buffer, so it's fully asynchronous in the I/O sense but this hasn't spilled out and leaked over into the control flow. The control flow is as it should be, nice and simple and synchronous.
Beyond this, Zig still allows you to explicitly indicate concurrency with the `async` keyword, for example if you wanted to run multiple async code paths concurrently.
But the crucial part is that Zig's async/await does not force function coloring on you to do all of this: https://youtu.be/zeLToGnjIUM
Pretty incredible on Zig's part to be able to pull this off. Huge kudos to Andrew Kelley. Also, thanks to Jens Axboe and io_uring, what you saw above was first-class single-threaded or thread-per-core, there's no threadpool doing that for you, it's pure ring buffer communication to the kernel and back, no context switches, no expensive coordination. Pure performance. There's never been a better time for Zig's colorless async/await. The combination with io_uring in the kernel is going to be explosive. It's a perfect storm.
Ease of understanding multithreaded code and wait on results or perform standard control flow constructs in a multithreaded environment?
This is a great example in Node on useful combinators that with async await make it easy to express parallel programming concepts with familiar tools. No manual IPC, no fork/join child PID/thread ID handling, etc.
The same abstractions (or many of them) exist in Rust, but I think the above is illustrative of the ways we can combine async object returning functions and then use await to hide the complexity of the state machines needed to drive them.
That this abstraction that makes code easy to read and write also performs better is the icing on the cake. The former prevents bugs and keeps code quality high, and that is worth much more.
I don’t know about Rust but in every other language I’ve used threads were easy to use and understand, except when it came to some bits like signals, which at least on Linux are no longer a big problem. Main thread runs a hot loop to look for data to process, then hands it off on a queue to a worker thread out of a pool. That thread is then solely responsible for processing the event and passing the result either back to the main thread or to the next thread in the pipeline via the same queue mechanism. Last thread to handle the result or the exception frees the resources. It might not be ergonomic for all types of code but it certainly isn’t hard to understand what everything is doing and easy enough to debug since each thread can be tested individually to check its functionality.
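In Rust that shape is essentially a channel plus a handful of threads (a sketch; the shared receiver behind a mutex is the same pattern the Rust book uses for its thread pool example):

    use std::sync::{mpsc, Arc, Mutex};
    use std::thread;

    fn main() {
        let (tx, rx) = mpsc::channel::<String>();
        let rx = Arc::new(Mutex::new(rx)); // workers take turns pulling jobs

        let workers: Vec<_> = (0..4)
            .map(|id| {
                let rx = Arc::clone(&rx);
                thread::spawn(move || loop {
                    // Lock only long enough to pull one job off the queue.
                    let job = match rx.lock().unwrap().recv() {
                        Ok(job) => job,
                        Err(_) => break, // sender dropped: clean shutdown
                    };
                    // Each worker owns the job it received and is solely
                    // responsible for processing it.
                    println!("worker {id} handled {job}");
                })
            })
            .collect();

        for i in 0..10 {
            tx.send(format!("event #{i}")).unwrap();
        }
        drop(tx); // close the queue
        for w in workers {
            w.join().unwrap();
        }
    }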
You just described having the main thread have to wait on multiple threads to complete processing of data, worker threads handling signals and IPC and moving data between threads, and then some sort of shared signaling to ensure resources are freed.
So, code potentially laden with use after frees, double frees, shared and mutable data, and so on.
No offense to you, but I would be leery of trusting that code in any languages except a handful. Certainly not C/C++, and if it were written in Rust, I would hope it would use a thread combinator library and channels.
It certainly is easy to make a mess of it with C/C++. It can be done well and safely, but there are no guard rails. I have written this code in C and trusted it to run as intended, and it did. I wouldn't stake human lives on it, but that wasn't my requirement at the time. Valgrind and other code analysis tools certainly didn't complain, and I had no memory leaks. Rust didn't exist at the time. One specific project wrangled about 1000 worker threads, a logging thread, a network server thread, a signal processing thread, and a main control thread, to the tune of a very large number of requests per second on commodity hardware. In running it for, I think, 4 years, I had one memory leak initially that Valgrind quickly found. Could probably write that service with a lot fewer LOCs today with a language like Rust, of course, and with all kinds of memory safety. But at the time it worked well. Oh, and it had to do all kinds of fun low-level networking stuff with elevated privileges, so double danger :)
Sure, threads are easy to understand. The difficulty is when you get a concurrency bug, but that can happen with single-threaded async/await code anyway.
Also, threads are definitely not easy to use in all languages. E.g. C++ gives you very little help (no channels, for example), and JavaScript makes starting threads difficult, while moving/sharing memory is limited to primitive arrays.
A queue implementation in C is easy to create and understand if you don’t have a library for it handy. Combined with a mutex and/or a spin lock and once you’ve grokked pthreads’ mental model you should have the primitives. But those are all guns that shoot both ways if you aren’t careful.
This is a confused notion. A useful way to think of Go and Erlang is that they automatically and transparently insert async/await each time you call a function that performs I/O. Messaging between different application tasks is completely orthogonal and can have use cases in languages with async/await as well.
A. Implicit messaging using the language's function syntax (async/await).
B. Direct messaging using a message passing feature of the runtime (Erlang, Golang)
Note: I mean "messaging" in the context of a single OS process, that possibly has many threads (so within a single language runtime).
Async/await is still implicit messaging, but it appears like a regular function call - which in my opinion is easier to understand. Using function args/return for input/output is something every developer already knows.
In contrast, Erlang and Golang require you to use some type of messaging feature in addition to functions.
> A useful way to think of Go and Erlang is that they automatically and transparently insert async/await each time you call a function that performs I/O
The part they are missing from async/await is the ability to easily get return values without messaging, and do this recursively for a large tree of functions.
E.g. getting a return value from `go x()` requires messaging, but with async/await you could do `const p = x(); const ret = (await p); // return value received at a later time with no messaging.`
Both of them will require you to create some type of messaging topology to return the values (which makes your program a mixture of (regular functions + messaging features) vs async/awaits "everything looks like a function").
> The part they are missing from async/await is the ability to easily get return values without messaging, and do this recursively for a large tree of functions.
No, they do not. In Elixir for example if I call:
bytes = File.read!("filename.txt")
`bytes` will have the data returned from the function call immediately, with no need for message passing or awaiting the result. Under the hood, it is still asynchronous evented I/O. If I want to explicitly await for flow-control reasons (await all of, or one of, multiple events), that is available in the stdlib in the `Task` module (e.g. `Task.async/1` and `Task.await/1`).
Although this emulates async/await (AA), underneath the "async"-emulated functions is message passing that must keep track of the connection between requests and responses at runtime (e.g. with state mapping request IDs to response IDs).
I think the key issue is that the inputs and outputs are disconnected in the static program text (and only connected dynamically at runtime).
Two contexts that matter for understanding how a system transitions between states are:
1. Program editing/reading.
2. Runtime.
I think AA is superior for understanding the system as a whole in both of these contexts, because at edit time the IDE's jump-to-definition/show-all-usages lets you understand every function that will be called, and at runtime you can get a stack trace to understand where the current function came from and where it is going.
With message passing runtimes, both 1 and 2 require extra mental models on the part of the programmer, because they also need to understand the network topology (which either is not possible statically, or requires extra tooling on top of functions).
Message passing breaks down your system into CSP's, which makes it easy to understand each sync process, but hard to understand the whole system, as the same program-writing-process that allowed you to break down your components is working against you when you need to put them together again to understand the whole system.
I could be wrong as I have not used modern IDE's or debugging tools with message passing runtimes lately.
It's a runtime for running lightweight tasks (`Future`s, async functions) on top of it. What is not async about it? And of course it still needs posix threads. The executor needs to run somewhere, and the only somewhere that an OS offers is a thread.
Sure, but I didn't think anything about async functions implied running tasks. Isn't it just syntactic sugar over futures? You certainly don't need to use the tokio runtime in order to use async functions.
So, it's not clear why you'd abandon the async syntax just because you're compute bound.
> A context switch takes around 0.2µs between async tasks, versus 1.7µs between kernel threads. But this advantage goes away if the context switch is due to I/O readiness: both converge to 1.7µs.
This is a big surprise.
If you look at the Techempower web benchmark [1], the performance of actix-web is about 20x higher than that of Rocket.
The common explanation is that actix-web is async and hence much faster than Rocket which relies on kernel context switching.
But if Rust async and kernel threads have the same switch time, as shown by this benchmark, then why is actix-web so much faster than Rocket?
I work in ultra-low latency space and agree with GP.
This comparison makes no sense as OS-level context switch is completely different from a task-switch within the same native thread. The Rust ones from that benchmark are essentially fibers, not threads. You will see similar performance for switching fibers if well implemented in Java, C++ or other natively compiled language. This has nothing to do with Rust.
"Linux thread context switch time" is a meaningless metric, since Linux will switch thread context regardless of what you choose to run on your computer.
Any "async" switches are additional overhead; you don't get to not have kernel preemption just because your Rust thread is now switching contexts "asyncly".
There are benefits to having an additional user-mode scheduling mechanism inside your kernel thread, but saving CPU cycles isn't one of them.
> switches thread contexts regardless of what you're running
my point was that thread context switches caused by preemption happen at an entirely different time scale than the rate of context switches caused by syscalls (if the system is doing any meaningful level of IO)
How does Rust async compare to Goroutine, Erlang threads, Javascript async, Java async in performance and memory usage? Is there any benchmarks for that?
There were benchmarks and a discussion on this on reddit recently comparing goroutines to tokio.
If I recall correctly tokio was slower than goroutines but if you set the right settings it could be almost as fast.
https://www.reddit.com/r/rust/comments/lg0a7b/benchmarking_t...
I'd think mostly similar. Goroutines are "stackful" coroutines, though, so their memory use will be higher. They have an interesting stack copying model, so I'm not sure if they require as many pages as POSIX threads do. (Having a "denser" memory space and no guard page requirement would mean you could use huge pages and thus have much less TLB pressure.)
The discussion should feature prominently somewhere on top that the comparison is between the Tokio async runtime and Linux threads. Reason being, people not familiar with Rust may assume that the discussion applies to Rust async in general, when it doesn't necessarily. Sure, in practice Tokio is pretty close to being the de-facto async runtime in Rust. But it's not the only one, as Rust's async language constructs allow for different runtime implementations that may be optimized for different use cases.
I feel like this analysis is missing some more nuanced points about stack memory.
Yes, pages will only be allocated for a thread's stack when the thread actually uses them. However, the thread does not release said memory afterwards. The memory can only be reused by the same thread. If a thread ever once does something that temporarily allocates a bunch of stack space, then it forever consumes that space going forwards even when no longer needing it. If you have 10,000 threads and each one of them happens to, at some point in its lifetime, use 1MB of stack space and frees it, then you are now using 10GB of RAM on mostly-unused pages.
Now you might say "what on Earth would ever use 1MB of stack???", but the problem is, in normal programs with few threads, there's no problem with a function temporarily using a ton of stack, and so random things feel free to do so. Maybe some library call you make likes to allocate a temporary buffer on the stack and you don't even know it. There's also normally no problem with doing some deep recursion every now and then, so it happens. Often, stack allocation is data-dependent (e.g. recursive descend parsing). So if you try to strictly limit your stack space then you risk running into random stack overflows or maybe even security issues. And if you do find a limit that works, it's still probably much larger than the average usage, so you're still wasting a bunch of memory.
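And if you do decide to cap per-thread stacks, the knob exists, but the tradeoff is exactly as described (a sketch; 64 KiB is an arbitrary guess at a "limit that works"):

    use std::thread;

    fn main() {
        let handle = thread::Builder::new()
            .stack_size(64 * 1024) // arbitrary cap: too small, and deep recursion
            // or a big stack buffer in some library call will overflow
            .spawn(|| {
                // worker body
            })
            .expect("failed to spawn thread");
        handle.join().unwrap();
    }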
IIRC, Go avoids this problem by growing and shrinking goroutine stacks (segmented stacks originally, copied contiguous stacks nowadays), but C/C++/Rust do not. (I think Rust tried segmented stacks at one point, but later gave up on them because of the complexity?)
In contrast, async tasks only hold onto the memory they actually need for live data at any particular moment. If an async task invokes some deeply-nested function and uses a bunch of stack space, it doesn't really matter, because all the tasks run on the same thread, so the next task to call that function reuses the same pages rather than allocating new ones.
(There's actually a similar issue with heap space. Memory allocators that perform reasonably with multiple threads typically maintain per-thread freelists, so if you have lots and lots of threads, you end up with a bunch of freed memory stuck in those freelists. Some allocators, like the new tcmalloc, are starting to use per-core freelists instead, which may avoid this problem.)
Well, here's what it looked like on my MacBook Pro with M1/16GB:
    M1-MBP async-brigade % time cargo run --release
    500 tasks, 10000 iterations:
    mean 761.403µs per iteration, stddev 8.929µs (1.522µs per task per iter)
    cargo run --release 3.21s user 4.60s system 99% cpu 7.818 total

    M1-MBP thread-brigade % time cargo run --release
    500 tasks, 10000 iterations:
    mean 787.149µs per iteration, stddev 67.289µs (1.574µs per task per iter)
    cargo run --release 0.94s user 7.19s system 100% cpu 8.081 total
I ran it a few times and the numbers came up rather similar each time: async-brigade finished in 760.273µs-764.928µs while thread-brigade took 784.510µs-796.323µs.
As macOS doesn't have taskset, I can't easily set affinity. I tried the workaround documented elsewhere of using Xcode's Instruments to reduce the number of CPU cores, but the setting would always re-enable itself at 8 cores, so that didn't work.
In the past couple of years I started to use a heavier functional style for my code.
What I noticed is that the syntactic benefits of async/await matter less when most of your application logic lives in pure functions, since you greatly reduce the amount of code inside async functions.
When I started using async/await in JS 4-5 years ago I thought: "How could we have lived without this for so long?". These days I don't care much about it.
To me it looks like the main advantage of async is memory usage, which is kind of expected because of the overhead of a thread. But if you do not need lots of threads, it doesn't look like there is a huge benefit to going async. Or am I missing something here?
That was my conclusion as well. I have only found async useful in situations where a service has to deal with a large number of incoming requests, e.g. a web server.
I think the async benchmark could be faster still when pinned to a single core if a single-threaded runtime were used, and possibly if single-threaded channel implementations were used, but then it's becoming a bit academic. Really, what async gives you is a programming style that's very similar to using blocking sockets but achieves select()-like performance when doing I/O. That, and it means you don't need a special thread for timers (or, even worse, a thread per timer), as that's hidden away by the async runtime implementation and just works.
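For what it's worth, a single-threaded runtime is just a builder option in Tokio; a sketch (assuming Tokio 1.x with the relevant features enabled) looks like this:

    use std::time::Duration;

    fn main() {
        // Build a current-thread runtime: every task is scheduled cooperatively
        // on this one OS thread, so no cross-core task migration happens.
        let rt = tokio::runtime::Builder::new_current_thread()
            .enable_all() // enable the I/O and timer drivers
            .build()
            .expect("failed to build runtime");

        rt.block_on(async {
            let handle = tokio::spawn(async {
                tokio::time::sleep(Duration::from_millis(10)).await;
                42
            });
            assert_eq!(handle.await.unwrap(), 42);
        });
    }

Whether this actually beats the multi-threaded scheduler for the bucket-brigade workload is exactly the kind of thing you'd want to measure rather than assume.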
It could be very interesting to see similar comparisons with other operating systems like FreeBSD with kqueue or DragonflyBSD with Light Weight Kernel Threads.
what kind of problem were you trying to solve that you found sizing a thread pool to be difficult? generally when I've worked on high performance server code I've been coding with a target machine in mind, so it's more a matter of mapping the thread pool size to the resources available on that machine. but I'm interested to hear about circumstances where it wouldn't be easy.
If you have downstream nodes which may have large amounts of latency in some scenarios, then you may need a huge thread pool.
If you add a huge thread pool, and then those downstreams don't have a large latency, then you end up accepting a huge amount of work and then are CPU starved.
So in order to correctly size your thread pool, you need to understand all your downstream latency, and adapt to it.
Compared to an async runtime, which just handles this scenario, it's very painful.
Even if you get this roughly right, the scheduler is very unhappy when you have lots of threads - it tends to make incorrect scheduling decisions.
You have a threadpool with X threads. You dispatch Y tasks. X of them run for 5 minutes. That means the remaining Y-X tasks are delayed by 5 minutes despite low CPU utilization.
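To make that concrete, here's a toy sketch in Rust (my own illustration; seconds stand in for the 5 minutes): X workers pull jobs off a queue, the first X jobs are long-running, and the remaining Y - X jobs sit behind them even though the machine is otherwise idle.

    use std::sync::{mpsc, Arc, Mutex};
    use std::thread;
    use std::time::{Duration, Instant};

    fn main() {
        let x = 4; // pool size
        let y = 8; // tasks dispatched
        let (tx, rx) = mpsc::channel::<Box<dyn FnOnce() + Send>>();
        let rx = Arc::new(Mutex::new(rx));

        // A deliberately naive fixed-size pool: each worker pulls jobs off the
        // shared channel until the sender is dropped.
        let workers: Vec<_> = (0..x)
            .map(|_| {
                let rx = Arc::clone(&rx);
                thread::spawn(move || loop {
                    let job = match rx.lock().unwrap().recv() {
                        Ok(job) => job,
                        Err(_) => break,
                    };
                    job();
                })
            })
            .collect();

        let start = Instant::now();
        for i in 0..y {
            let long = i < x; // the first X tasks occupy every worker
            let job: Box<dyn FnOnce() + Send> = Box::new(move || {
                if long {
                    thread::sleep(Duration::from_secs(1)); // stand-in for "5 minutes"
                }
                println!("task {} finished at {:?}", i, start.elapsed());
            });
            tx.send(job).unwrap();
        }
        drop(tx);
        for w in workers {
            w.join().unwrap();
        }
    }

If those long tasks were actually blocked on slow downstream I/O, an async runtime would park them without tying up worker threads, which is the point being made above.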
TLDR: Async code will have much lower CPU utilization compared to threaded code. An async version of a program might run just as fast as a threaded one, but it will use fewer system resources overall. The threaded version will be easier to write.
You can also have lower RAM overhead per thread if you choose a smaller stack space. Many programs will run fine with a smaller stack space, BTW.
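For example, in Rust the standard library lets you request a smaller stack per thread (a sketch; 64 KiB is an arbitrary illustrative number, and you'd have to verify your workload never exceeds it):

    use std::thread;

    fn main() {
        let handles: Vec<_> = (0..1_000)
            .map(|i| {
                thread::Builder::new()
                    .stack_size(64 * 1024) // the default reservation is usually several MiB
                    .spawn(move || {
                        // Keep per-thread stack usage shallow: no deep recursion,
                        // no large on-stack buffers.
                        i * 2
                    })
                    .expect("failed to spawn thread")
            })
            .collect();

        let sum: i32 = handles.into_iter().map(|h| h.join().unwrap()).sum();
        println!("sum = {}", sum);
    }

Note this mainly caps the virtual reservation and the worst case; physical memory is still allocated page by page as the stack is touched, as the replies below point out.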
----
Years ago I had to build a load simulator in C#. The CTO looked at me and told me that it had to simulate 100,000 clients; thus it had to be async.
He arranged for me to have a very powerful computer to run the load simulator.
I originally wrote non-blocking code. The non-blocking code had very low load at 100,000 clients, but I hit a problem with a difficult-to-understand edge case.
Because we only had a weekend to do load testing, I refactored the load simulator to be threaded. It only took me 20 minutes or so. The problem with the difficult-to-understand edge case went away, but CPU usage went up dramatically.
We had to tune the .Net framework to use a much smaller stack space.
In the end, I was able to have 100,000 threads to run the load simulator. CPU usage and RAM usage were very high, but the load simulator ran fine.
If I had more time, I would have taken the time to understand the edge case and kept the non-blocking code. The program would then have used far fewer system resources while running just as fast.
> You can also have lower RAM overhead per thread if you choose a smaller stack space.
No, this is about as low as it gets. As the author explained, "the kernel only allocates physical memory to a stack as the thread touches its pages, so the initial memory consumption of a thread in user space is actually only around 8kiB."
The smallest possible page size (on x86-64) is 4 KiB, and you can't share pages between thread stacks, [1] so you can't go below 4 KiB of physical memory usage per thread. I'm not exactly sure how the author got to 8 KiB; maybe they meant "for each userspace thread" rather than "memory used in userspace" and are counting kernel memory too. I'm pretty sure the kernel uses at least 4 KiB per userspace thread (for a stack of its own, among other overhead).
Green threads won't take you below 4 KiB either, for the same reason.
[1] Without some custom ABI that guards against stack overflow in a different way. Golang has a custom ABI (I'm not sure exactly if this is why), and interoperability with C suffers, so this isn't an approach I'd love for Rust.
Are your users going to be running your application on laptops? Will they have the same "conserve power by limiting performance" going on? If so, that is _exactly_ the environment you want to do performance work in, generally speaking.
It's about having a consistent measurement baseline. Say you run your benchmark once, then thermal throttling kicks in, then you run it again, and it takes twice as long. Is your code actually slower now? Should I wait until the fan turns off before I run it again? That data is noisy and useless. Take your measurements on a server or desktop with sane thermals and a full-size fan.
If you speed things up by 10% on your server, they'll get 10% faster on your laptop as well.
Yes, you have to be very careful with measurements, I agree.
> If you speed things up by 10% on your server, they'll get 10% faster on your laptop as well.
Depends on the speedup and techniques to achieve it. For example, speeding things up via more parallelism can lead to wall-clock improvements on servers but not laptops, precisely because the latter just end up doing more thermal throttling....
Ideally, you want to measure both ideal hardware and actual-user-hardware; often speedups on one will not be visible on the other and vice versa.
generally speaking, the advantage of async io is strongest for high performance server applications, especially in regards to the cpu usage required relative to the amount of io stuff you can do. with that in mind, "users running your application on laptops" would not be the most common case.
Yes, if your app is a high performance server app, measure in that environment.
But user-facing apps (the sort people run on laptops, say) have async I/O as table stakes, really. It's not even about throughput or CPU cycles: it's about the fact that if you have I/O latency on any thread the user interacts with the user experience will be terrible.
Now in practice maybe that means "just make the I/O async, but the performance details of that don't really matter too much".
Anyway, the overall comment was about performance profiling in general, not just async I/O.
You have to do so much more to be able to reliably measure events on the scale of nanoseconds. You need to lock C-states, disable the P-state driver, isolate CPUs, offload RCU callbacks, affinitize your tasks, enable tickless (low-tick) mode, skew the hrtimer ticks, make sure you use the TSC clocksource, set the CPU governor, get rid of vmstat updates, set the correct idle driver, disable audits and watchdogs, and much, much more.
If you only want to instrument a handful of events, yes. But for microbenchmarks that you can run for many iterations to get min/max/stddev (such as the benchmarks in the article), it's much easier. Disabling turbo is often sufficient to lower the variance far enough that old and new code are clearly distinguishable.
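As a rough sketch of what that looks like in practice (nothing from the article, just a std-only harness in the spirit of its per-iteration numbers):

    use std::time::Instant;

    fn mean_and_stddev(samples: &[f64]) -> (f64, f64) {
        let n = samples.len() as f64;
        let mean = samples.iter().sum::<f64>() / n;
        let var = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
        (mean, var.sqrt())
    }

    fn main() {
        let iterations = 10_000;
        let mut samples = Vec::with_capacity(iterations);

        for _ in 0..iterations {
            let start = Instant::now();
            // ... the operation under test goes here ...
            std::hint::black_box(42u64.wrapping_mul(7));
            samples.push(start.elapsed().as_nanos() as f64);
        }

        let (mean, stddev) = mean_and_stddev(&samples);
        println!("mean {:.1} ns per iteration, stddev {:.1} ns", mean, stddev);
    }

With turbo disabled and enough iterations, the stddev column usually tells you quickly whether a difference is noise or a real change.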
Keep in mind that a new async task doesn't create a new thread. So yes, "not creating a new thread" is 3x faster than "creating a thread". If the app layer can context-switch using language-level constructs and do cooperative switching, then yes, one gets the 3x benefit. IMHO, whether the async executor and scheduler are performant enough to manage the tasks is what one should actually worry about.
You're only the second commenter on this thread to notice this.
The benchmark compares fibers to threads and has little to do with Rust. You will see the same numbers for a well-implemented fiber library in C++, Java, or most other compiled languages.
The title is completely misleading, especially for most people who are not aware of this important distinction.
I'm confused. If many async tasks are run on a single thread, what does the thread do when it is blocked waiting for things to happen? Does it sleep? If so, a context switch takes place anyway. If not, what is the impact on GUI applications? If I have a main thread managing my GUI, should I spin up a new thread to run my async tasks?
A modern microcontroller/microprocessor is inherently event driven (for example, on ARM, at the very bottom of the call stack there is a wait-for-event (WFE) or wait-for-interrupt (WFI) instruction).
If async needs to be polled to run ("Futures are inert in Rust and make progress only when polled"[1]), does that mean my processor has to stay busy running these async tasks instead of waiting (WFE or WFI) as it would after a native call to one of the operating system functions (i.e. recv() on a socket)? What is the impact on embedded battery-powered systems?
Polling is only the logical description. In reality, the given task is just marked to be woken up later: at some later point, while the same OS thread is executing something else, the executor determines that the idling task can be woken up. "Waking" is nothing but that same OS thread switching to execute whatever it is waking up.
The main idea is that a scheduler/executor at the runtime/language level, one that knows about the state of the program, (a) can save and restore far less state than an OS context switch does, and (b) being cooperative, doesn't pay the cost of lots of unnecessary preemptions.
But there is the poll() function, which returns either the result of the operation or Pending. So it's more than just logical, correct? I mean, if I (or the executor) don't call poll(), nothing happens...
> OS thread now switching to execute whatever it is that it is waking up.
This is what confuses me. As I see it (and what I understand from reading), async/await splits a routine into a (very smart) state machine.
I assume that there is no magic underneath. I mean, I could build the same state machine by hand if I wanted to, within the constraints of what the OS makes available for context switching (APIs for waiting and synchronizing).
For an (OS/native) thread that has to wait for data on a socket, you basically have two options: block in recv(), or poll recv() without a timeout.
Waiting on recv() would block (so no other code in my thread can run while waiting), so I guess the state machine needs to poll recv() instead (I believe this is what this[1] example does).
In order not to block my thread, the executor either spins up its own thread or has to wait for my thread to poll() it.
in rust, there is no built-in runtime, so it depends on which one you are using. the runtime (e.g. tokio) is responsible for polling the future.
for network io, behind the scenes this is most likely using epoll system calls. epoll mitigates the context-switch problem in a few ways, mostly because there is only one stack context to notify about new io events, instead of many.
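To make the poll/wake contract from the last few comments concrete, here's a minimal hand-written future (using the futures crate's block_on purely as an illustrative executor; Tokio drives futures the same way):

    use std::future::Future;
    use std::pin::Pin;
    use std::task::{Context, Poll};

    struct YieldOnce {
        polled: bool,
    }

    impl Future for YieldOnce {
        type Output = ();

        fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
            if self.polled {
                // Second poll: the operation is "ready", so return the result.
                Poll::Ready(())
            } else {
                self.polled = true;
                // Not ready yet: register the waker so the executor knows to
                // poll us again, then return Pending. The future does nothing
                // in between; it's just inert state.
                cx.waker().wake_by_ref();
                Poll::Pending
            }
        }
    }

    fn main() {
        // The executor drives the future by calling poll() until it's Ready.
        futures::executor::block_on(YieldOnce { polled: false });
        println!("future completed after being polled twice");
    }

Nothing runs between the two polls; the waker is how "I/O is ready" (e.g. an epoll event inside the runtime) gets translated into "poll this task again".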
Better or equal in all the ways measured. But some things aren't measured, maybe because they're obvious to the author or because they're harder to quantify.
* Rust's async ecosystem [1] adds a lot of complexity over simple threaded code.
* Rust's async ecosystem doesn't interoperate as easily with C libraries written in a simple threaded way. (And it's debatable which interoperates more easily with C libraries written with a different event loop.)
* async tasks can't be preempted, so concurrency will fall off a cliff if they run on O(cpus) threads and involve long-running computations or accidental blocking (the usual workaround is sketched after this comment).
I think it's reasonable to ask whether these numbers are enough of an improvement to justify all that, particularly given the disappointing "this advantage goes away if the context switch is due to I/O readiness".
And to go back and argue pro-async for a moment, io_uring might eliminate that disappointing caveat.
Then again, on the pro-thread side, there's Google's interesting fibers model that might solve some of these performance issues. [2] Also, "~17µs for a new kernel thread" is the wrong number, since you can avoid that cost with a simple thread pool.
Personally I think some things are better written as async, but it's a mistake to impose it on everything. For example, if you're writing a web app in Rust, I think you're usually better off writing threaded request handlers and having a mechanism for them to interact with the async hyper code. The hyper code is better off as async because an Internet-facing server might have an enormous number of connections in keepalive state.
[1] or maybe I should say ecosystems, plural, given the current tokio vs async-std divide.
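On the third point above, the usual escape hatch (sketched here, assuming Tokio 1.x with the full feature set) is to push long-running or blocking work onto the runtime's dedicated blocking pool:

    #[tokio::main]
    async fn main() {
        let digest = tokio::task::spawn_blocking(|| {
            // CPU-heavy or blocking work runs on Tokio's blocking pool, so it
            // can't starve the async tasks sharing the worker threads.
            (0u64..10_000_000).fold(0u64, |acc, x| acc.wrapping_add(x * x))
        })
        .await
        .expect("blocking task panicked");

        println!("digest = {digest}");
    }

That keeps the O(cpus) worker threads free for cooperative tasks, at the cost of having to remember to do it, which is exactly the kind of added complexity the list above is pointing at.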
If you elide a bounds check from a function but still spend a billion cycles in a loop, you've made your code run ever so slightly faster but gained nothing in the big picture.
It sounds to me like comparing apples and oranges, though. Parallelism (threads) and concurrency (async in Rust) are not the same thing and can actually be used in combination.