Great article. Can confirm, writing performance focused C# is fun. It's great having the convenience of async, LINQ, and GC for writing non-hot path "control plane" code, then pulling out Vector<T>, Span<T>, and so on for the hot path.
One question, how portable are performance benefits from tweaks to memory alignment? Is this something where going beyond rough heuristics (sequential access = good, order of magnitude cache sizes, etc) requires knowing exactly what platform you're targeting?
Author here. First of all, thanks for the compliment! It’s tough to get myself to write these days, so any motivation is appreciated.
And yes, once all the usual tricks have been exhausted, the next step is looking at the cache/cache line sizes of the exact CPU you’re targeting and dividing the workload into units that fit inside the (lowest level possible) cache, so it’s always hot. And if you’re into this stuff, then you’re probably aware of cache-oblivious algorithms[0] as well :)
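That workload-splitting idea (loop tiling / cache blocking) can be sketched roughly like this; the tile size here is an illustrative guess, not a tuned value for any particular CPU:

```cpp
#include <cstddef>
#include <vector>

// Illustrative sketch of cache blocking (loop tiling): process a large
// matrix in tiles small enough to stay hot in a low-level cache.
// TILE is a placeholder; a real implementation derives it from the
// target's actual cache sizes.
constexpr std::size_t TILE = 64; // 64x64 doubles = 32 KiB, roughly L1-sized

void transpose_blocked(const std::vector<double>& src,
                       std::vector<double>& dst, std::size_t n) {
    for (std::size_t bi = 0; bi < n; bi += TILE)
        for (std::size_t bj = 0; bj < n; bj += TILE)
            // Work on one TILE x TILE block at a time, so both the rows
            // being read and the columns being written fit in cache.
            for (std::size_t i = bi; i < bi + TILE && i < n; ++i)
                for (std::size_t j = bj; j < bj + TILE && j < n; ++j)
                    dst[j * n + i] = src[i * n + j];
}
```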
Personally, I’ve almost never had the need to go too far into platform-specific code (except SIMD, of course); doing all the stuff in the post gets you 99% of the way there.
And yeah, C# is criminally underrated, I might write a post comparing high-perf code in C++ and C# in the future.
>> C# has an awesome situation in here with its support for value types (ref structs), slices (spans), stack allocation, SIMD intrinsics (including AVX512!). You can even go bare-metal and GC-free with bflat.
There's been a really solid effort by the maintainers to improve performance in C#, especially with regard to keeping stuff off the heap. I think it's a fantastic language for doing backends in. It's unfortunate that one of the big language users, Unity, has not yet updated to the modern runtime.
One other trick I use reasonably often is using something more complicated than AoS or SoA layouts. Reasons vary (the false sharing padding in your article is one example), but cache lines are another good one. You might, e.g., want an AoSoA structure to keep the SoA portion of things on a single cache line if you know you'll always need both data elements (the entire struct), want to pack as much data in a cache line as possible, and also want that data to be aligned.
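As an illustration of the AoSoA idea (sketched in C++ rather than C#, with an 8-lane block width chosen purely as an assumption so that two float fields exactly fill a 64-byte line):

```cpp
#include <cstddef>

// Illustrative AoSoA layout: instead of one big SoA, data is grouped
// into fixed-width, cache-line-aligned blocks so the x's and y's a loop
// needs together always land on the same line.
constexpr std::size_t LANES = 8; // 8 floats * 4 B * 2 fields = 64 B

struct alignas(64) ParticleBlock {
    float x[LANES]; // first half of the cache line
    float y[LANES]; // second half, brought in by the same line fill
};

// Accessing logical element i: block i / LANES, lane i % LANES.
inline float sum_xy(const ParticleBlock* blocks, std::size_t i) {
    const ParticleBlock& b = blocks[i / LANES];
    return b.x[i % LANES] + b.y[i % LANES];
}
```

The lane width (and whether you pack two fields or four) is exactly the kind of knob you'd retune per target.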
No. This is far beyond the capabilities of current AI, and will remain so for the foreseeable future. You could let your model of choice churn on this for months, and you will not get anywhere. It will be able to reach a somewhat working solution quickly, but it will soon reach a point where for every issue it fixes, it introduces one or more new ones. LLMs are simply not capable of scaffolding complexity like a human can, and they lack the clarity and rigor of thought required to execute an *extremely* ambitious project like performant CUDA to ROCm translation.
I don't think it really is, especially not if it's turned into a system, with multiple prompts, verification, etc.
Humans struggle with IMO problems, and this kind of kernel translation is easier for humans than that. There's probably also more training data available, and it's a problem where the system can get feedback by simply running the code and measuring memory use, runtime, etc.
It'd have to be a system, and no one has developed it yet, but I think it can be done with present LLMs as the core mechanism. They'd just need to be trained with RL on this specific problem.
Anyone with a good LLM, from Google to Mistral could probably do this, but it'd be a project.
I have a tiny Hetzner VPS (2 vCPUs, 2 GB RAM) in their US West datacenter that costs me $5.59 a month. I get 1 TB a month of free outgoing bandwidth and unlimited incoming bandwidth, plus additional outgoing bandwidth at a rate of $1.20 per TB. I host my personal project's git LFS server there, a file server, and a Caddy instance that proxies over Tailscale to a more powerful box in my apartment. It's a great homelab architecture and I couldn't be happier with it. Thanks Hetzner!
They still have VPSes with 2G of RAM? I'm checking the cloud price page* and you can get CX23 with 2 vCPU, 4G RAM and 20TB of traffic (seems to say that it's 1TB for US) for 3,49€/month (~4 USD).
You can save an additional 0,50€ if you go with IPv6 only.
Maybe it's location dependent, I can't get it to show me prices for US.
Hi there! At the top of the website, you should be able to choose euros or dollars and the VAT rate. When people become customers and create a full customer account, we ask them to choose between euros or dollars, and then, using the customer's location from their account, we automatically apply the correct VAT. So customers see the correct price on the interface where they create their orders, but you may have to set it manually on the website. --Katie
Yep, just checked, 2 GB RAM. That CX23 sounds like a great deal, 20 TB of free outgoing is ridiculous. But I live in the US West and I rarely hit my 1 TB free bandwidth cap anyway, so the added latency isn't worth it.
I've been happily playing Overwatch 2 on Linux for a couple months now. I need gamescope to get it to play nicely with my multiple monitors, and it crashes maybe once a month, but performance is great and I have no major complaints. I'm never going back to Windows, except for work where it isn't optional :(
Ok, maybe I oversold this a little bit. It's running smoothly now; getting it to run smoothly was not easy. I'm on Ubuntu. I spent a few days in a debug loop: run Steam from the terminal to get a log stream, keep an eye on CPU and GPU utilization and temperature, and futz around in the training range or vs AI bots (more "realistic" than the training range). Identify which components of the system aren't performing up to spec. CPU running hot? GPU not being utilized? Steam emitting warning messages? If the hardware all looks good, it's probably a software problem somewhere. Identify, then fix. Rinse and repeat until Linux performance is in the same league as Windows performance.
Things I'd try:
1. Check in game graphics settings
2. Update graphics drivers to the recommended version (may be non-trivial, I had to update my kernel version)
3. Experiment with different proton versions, including proton GE
4. Experiment with different Direct X versions (in game option)
5. Make sure CPU cooler is running
6. Make sure GPU is being used
7. Use gamescope to configure a virtual monitor that exactly matches the capabilities of your physical monitor
I no longer support Blizz, so I can't weigh in specifically, but: have you tried ProtonGE? There are also Proton forks, such as CachyOS's, that support Wayland directly (Wayland support is in WINE, but not Proton yet) - otherwise it might be relayed through XWayland?
Also, try `LD_PRELOAD="" %command%` to disable steaminput, which can cause input stuttering after around 45min on some machines (such as mine).
Python and JavaScript (in the browser), due to their single-threaded nature. C++ too, as long as you have a std::atomic on the left-hand side (since it overloads the operator).
There is NB_INPLACE_ADD... but I'm struggling to find enough details to be truly confident :\ possibly its existence is misleading other people (thus me) to think += is a single operation in bytecode.
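For the C++ case at least, this is easy to demonstrate: std::atomic's overloaded += is a single atomic read-modify-write (fetch_add), so concurrent increments don't lose updates the way a plain int would. A minimal sketch:

```cpp
#include <atomic>
#include <thread>
#include <vector>

// std::atomic overloads +=, so `counter += 1` is a single atomic
// read-modify-write rather than separate load/add/store steps.
int count_with_atomics(int threads, int iters) {
    std::atomic<int> counter{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i)
                counter += 1; // atomic; no lock needed
        });
    for (auto& th : pool) th.join();
    return counter.load();
}
```

With a plain `int` counter, the same test would almost certainly come up short of `threads * iters`.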
I think cooling in a chip vs cooling in space are two orthogonal problems. In a chip, the problem is getting the heat to the heatsink where it can be efficiently dissipated into the much larger heatsink of the surrounding environment. In space, the problem is that the only way to dissipate heat is thermal radiation because you're in a vacuum.
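For a sense of scale, the radiative channel is just the Stefan-Boltzmann law; a rough helper (the emissivity and area you plug in would be design-specific assumptions):

```cpp
#include <cmath>

// Stefan-Boltzmann law: power radiated by a surface into vacuum,
// P = eps * sigma * A * T^4. In space this is the only heat-rejection
// channel, which is why radiator area and temperature dominate designs.
double radiated_watts(double emissivity, double area_m2, double temp_k) {
    constexpr double sigma = 5.670374419e-8; // W / (m^2 K^4)
    return emissivity * sigma * area_m2 * std::pow(temp_k, 4.0);
}
```

The T^4 dependence is the interesting part: running a radiator hotter buys you disproportionately more rejection.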
And why shouldn't they? I would expect the components of GDP to correlate with living standards even if GDP does not measure it as accurately as possible.
It's harder than it first seems. The root problem is that for text like "hallo", correcting to "hello" may be fixing an error or introducing one. In general, the more aggressive your error correction, the more errors you inadvertently introduce. You can try to make a judgement based on context ("hallo, how are you?"), which certainly helps, but it's only a mitigation. Light error correction is common and effective, but you can't push it to a full solution. The only way to fully solve this problem is to look at the entire document at once so you have maximum context available, and this is what non-traditional OCR attempts to do.
Okay, but there are way more common errors that should be easy to fix: "He11o", "Emest Herningway", incorrect diacritics like the other person mentioned, etc.
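Even "light" correction needs an ambiguity rule, though, or fixing "He11o" starts breaking legitimate "hallo"s. A toy sketch (the dictionary and distance budget are made-up parameters) of correcting only when a single dictionary entry is unambiguously closest:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Classic Levenshtein edit distance between two strings.
int edit_distance(const std::string& a, const std::string& b) {
    std::vector<std::vector<int>> d(a.size() + 1,
                                    std::vector<int>(b.size() + 1));
    for (std::size_t i = 0; i <= a.size(); ++i) d[i][0] = static_cast<int>(i);
    for (std::size_t j = 0; j <= b.size(); ++j) d[0][j] = static_cast<int>(j);
    for (std::size_t i = 1; i <= a.size(); ++i)
        for (std::size_t j = 1; j <= b.size(); ++j)
            d[i][j] = std::min({d[i - 1][j] + 1, d[i][j - 1] + 1,
                                d[i - 1][j - 1] + (a[i - 1] != b[j - 1])});
    return d[a.size()][b.size()];
}

// "Light" correction: replace a word only when exactly one dictionary
// entry is closest AND within a small distance budget. Ambiguous or
// far-off words are left alone rather than risking a new error.
std::string correct(const std::string& word,
                    const std::vector<std::string>& dict, int max_dist = 1) {
    std::string best;
    int best_d = max_dist + 1;
    int ties = 0;
    for (const std::string& cand : dict) {
        int d = edit_distance(word, cand);
        if (d < best_d) { best_d = d; best = cand; ties = 1; }
        else if (d == best_d) ++ties;
    }
    return (ties == 1 && best_d <= max_dist) ? best : word;
}
```

The tie case is the "hallo" problem in miniature: with both "hello" and a valid "hallow"-style word at distance 1, the safe move is to do nothing.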