Great article. Can confirm, writing performance focused C# is fun. It's great having the convenience of async, LINQ, and GC for writing non-hot path "control plane" code, then pulling out Vector<T>, Span<T>, and so on for the hot path.
One question, how portable are performance benefits from tweaks to memory alignment? Is this something where going beyond rough heuristics (sequential access = good, order of magnitude cache sizes, etc) requires knowing exactly what platform you're targeting?
Author here. First of all, thanks for the compliment! It’s tough to get myself to write these days, so any motivation is appreciated.
And yes, once all the usual tricks have been exhausted, the next step is looking at the cache/cache line sizes of the exact CPU you’re targeting and dividing the workload into units that fit inside the (lowest level possible) cache, so it’s always hot. And if you’re into this stuff, then you’re probably aware of cache-oblivious algorithms[0] as well :)
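That workload-splitting idea (loop tiling / cache blocking) can be sketched roughly like this; the tile size here is an illustrative guess, not a tuned value for any particular CPU:

```cpp
#include <cstddef>
#include <vector>

// Illustrative sketch of cache blocking (loop tiling): process a large
// matrix in tiles small enough to stay hot in a low-level cache.
// TILE is a placeholder; a real implementation derives it from the
// target's actual cache sizes.
constexpr std::size_t TILE = 64; // 64x64 doubles = 32 KiB, roughly L1-sized

void transpose_blocked(const std::vector<double>& src,
                       std::vector<double>& dst, std::size_t n) {
    for (std::size_t bi = 0; bi < n; bi += TILE)
        for (std::size_t bj = 0; bj < n; bj += TILE)
            // Work on one TILE x TILE block at a time, so both the rows
            // being read and the columns being written fit in cache.
            for (std::size_t i = bi; i < bi + TILE && i < n; ++i)
                for (std::size_t j = bj; j < bj + TILE && j < n; ++j)
                    dst[j * n + i] = src[i * n + j];
}
```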
Personally, I’ve almost never had the need to go too far into platform-specific code (except SIMD, of course); doing all the stuff in the post gets you 99% of the way there.
And yeah, C# is criminally underrated, I might write a post comparing high-perf code in C++ and C# in the future.
>> C# has an awesome situation in here with its support for value types (ref structs), slices (spans), stack allocation, SIMD intrinsics (including AVX512!). You can even go bare-metal and GC-free with bflat.
There's been a really solid effort by the maintainers to improve performance in C#, especially with regard to keeping stuff off the heap. I think it's a fantastic language for doing backends in. It's unfortunate that one of the big language users, Unity, has not yet updated to the modern runtime.
One other trick I use reasonably often is using something more complicated than AoS or SoA layouts. Reasons vary (the false sharing padding in your article is one example), but cache lines are another good one. You might, e.g., want an AoSoA structure to keep the SoA portion of things on a single cache line if you know you'll always need both data elements (the entire struct), want to pack as much data in a cache line as possible, and also want that data to be aligned.
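As an illustration of the AoSoA idea (sketched in C++ rather than C#, with an 8-lane block width chosen purely as an assumption so that two float fields exactly fill a 64-byte line):

```cpp
#include <cstddef>

// Illustrative AoSoA layout: instead of one big SoA, data is grouped
// into fixed-width, cache-line-aligned blocks so the x's and y's a loop
// needs together always land on the same line.
constexpr std::size_t LANES = 8; // 8 floats * 4 B * 2 fields = 64 B

struct alignas(64) ParticleBlock {
    float x[LANES]; // first half of the cache line
    float y[LANES]; // second half, brought in by the same line fill
};

// Accessing logical element i: block i / LANES, lane i % LANES.
inline float sum_xy(const ParticleBlock* blocks, std::size_t i) {
    const ParticleBlock& b = blocks[i / LANES];
    return b.x[i % LANES] + b.y[i % LANES];
}
```

The lane width (and whether you pack two fields or four) is exactly the kind of knob you'd retune per target.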
No. This is far beyond the capabilities of current AI, and will remain so for the foreseeable future. You could let your model of choice churn on this for months, and you will not get anywhere. It will be able to reach a somewhat working solution quickly, but it will soon reach a point where for every issue it fixes, it introduces one or more new ones. LLMs are simply not capable of scaffolding complexity like a human can, and they lack the clarity and rigor of thought required to execute an *extremely* ambitious project like performant CUDA to ROCm translation.
I don't think it really is, especially not if it's turned into a system, with multiple prompts, verification, etc.
Humans struggle with IMO problems, and this kind of kernel translation is easier for humans than that. There's probably also more training data available, and it's a problem where the system can get feedback by simply running the code and measuring memory use, runtime, etc.
It'd have to be a system, and no one has developed it yet, but I think it can be done with present LLMs as the core mechanism. They'd just need to be trained with RL on this specific problem.
Anyone with a good LLM, from Google to Mistral could probably do this, but it'd be a project.
I have a tiny Hetzner VPS (2 vCPUs, 2 GB RAM) in their US West datacenter that costs me $5.59 a month. I get 1 TB a month of free outgoing bandwidth and unlimited incoming bandwidth, plus additional outgoing bandwidth at a rate of $1.20 per TB. I host my personal project's git LFS server there, a file server, and a Caddy instance that proxies over Tailscale to a more powerful box in my apartment. It's a great homelab architecture and I couldn't be happier with it. Thanks Hetzner!
They still have VPSes with 2G of RAM? I'm checking the cloud price page* and you can get CX23 with 2 vCPU, 4G RAM and 20TB of traffic (seems to say that it's 1TB for US) for 3,49€/month (~4 USD).
You can save an additional 0,50€ if you go with IPv6 only.
Maybe it's location dependent, I can't get it to show me prices for US.
Hi there! At the top of the website, you should be able to choose euros or dollars and the VAT rate. When people become customers and create a full customer account, we ask them to choose between euros or dollars, and then, using the customer's location from their account, we automatically apply the correct VAT. So customers see the correct price on the interface where they create their orders, but you may have to set it manually on the website. --Katie
Yep, just checked, 2 GB RAM. That CX23 sounds like a great deal, 20 TB of free outgoing is ridiculous. But I live in the US West and I rarely hit my 1 TB free bandwidth cap anyway, so the added latency isn't worth it.
I've been happily playing Overwatch 2 on Linux for a couple months now. I need gamescope to get it to play nicely with my multiple monitors, and it crashes maybe once a month, but performance is great and I have no major complaints. I'm never going back to Windows, except for work where it isn't optional :(
Ok, maybe I oversold this a little bit. It's running smoothly now; getting it to run smoothly was not easy. I'm on Ubuntu. I spent a few days in a debug loop: run Steam from the terminal to get a log stream, keep an eye on CPU and GPU utilization and temperature, and futz around in the training range or vs AI bots (more "realistic" than the training range). Identify which components of the system aren't performing up to spec. CPU running hot? GPU not being utilized? Steam emitting warning messages? If the hardware all looks good, it's probably a software problem somewhere. Identify, then fix. Rinse and repeat until Linux performance is in the same league as Windows performance.
Things I'd try:
1. Check in game graphics settings
2. Update graphics drivers to the recommended version (may be non-trivial, I had to update my kernel version)
3. Experiment with different proton versions, including proton GE
4. Experiment with different Direct X versions (in game option)
5. Make sure CPU cooler is running
6. Make sure GPU is being used
7. Use gamescope to configure a virtual monitor that exactly matches the capabilities of your physical monitor
I no longer support Blizz, so I can't weigh in specifically, but: have you tried ProtonGE? There are also Proton forks, such as CachyOS's, that support Wayland directly (Wayland support is in WINE, but not Proton yet) - otherwise it might be relayed through XWayland?
Also, try `LD_PRELOAD="" %command%` to disable steaminput, which can cause input stuttering after around 45min on some machines (such as mine).
Python and JavaScript (in the browser), due to their single-threaded nature. C++ too, as long as you have a std::atomic on the left-hand side (since it overloads the operator).
There is NB_INPLACE_ADD... but I'm struggling to find enough details to be truly confident :\ possibly its existence is misleading other people (thus me) to think += is a single operation in bytecode.
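For the C++ case at least, this is easy to demonstrate: std::atomic's overloaded += is a single atomic read-modify-write (fetch_add), so concurrent increments don't lose updates the way a plain int would. A minimal sketch:

```cpp
#include <atomic>
#include <thread>
#include <vector>

// std::atomic overloads +=, so `counter += 1` is a single atomic
// read-modify-write rather than separate load/add/store steps.
int count_with_atomics(int threads, int iters) {
    std::atomic<int> counter{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i)
                counter += 1; // atomic; no lock needed
        });
    for (auto& th : pool) th.join();
    return counter.load();
}
```

With a plain `int` counter, the same test would almost certainly come up short of `threads * iters`.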
I think cooling in a chip vs cooling in space are two orthogonal problems. In a chip, the problem is getting the heat to the heatsink where it can be efficiently dissipated into the much larger heatsink of the surrounding environment. In space, the problem is that the only way to dissipate heat is thermal radiation because you're in a vacuum.
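For a sense of scale, the radiative channel is just the Stefan-Boltzmann law; a rough helper (the emissivity and area you plug in would be design-specific assumptions):

```cpp
#include <cmath>

// Stefan-Boltzmann law: power radiated by a surface into vacuum,
// P = eps * sigma * A * T^4. In space this is the only heat-rejection
// channel, which is why radiator area and temperature dominate designs.
double radiated_watts(double emissivity, double area_m2, double temp_k) {
    constexpr double sigma = 5.670374419e-8; // W / (m^2 K^4)
    return emissivity * sigma * area_m2 * std::pow(temp_k, 4.0);
}
```

The T^4 dependence is the interesting part: running a radiator hotter buys you disproportionately more rejection.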
And why shouldn't they? I would expect the components of GDP to correlate with living standards even if GDP does not measure it as accurately as possible.
It's harder than it first seems. The root problem is that for text like "hallo", correcting to "hello" may be fixing an error or introducing one. In general, the more aggressive your error correction, the more errors you inadvertently introduce. You can try to make a judgement based on context ("hallo, how are you?"), which certainly helps, but it's only a mitigation. Light error correction is common and effective, but you can't push it to a full solution. The only way to fully solve this problem is to look at the entire document at once so you have maximum context available, and this is what non-traditional OCR attempts to do.
Okay, but there are way more common errors that should be easy to fix: "He11o", "Emest Herningway", incorrect diacritics like the other person mentioned, etc.
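Even "light" correction needs an ambiguity rule, though, or fixing "He11o" starts breaking legitimate "hallo"s. A toy sketch (the dictionary and distance budget are made-up parameters) of correcting only when a single dictionary entry is unambiguously closest:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Classic Levenshtein edit distance between two strings.
int edit_distance(const std::string& a, const std::string& b) {
    std::vector<std::vector<int>> d(a.size() + 1,
                                    std::vector<int>(b.size() + 1));
    for (std::size_t i = 0; i <= a.size(); ++i) d[i][0] = static_cast<int>(i);
    for (std::size_t j = 0; j <= b.size(); ++j) d[0][j] = static_cast<int>(j);
    for (std::size_t i = 1; i <= a.size(); ++i)
        for (std::size_t j = 1; j <= b.size(); ++j)
            d[i][j] = std::min({d[i - 1][j] + 1, d[i][j - 1] + 1,
                                d[i - 1][j - 1] + (a[i - 1] != b[j - 1])});
    return d[a.size()][b.size()];
}

// "Light" correction: replace a word only when exactly one dictionary
// entry is closest AND within a small distance budget. Ambiguous or
// far-off words are left alone rather than risking a new error.
std::string correct(const std::string& word,
                    const std::vector<std::string>& dict, int max_dist = 1) {
    std::string best;
    int best_d = max_dist + 1;
    int ties = 0;
    for (const std::string& cand : dict) {
        int d = edit_distance(word, cand);
        if (d < best_d) { best_d = d; best = cand; ties = 1; }
        else if (d == best_d) ++ties;
    }
    return (ties == 1 && best_d <= max_dist) ? best : word;
}
```

The tie case is the "hallo" problem in miniature: with both "hello" and a valid "hallow"-style word at distance 1, the safe move is to do nothing.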