Optimizing code for an MMU-less processor versus an MMU-equipped, or even NUMA-capable, processor is vastly different.
The fact that the author achieves only a 3 to 6 times speedup on a processor running at a clock frequency 857 times higher should have led to the conclusion that old optimization tricks are awfully slow on modern architectures.
To be fair, execution pipeline optimization still works the same, but ignoring the different layers of cache, the way memory management works, and even how and when actual RAM is accessed will only lead to suboptimal code.
Seems like you've got it backwards, and that makes it so much worse. ^_^
I ported from ABAP to Z80. Modern enterprise SAP system → 1976 processor.
The Z80 version is almost as fast as the "enterprise-grade" ABAP original. On my 7MHz ZX Spectrum clone, it's neck-and-neck. On the Agon Light 2, it'll probably win.
Think about that: 45-year-old hardware competing with modern SAP infrastructure on computational tasks.
This isn't "old tricks don't work on new hardware." This is "new software is so bloated that Paleolithic hardware can keep up." (but even this is nonsense - ABAP is not designed for this task =)
That Z80 code is not the equivalent of the modern code though, is it?
For example, your modern code mentions a 64KB lookup table. There's no way you can port this to the Z80, which has 64KB of address space in total, shared between input, output, cache and code.
So what do those timings mean? Are those just made-up numbers for the sake of narrative?
Memory and I/O ports are in separate address spaces on the Z80, but for the use case described in the post ("dot product for 1536-bit vectors") the I/O port space does not matter: it's all memory, and there is just a single address space there.
(Granted, some Z80-based systems had funky paging setups, but the author makes no mention of those; they just say generic Z80, and that means 64KB total for code, input data, cache and output data.)
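To put rough numbers on that memory budget, here is a minimal Python sketch. It assumes the "1536-bit vectors" are binary vectors whose dot product is popcount(a AND b), and that the 64KB table is something like a 16-bit-indexed table of one-byte entries. Neither is confirmed by the post, and all names below are made up for illustration.

```python
# Assumption: "1536-bit vector" means a binary vector packed into bytes.
VECTOR_BITS = 1536
VECTOR_BYTES = VECTOR_BITS // 8  # 192 bytes per vector

# Flat address space of a plain Z80 (no paging).
Z80_ADDRESS_SPACE = 64 * 1024

# Assumption: the "64KB lookup table" holds one byte per 16-bit index.
TABLE16_BYTES = (1 << 16) * 1

# A byte-indexed popcount table needs only 256 bytes.
POPCOUNT8 = bytes(bin(i).count("1") for i in range(256))


def binary_dot(a: bytes, b: bytes) -> int:
    """Dot product of two packed binary vectors: popcount(a AND b)."""
    if len(a) != VECTOR_BYTES or len(b) != VECTOR_BYTES:
        raise ValueError("expected two 192-byte vectors")
    return sum(POPCOUNT8[x & y] for x, y in zip(a, b))


# Bytes left for code, vectors and results next to each table variant.
free_with_table16 = Z80_ADDRESS_SPACE - TABLE16_BYTES
free_with_table8 = Z80_ADDRESS_SPACE - len(POPCOUNT8)
```

Under those assumptions the 16-bit table alone fills the entire address space, while the 256-byte variant leaves almost all of it free, so a Z80 port would have to use a different table layout than the modern code.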
Oh, that makes a lot more sense! I was puzzled as to how the new hardware could be so slow, but an inefficient interpreter easily explains it. I've seen over 1000× slowdowns from assembly to bash, so it sounds like ABAP is close to bash.