
I'm not super familiar with ARM / ARM64 assembly and was confused as to how x0 was incremented. Was going to ask here, but decided to not be lazy and just look it up.

  const float f = *data++;


  ldr s1, [x0], #4

Turns out this instruction uses post-index addressing: it loads from the address in x0, then increments x0 by 4, all in one instruction. It looks like the offset can be negative too, so you could iterate over something in reverse.

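Kind of cool. I don't think x86_64 has a general-purpose instruction that can load and increment in one go. The closest I know of are the string instructions such as lodsd, which load through rsi and advance it, but only into the accumulator register.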
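
For anyone who wants to poke at this in a compiler explorer, here is a minimal C sketch of the two access patterns. The function names sum_forward and sum_reverse are made up for illustration and are not from the original code. Whether the compiler actually emits a post-indexed ldr for the forward loop, or a pre-indexed load with a negative offset such as ldr s1, [x0, #-4]! for the reverse loop, depends on the compiler and optimization level, so treat the comments as what may happen rather than what will.

```c
#include <assert.h>
#include <stddef.h>

/* Forward walk: `*data++` reads the float, then advances the pointer by
   sizeof(float) bytes. On AArch64 a compiler may turn this into a
   post-indexed load like `ldr s1, [x0], #4` (not guaranteed). */
float sum_forward(const float *data, size_t n) {
    assert(data != NULL);
    float total = 0.0f;
    while (n--) {
        const float f = *data++;
        total += f;
    }
    return total;
}

/* Reverse walk: start one past the end, step back, then read.
   `*--p` decrements first, so the pointer never goes below `data`.
   This shape may map to a pre-indexed load with a negative offset. */
float sum_reverse(const float *data, size_t n) {
    assert(data != NULL);
    float total = 0.0f;
    const float *p = data + n; /* one past the last element */
    while (p != data) {
        total += *--p;
    }
    return total;
}
```

Calling either function on an array holding 1, 2, 3, 4 gives 10. Building with optimizations for an AArch64 target and reading the assembly should show which addressing mode the compiler picked.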


Excellent Teardown by "Mysticial" from mersenneforum.org.

Cliffnotes:

* Zen4 AVX512 is mostly double-pumped: native 256-bit hardware processes each 512-bit register in two halves.

* No throttling observed

* 512-bit shuffle pipeline (!!). A powerful exception to the "double-pumping" found in most other AVX512 instructions.

* AMD seemingly handles the AVX512 mask registers better than Intel.

* Gather/scatter is slow on AMD's Zen4 implementation.

* Intel's native 512-bit load/store unit has clear advantages over AMD's 256-bit load/store unit when reading from or writing to L1 cache and beyond.

