
This may add a bit of colour to a technical debate going on right now in the RISC-V Profiles working group. Qualcomm have proposed dropping the C (16-bit compressed instructions) extension from the RVA23 profile (effectively the set of things to support if you want a 'standard' high performance RISC-V core). They have two main reasons:

1. The variable-length instructions (currently 16-bit or 32-bit, with 48-bit on the horizon) complicate instruction fetch and decode, which is a particular problem for high-performance RISC-V implementations.

2. The C extension uses 75% of the 32-bit opcode space, which could be put to better use.

They're saying the benefits from the C extension don't outweigh the costs. They're also saying that if you move forward with the C extension in RVA23 now there's no real backing out of it: as the software ecosystem develops, removing it once it's baked in just won't be possible, whereas adding it back in later is more feasible.

SiFive strongly disagree. They believe the C extension is worth the cost and that it doesn't prevent you from building high-performance cores. They also say that there are lots of implementations with C already, so backing out of it now disadvantages those implementations.

It could end up being the first major fragmentation in the ecosystem: Qualcomm go one way and SiFive the other (other companies also sit on one side or the other of this debate, but Qualcomm and SiFive are driving it). Indeed, the latest proposal from Krste is to do just that with a new 'RVH23' profile.

I wonder how much this development has been driving SiFive's thinking here? Clearly they are under pressure to deliver to their investors so you can see why they want to keep things as they are rather than consider a big change. Good for SiFive, but good for the long-term RISC-V ecosystem?

Edit: If you want the details check out the publicly readable tech-profiles list: https://lists.riscv.org/g/tech-profiles/messages they've got recordings of the last two meetings that discussed the issue and presentations, all available via that message archive.



There's a bit more context rumbling under the surface.

Not too long ago, Qualcomm bought NUVIA, a designer of high performance arm64 cores that can theoretically compete with Apple cores on perf. Arm pretty much immediately sued saying that the specifics of the licenses that Qualcomm and NUVIA have mean that cores developed under NUVIA's license can't be transferred to Qualcomm's license.[0] Qualcomm obviously disagrees. Whatever happens those cores as they exist today are going to be stuck in litigation for longer than they're relevant.

Qualcomm's proposal smells strongly like they're doing the minimum to strap a RISC-V decoder to the front of these cores. For whatever reason they seem hell-bent on only changing the part of the front end that's the 'pure function that converts bit patterns of ops to bit patterns of micro-ops'. Arm64 has only 32-bit aligned instructions, so they don't want to support anything else.

At the end of the day, the C extension really isn't that bad to support in a high-perf core if you go in wanting to support it. The canonical design (not just for RISC-V but for high-end designs like Intel and AMD too) is to have I$ lines fill into a shift register, have some hardware at whatever period your alignment boundary is that reports 'if an instruction started here, how long is it', and a second stage (logically; it doesn't have to be an actual clock stage) that looks at all of those reports, generates the instruction boundaries, and feeds them into the decoders. At this point everything is also marked for validity (i.e. did an I$ line not come in because of a TLB permissions failure or something).
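A rough software model of that scheme (my own sketch, not anyone's RTL; it ignores 48-bit and longer encodings and the validity marking):

    #include <stdint.h>
    #include <stddef.h>

    /* RISC-V length rule: a 16-bit parcel whose low two bits are 11 starts a
       32-bit instruction; anything else is a 16-bit (compressed) instruction. */
    static int insn_len_bytes(uint16_t parcel) {
        return (parcel & 0x3) == 0x3 ? 4 : 2;
    }

    /* Stage 1: for every parcel in the fetch window, speculatively report
       "if an instruction started here, how long would it be".
       Stage 2: walk the reports to mark the real boundaries fed to the decoders.
       (A real fetch unit carries a trailing half-instruction into the next cycle.) */
    size_t mark_boundaries(const uint16_t *parcels, size_t nparcels,
                           size_t *start_offsets /* byte offsets out */) {
        int len_if_start[64];                  /* one report per parcel (window <= 64) */
        if (nparcels > 64) nparcels = 64;
        for (size_t j = 0; j < nparcels; j++)  /* stage 1: per-parcel length reports  */
            len_if_start[j] = insn_len_bytes(parcels[j]);

        size_t n = 0, i = 0;
        while (i < nparcels) {                 /* stage 2: pick the real boundaries   */
            start_offsets[n++] = i * 2;
            i += len_if_start[i] / 2;
        }
        return n;
    }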

[0] - https://www.reuters.com/legal/chips-tech-firm-arm-sues-qualc...


> Qualcomm's proposal smells strongly like they're doing the minimum to strap a RISC-V decoder to the front of these cores.

Hmm.. At about the same time as the proposal to drop C from RVA, Qualcomm also proposed an instruction-set extension [1] that smells very much of ARM's ISA (at least to my nose). It also has several issues to criticise, IMHO.

[1] https://lists.riscv.org/g/tech-profiles/attachment/332/0/cod...


The 32-bit aligned instruction assumption is probably baked into their low-level caches, branch predictors etc. That might mean much more significant work for switching to 16-bit instructions than they are willing to do.


I don't think anyone has baked instruction alignment into their caches since the early 2000s, and adding an extra bit to the branch predictors isn't that big of a deal. It's got to be the first or second stage of their front end right before the decoders.


Why not bake instruction alignment into the cache? When you can assume instructions will always be 32bit aligned, then you can simplify the icache read port and simplify the data path from the read port to the instruction decoder. Seems like it would be an oversight to not optimise for that.

Though I suspect that's an easy problem to fix. The more pressing issue is what happens after the decoders. I understand this is a very wide design, decoding say 10 instructions per cycle.

There might be a single 16-bit instruction in the middle of that 40-byte block, changing the alignment halfway through. To keep the same throughput, Qualcomm now need 20 decoders, one attempting to decode on every 16-bit boundary. The extra decoders waste power and die space.

Even worse, they somehow need to collect the first 10 valid instructions from those 20 decoders. I really doubt they have enough slack to do that inside the decode stage, or the next stage, so Qualcomm might find themselves adding an entire extra pipeline stage (probably before decode, so they can have 20 simpler length decoders feeding into 10 full decoders on the next stage) just to deal with possible misaligned instructions.

I don't know how flexible their design is, it's quite possible adding an entire extra pipeline stage is a big deal. Much bigger than just rewriting the instruction decoders to 32bit RISC-V.


Because RISC-V was designed to be trivial to length-decode, you simply need to look at the bottom two bits of each 16-bit parcel to tell if it's a 32-bit or 16-bit instruction. At that point, spending the extra I$ budget isn't worth it. Those 20 'simple decoders' are each literally just a single 2-input gate. Adding complexity to the I$ hasn't even made sense for x86 in two decades, because of the extra area needed for the I$ versus the extra decode logic. And that's a place where this extra decode legitimately is an extra pipeline stage.
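In C terms, each of those per-parcel length checks is roughly this (my illustration, not anyone's RTL; it ignores the reserved longer encodings):

    /* is this 16-bit parcel the start of a 32-bit instruction?
       low two bits == 11 means 32-bit; anything else is compressed. */
    uint16_t parcel = fetch_halfword();          /* hypothetical helper */
    int is_32bit = (parcel & 0x3) == 0x3;        /* i.e. an AND of two bits */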

> I don't know how flexible their design is, it's quite possible adding an entire extra pipeline stage is a big deal. Much bigger than just rewriting the instruction decoders to 32bit RISC-V.

I'm sure it is legitimately simpler for them. I'm not sure we should bend over backwards and bring down the rest of the industry because they don't want to do it. Veyron and Tenstorrent were showing off high-perf designs with RV-C.


It doesn't matter how optimised the length decoding is. Not doing it is still faster.

For an 8-wide or 10-wide design, the propagation delays are getting too long to do it all in a single cycle. So you need the extra pipeline stage. The longer pipeline translates to more cycles wasted on branch mispredicts.

RISC-V code is only about 6-14% denser than AArch64 [1]; I'm really not sure the extra complexity is worth it. Especially since AArch64 still ends up with a lower instruction count, so it will be faster whenever you are decode limited instead of icache limited.

> Adding complexity to the I$ hasn't even made sense for x86 in two decades

Hang on. Limiting the Icache to only 32bit aligned access actually simplifies it.

And since the NUVIA core was originally an aarch64 core, why wouldn't they optimise for hardcoded 32bit alignment and get a slightly smaller Icache?

[1] https://www.bitsnbites.eu/cisc-vs-risc-code-density/


> Hang on. Limiting the Icache to only 32bit aligned access actually simplifies it.

Even x86 only reads 16- or 32-byte aligned fields out of the I$, then shifts them. There's no extra I$ complexity. You still have to do that shift at some point, in case you jump to an address that isn't 32-byte aligned. You also ideally don't want to hit peak decode bandwidth only when starting on aligned 32-byte program counters, so that whole shift register thing is pretty much a requirement. And that's where most of the propagation delays are.

> RISC-V code is only about 6-14% denser than Aarch64 [1], I'm really not sure the extra complexity is worth it. Especially since Aarch64 still ends up with a lower instruction count, so it will be faster whenever you are decode limited instead of icache limited.

There's heavy use of fusion, and fwiw the M1 also heavily fuses into micro-ops (and I'm sure the AArch64 morph of NUVIA's cores does too).


Under classic RISC architectures you can't jump to non-aligned addresses. That lets you specify jumps that are 4 times longer for the same number of bits in your jump instruction. Here's MIPS as an example:

https://en.wikibooks.org/wiki/MIPS_Assembly/Instruction_Form...


Classic RISC was targeting about 20k gates and isn't really applicable here.


AArch64 does the same thing.

https://valsamaras.medium.com/arm-64-assembly-series-branch-...

And it's not only a way of decreasing code size. It helps with security too. If you can have an innocuous-looking bit of binary starting at address X that turns into a piece of malware if you jump to X+1, that's a serious problem.

https://mainisusuallyafunction.blogspot.com/2012/11/attackin...

RISC-V, I'm pretty sure, enforces 16-bit alignment and is self-synchronizing, so it doesn't suffer from this despite being variable length. But if it allowed the PC to be pointed at an instruction with a 1-byte offset then it might be vulnerable.

As far as I'm aware every RISC ISA that's had any commercial success does this: HP's PA-RISC, SPARC, POWER, MIPS, Arm, RISC-V, etc.


> And it's not only a way of decreasing code size.

And RISC-V has better code density than AArch64.

> It helps with security too. If you can have an innocuous-looking bit of binary starting at address X that turns into a piece of malware if you jump to X+1, that's a serious problem.

JIT spraying attacks work just fine on aligned architectures too, hence why Linux hardened the AArch64 BPF JIT as well: https://linux-kernel.vger.kernel.narkive.com/M0Qk08uz/patch-...

Additionally, MIPS these days has a compressed extension to their ISA too, heavily inspired by RV-C. https://mips.com/products/architectures/nanomips/


Not all JIT spraying relies on byte offsets to get past JIT filters, the attack I gave is just an example.

And NanoMips requires instructions to be word aligned just like everybody else, it's just that it requires 16 bit alignment rather than 32. Attempting to access an odd PC address will result in an access error according to this:

https://s3-eu-west-1.amazonaws.com/downloads-mips/I7200/I720...


> And NanoMips requires instructions to be word aligned just like everybody else, it's just that it requires 16 bit alignment rather than 32. Attempting to access an odd PC address will result in an access error according to this:

That's the same as RV-C.


Right, and I mentioned RISC-V as yet another sane RISC architecture that requires word alignment in instruction access. But the fact that it requires alignment means that the word size has implications for the instruction cache design and the complexity of the piping there.

I don't have a strong opinion on whether the C extension is a net good or bad for high performance designs, but I do strongly believe that it comes with costs as well as benefits.


Back in 2019, RISC-V was 15-20% smaller than x86 (up to 85% smaller in some cases) and was 20-30% smaller than ARM64 (up to 50% smaller in some cases).

https://project-archive.inf.ed.ac.uk/ug4/20191424/ug4_proj.p...

Since then, RISC-V has added a bunch more instructions that ARM/x86 already had which has made RISC-V even smaller relative to them.


No idea if this is true for Qualcomm, but people from Rivos have also been in that meeting arguing against the C extension and as far as I know Rivos have no in-house Arm cores they are trying to reuse.


Rivos was formed from a bunch of ex-Apple CPU engineers. I'm sure they would feel more comfortable with a closer to AArch64 derived design as well.


They might also know a bunch of techniques to give high performance that only work if you've got nice 32-bit only aligned instructions!


Haha that'd be a little counterintuitive given all 32-bit aligned is the trivial case for decoding variable length instructions, unless you're thinking about prefetching/branch prediction etc


> all 32-bit aligned is the trivial case for decoding variable length instructions

That's the point? You can go faster if everything is 32-bit aligned, i.e. you don't have variable length instructions.


The shift register design sounds quite expensive. You're essentially constructing <issue width> crossbars of 32 times <comparator width>, connected to a bunch of comparators, to determine instruction boundaries. In a wide design you also need to do this across multiple 32-bit lines.


Well, half that because the instructions are 16 bit aligned. And approaching half of even that because not every decoder needs access to every offset. Decoder zero doesn't need any. Decoder one only needs two, etc.

But you need most of that anyway because you need to handle program counters that aren't 32 byte aligned, so you need to either do it before hitting the decoders, or afterwards when you're throwing the micro-ops into the issue queues (which are probably much wider and therefore more expensive).


I think an example is something like opcodes crossing I-cache lines, re: fetch and decode complication; instructions are 16-bit aligned when C is present, so you can have a 32-bit instruction cross cache lines easily. At minimum it will definitely require a bunch of extra verification to handle those cases, and that's often the longest part of the whole development process anyway, so I see the reasoning for not wanting it. It doesn't matter how high performance something is or can be, if you can't prove it works to some tolerance level.

I know there's the big discussion about macro-op fusion. But in hindsight, I think a big motivator for C -- implicit or not -- was the fact that on the very low-end microcontroller or in the softcore (FPGA) world, you typically have disproportionately low amounts of SRAM available versus compute fabric. Those were the initial deployment targets (and initial successful deployments!) for RISC-V, since you need tons of extra features for "Application Class" designs. These cores often have a short pipeline and are completely in-order, so their cost and verification effort are much lower. These are (very likely) not going to implement macro fusion, at least on the medium-low end. So, increasing the effective size of the I-cache through smaller opcodes is often a straight win to increase IPC. On the other hand, Application Class designs today are typically OoO, so they achieve high IPC while still hiding miss latencies pretty effectively; smaller instructions are still good but the benefits they provide aren't as prominent. And it does use a ridiculous amount of opcode space, yes.

I wonder if they would have just been better off copying one of ARM's design principles from the very start: actual design families akin to the -M, -R, and -A series of ARM processors, created for different actual design spaces. These could actually be allowed to have (potentially large!) incompatibilities between them while still sharing a lot of the base instruction set; privileged extensions like PMP could probably exist among all of them. I'd be happy to have an "Application Class" "-A series" RISC-V processor that could run Linux but didn't have compressed instructions or whatever; likewise I would probably not miss e.g. Hypervisor extensions on a microcontroller.

EDIT: Clipped an incorrect bit about ABI compatibility with the C extension. I was misremembering some details about a specific implementation!


> I think an example is something like opcodes crossing I-cache lines

Consider also a 16-bit aligned 32-bit instruction crossing a page boundary, potentially with different access permissions. This type of bug allowed userspace applications to hang early Cortex-A8 based phones (ARM errata 657417).


See also: Intel SKX102, whose documentation has nice diagrams.


> Another big issue for Application Class systems IIRC -- unrelated to all this -- is that I don't think hardware implementing C can actually run binaries compiled without it.

I believe this is incorrect? I believe RV{32,64}-with-C is simply a superset of RV{32,64}-without-C. Now I have only implemented RV32I, so I'm not that familiar with the C extension or other extensions for that matter, but in my digging through the various RV specs, I haven't found anything which suggests that implementing C requires breaking code compiled without the use of C.

Do you have any details?


Nope, I was wrong! I was curious about a reference and went digging, and I was misremembering the details of a particular Linux-class system that didn't implement C (Shakti), so they needed an entirely new set of binary packages from distros to support them.

Wish I could use strike-outs here, but oh well.


F̶Y̶I̶ ̶y̶o̶u̶ ̶c̶a̶n̶ ̶s̶t̶r̶i̶k̶e̶-̶o̶u̶t̶ ̶w̶i̶t̶h̶ ̶U̶n̶i̶c̶o̶d̶e̶ (e.g. via https://yaytext.com/strike )


It looks atrocious though.


which is strange tbh. it should not

i̵t̵ ̵t̵u̵r̵n̵s̵ ̵o̵u̵t̵ ̶t̶h̶e̶r̶e̶ ̶a̶r̶e̶ ̶d̶i̶f̶f̶e̶r̶e̶n̶t̶ ̴s̴t̴r̴i̴c̴t̴ ̴t̴h̴r̴o̴u̴g̴h̴ ̴s̴t̴y̴l̴e̴s̴

and the font HN uses doesn't handle any of them well


No harm in emailing dang asking for a feature request. hn@ycombinator.com

Perhaps if enough people asked for it then the <s> markdown equivalent (~~ in some apps) could be added.


The FPGA world rarely cares about RAM limits on MCU cores unless they are trying to make something that benchmarks really well. When they have to be tiny, most of those MCUs run fairly short programs out of a single block RAM (which is a few kB), and actually care more about the LUT count than the BRAM count, making the C instructions actively bad.

(This mostly applies to the high end, but designers on low-end devices may be more SRAM-constrained)


Good points; though I admit, I was thinking more Linux-capable/"Linux-class" cores in the mid-range FPGAs, which is where I spend a bunch of my time. For that, the paltry cache sizes due to limited amounts of BRAM are more noticeable (especially when other features may compete for them), so minimizing icache footprint as much as possible is a win. Admittedly that's probably a marginal case versus the tiny softcores you mention; workloads that need high performance cores will practically be better off running on attached ARM processors or whatever.

Anyway, all that aside, I personally wouldn't be sad to see the C extensions go away. I'm actively designing a RISC-V core for a small game console, and probably won't implement them, and will spend the time on more DV instead. I don't expect them to have any meaningful benefits for my case and mostly increase frontend complexity.


I have made a few RISC-V cores for FPGAs personally, and I would say that the C extensions are 100% not worth your time to implement unless you need absolutely minimum RAM size.


https://riscv.org/wp-content/uploads/2019/06/riscv-spec.pdf

> The C extension is compatible with all other standard instruction extensions. The C extension allows 16-bit instructions to be freely intermixed with 32-bit instructions, with the latter now able to start on any 16-bit boundary [...]


As a chip designer who has made a few RISC-V cores (including one open-source one that nobody uses), I personally hate the C instructions, and I am on Qualcomm's side here. There are just too many of them, and they really muck up instruction decoding without providing large benefits for anything but the smallest MCUs.

Maybe I should weigh in on this issue in the official channels.


Consider the Linux kernel code getting 50% larger when you move from compressed to uncompressed instructions [0]. That would put RISC-V among the least code-dense ISAs out there and would make it unsuitable for most applications.

[0] https://people.eecs.berkeley.edu/~krste/papers/EECS-2016-1.p...


At the scale of Linux, you don't care about code size very much, but you care a lot about working set size. 2-20% seems to be the range of working set size reductions you see in the literature, and if you compensate with other instructions, you can get back a lot of that code size.

The analysis from the SiFive folks generally doesn't include that compensation factor: it just involves a straight find-and-replace in the binary.


Your assertion has two major issues.

First, you context switch a lot to and from the Linux kernel, so decreased cache pressure does matter.

Second, if you have proof that loops predominantly consist of 32-bit instructions, prove your case. To my mind, a loop is likely to use fewer registers and likely to have shorter branches and smaller immediate values. All of these seem to mean compressed instructions actually favor working-set code even MORE than general code.


I think we agree about working set size - that's what actually matters for performance rather than overall code size. Krste from SiFive was relatively insistent on their recent call - without any proof (citing mysterious customer calls) - that people care about code size of the Linux kernel, not working set size. The performance gain he suggested that came from C instructions due to working set size in the Linux kernel is 3%. This is the performance argument coming from the biggest proponent of the C instructions.

As to what you suggested, I have actually started putting something together to possibly send to the RISC-V foundation from my own experience implementing RISC-V designs, but pretty much nobody is asserting that loops are predominantly 32-bit instructions. Tight loops are often already sitting in a uop cache once you get to a core of reasonable size, so compressed vs uncompressed is completely irrelevant. Contrary to what you seem to be hoping for, correct arguments about working set size and performance are very subtle.

The C instructions aren't free in frequency terms, either. You have significant complexity increases in decoders and cache hierarchies to support them. Making that cost add up to 3% is not that hard.


I keep coming back to this table and wondering why the conclusion was that 16-bit instructions were necessary and not more complex 32-bit instructions were necessary. For example, ARMv8's LDR and LDP instructions are amazing and turn what is often 3-4+ 32-bit RISC-V instructions into just one 32-bit instruction. Making C optional and building a new "code size reduction" extension that is more suitable for large application processors (which can reasonably be assumed to use uops) would help so much more.


> Making C optional and building a new "code size reduction" extension that is more suitable for large application processors (which can reasonably be assumed to use uops) would help so much more.

Andrew at SiFive disagrees vehemently[0].

0. https://lists.riscv.org/g/tech-profiles/message/391


Andrew seems to mostly disagree with Qualcomm in general and is rejecting the idea not so much out of technical merit but because he doesn't believe them. Qualcomm is definitely not alone with its dislike of the C extension and trying to wholesale dismiss criticisms of it because they're coming from Qualcomm is not appropriate.


SiFive made a technical case [0] for keeping C, too.

>Qualcomm is definitely not alone

It's also worth noting that they tried to appropriate Rivos's opinion, only to be called out[1].

0. https://lists.riscv.org/g/tech-profiles/topic/slides_on_reta...

1. https://lists.riscv.org/g/tech-profiles/message/396


Please do. And SOON -- if the current task group doesn't resolve this issue in the next couple of weeks, it will likely be passed out of that group for decision at a higher level, with less opportunity for public input.


You can join the Profiles meeting. It is every Thursday. Technically you must be an RVI member, but there is free membership for individuals.


> The variable length instructions (currently 16 bit or 32 bit but 48 bit on the horizon) complicate instruction fetch and decode and in particular this is a problem for high performance RISC-V implementations.

I want to see variable length instructions, but with a requirement for instruction alignment.

I.e. every aligned 64-bit word of RAM contains one of these:

[64 bit instruction]

[32 bit instruction][32 bit instruction]

[16 bit instruction][16 bit instruction][32 bit instruction]

[32 bit instruction][16 bit instruction][16 bit instruction]

[16 bit instruction][16 bit instruction][16 bit instruction][16 bit instruction]

That should make decode far simpler, but puts a little more pressure on compilers (instructions will frequently need to be reordered to align, but a review of compiler-generated code suggests that frequently isn't an issue).


As the other reply states, that is effectively the Qualcomm proposal, though note the 16-bit instructions likely gobble up a large amount of your 32-bit instruction space. You have to have something to identify an instruction as 16-bit, which takes up 32-bit encoding space. The larger you make that identification (in terms of bits) the less encoding space it takes up, but then the fewer spare bits you have to actually encode your 16-bit instruction. RISC-V uses the bottom two bits for this purpose: one value (11) indicates a 32-bit instruction; the others are used for 16-bit instructions. So you're dedicating 75% of your 32-bit encoding space to 16-bit instructions.


By requiring alignment, you can halve the size of the identifier, or better.

Since if you have a 16-bit instruction, you know that it must be followed by another 16-bit instruction. Therefore, that 2nd instruction doesn't need the identifying bits. Or, more precisely, within a 32-bit slot the 2^32 possible encodings need to be divided, and one way to do that is 2^31 + 2^30 possible 32-bit instructions and 2^15 * 2^15 pairs of 16-bit instructions. Now the 16-bit instructions are only taking 25%, not 75%, of the encoding space.
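Spelled out (my arithmetic, one possible way to split the 32-bit slot):

    prefix 0   -> one 31-bit instruction         (2^31 encodings, 50% of the slot)
    prefix 10  -> one 30-bit instruction         (2^30 encodings, 25%)
    prefix 11  -> a pair of 15-bit instructions  (2^15 x 2^15 = 2^30 pairs, 25%)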


But now you have two kinds of 16-bit instructions, the ones for the leading position and the ones for the trailing position, and the latter ones have slightly more available functionality, right? Personally, at this point I'd think the decoder must already be complicated enough (it has either to maintain "leading/trailing/full" state between the cycles, or to decode 8/16-byte long batches at once) that you could simply give up and go for an encoding with completely irregular lengths à la x86 without much additional cost.


Not necessarily.

    000x -- 64-bit instruction that uses 60 bits
    001x -- reserved
    010x -- reserved
    011x -- reserved
    100x -- two 32-bit instructions (each 30-bits)
    101x -- two 16-bit instructions then one 32-bit instruction
    110x -- one 32-bit instruction then two 16-bit instructions
    111x -- four 16-bit instructions (each 15 bits)
    xxx1 -- explicitly parallel
    xxx0 -- not explicitly parallel
Alternatively, you view them as VLIW instruction sets. This has the additional potential advantage of some explicitly parallel instructions when convenient.


One advantage of just sticking with only 32bit instructions is that nobody needs to write packet-aware instruction scheduling.

Even with decent instruction scheduling, you are still going to end up with a bunch of instruction slots filled with nops.

And it will be even worse if you take the next step to make it VLIW and require static scheduling within a packet.


In this case, it's probably not that bad as with actual VLIW: if you see that e.g. your second 16-bit instruction has to be a NOP, you just use a single 32-bit instruction instead; similarly for 32- and 64-bit mixes.


The packet would be external and always fit in a cache line. You'd specify the exact instruction using 16-bit positioning. The fetcher would fetch the enclosing 64-bit group, decode, then jump to the proper location in that group.

In the absolute worst-case scenario where you are blindly jumping to the 16-bit instruction in the 4th position, you only fetch 2-3 unnecessary instructions. Decoders do get a lot more interesting on the performance end as each one will decode between 1 and 4 instructions, but this gets offset by the realization that 64-bit instructions will be used by things like SIMD/vector where you already execute fewer instructions overall.

The move to 64-bit groups also means you can increase cache size without blowing out your latency.

VLIW doesn't mean strictly static scheduling. Even Itanic was just decoding into a traditional backend by the time it retired. You would view it more as optional parallelism hints when marked.

I'd also note that it matches up with VLIW rather well. 64-bit instructions will tend to be SIMD instructions or very long jumps. Both of these are fine without VLIW.

Two 32-bit instructions make it a lot easier to find parallelism and they have lots of room to mark exactly when they are VLIW and when they are not. One 32-bit with two 16-bit still gives the 32-bit room to mark if it's VLIW, so you can turn it off on the worst cases.

The only point where it potentially becomes hard is four 16-bit instructions, but you can either lose a bit of density switching to the 32+16+16 format to not be parallel or you can use all 4 together and make sure they're parallel (or add another marker bit, but that seems like its own problem).


I think if you have 64bit packets, you might as well align jump targets to the 64bit boundary.

I'd rather have an extra nop or two before jump targets than blindly throw away 1-3 instructions' worth of decode bandwidth on jumps (which are often hot).


If you're fetching 128-bit cache lines, you're already "wasting" cache. Further, decoding 1-3 NOP instructions isn't much different from decoding 1-3 extra instructions except that it adversely affects total code density.

If you don't want to decode the extra instructions, you don't have to. If the last 2 bits of the jump are zero, you need the whole instruction block. If the last bit is zero, jump to the 35th bit and begin decoding while looking at the first nibble to see if it's a single 32-bit instruction or two 16-bit instructions. And finally, if it ends with a 1, it's the last instruction and must be the last 15 bits.

All that said, if you're using a uop cache and aligning it with I-cache, you're already going to just decode all the things and move on knowing that there's a decent chance you jump back to them later anyway.


But if you don't have a uop cache (which is quite feasible with a RISC-V or AArch64 style ISA), then decode bandwidth is much more important than a few NOPs in icache.

Presumably your high performance core has at least three of these 64bit wide decoders, for a frontend that takes a 64bit aligned 192bit block every cycle and decodes three 64bit instructions, six 32bit instructions, twelve 16bit instructions, or some combination of all sizes every cycle.

If you implement unaligned jump targets, then the decoders still need to fetch 64-bit aligned blocks to get the length bits. For every unaligned jump, that's up to a third of your instruction decode slots sitting idle for the first cycle. This might mean the difference between executing a tight loop in one cycle or two.

A similar thing applies to a low gate count version of the core, a design where your instruction decoder targets one 32-bit or 16-bit instruction per cycle (and a 64-bit instruction every second cycle). On unaligned jumps, such a decoder still needs to load the first 32 bits of the instruction to check the length encoding, and wastes an entire cycle on every single branch.

Allowing unaligned jump targets might keep a few NOPs out of icache (depending on how good the instruction scheduler is), but it costs you cycles in tight branchy code.

Knowing compiler authors, if you have this style of ISA, even if it does support unaligned jump targets, they are still going to default to inserting NOPs to align every single jump target, just because the performance is notably better on aligned jump targets and they have no idea if this branch target is hot or cold.

So my argument is that you might as well enforce jump target alignment of 64 bits anyway. Let all implementations gain the small wins from assuming that all targets are 64-bit aligned, and use the 2 extra bits to make your relative jump instructions have four times as much range.


Which is easier to decode?

[jmp nop nop], [addi xxx]

OR

[xxx jmp], [nop nop addi]

OR

[xxx jmp], [unused, addi]

All of these tie up your entire decoder, but some tie it up with potentially useful information. That seems superior to me.


It's only unconditional jumps that might have NOPs following them.

For conditional jumps (which are pretty common), the extra instructions in the packet will be executed whenever the branch isn't taken.

And instruction scheduling can actually do some optimisation here. If you have a loop with an unconditional jump at the end and an unaligned target, you can do partial loop unrolling, for example:

With [xxx, inst_1, inst_2], [inst3]...(loop body) ...[jmp to inst_1, nop, nop], you can repack the final jump packet as [inst_1, inst_2, jump to inst_3]

This partial loop unrolling actually is much better for performance than not wasting I-cache, as it reduces the number of instruction decoder packets per iteration by one. Compilers will implement this anyway, even if you do support mid-packet jump targets.

Finally, compilers already tend to put nops after jumps and returns on current ISAs, because they want certain jump targets (function entry points, jump table entries) to be aligned to cache lines.


Don't forget the possibility of 5x 12-bit instructions. In particular, if you have only one or two possibilities for destination registers for each of the 5 positions (so an accumulator-like model), you could still have a quite useful set of 12-bit instructions.


No, the idea was to say "prefix 0"=>31bit, "prefix 10"=>30bit, "prefix 11"=>2*15bit. If you need you can split the two bits to have the two 15 bit chunks aligned identically.


Once you're moving to alignment within a larger fixed width block you don't even need to stick to byte boundaries. I've got a toy ISA I've played around with that breaks 64 bit chunks into a 62 bit instruction, a 40 and a 21 bit instruction, or 3 21 bit instructions.


I think Itanium did something like that


Yup, 128 bit bundles with 3 instructions each, and ways to indicate that different bundles could execute in parallel.


The CDC 6600 (1963) had 60-bit words and both 15-bit and 30-bit instructions, with the restriction that 30-bit instructions couldn't straddle a word boundary. The COMPASS assembler would sometimes have to insert a 15-bit no-op instruction to "force upper". Careful optimizing programmers and compilers would try to minimize "forcing upper" to avoid wasting precious space in the 7-word instruction "stack".

So it's been done, and is not a big deal.


was it called assembler though? I was looking and https://web.archive.org/web/20120910064824/http://www.bitsav... does not say so, but http://www.bitsavers.org/pdf/cdc/cyber/cyber_70/60225100_Ext... does refer to "COMPASS assembly language". Interesting.


This is basically what Qualcomm proposes: 32-bit instructions and 64-bit aligned 64-bit instructions.

I don't think we have real data on it, but I suspect that the negative impact of this would affect 16/32/48/64 way more than just 32/64.


I would like to also have aligned 16 bit instructions. And maybe even aligned 8 bit instructions for very common things like "decrement register 0" or "test if register 0 is greater than or equal to zero". "Jump back by 10 instructions", etc. Those instructions get widely used in tight loops, so might as well be smaller.


8-bit instructions are a really bad idea. You only get a tiny number of them, and they significantly increase decode complexity (and massively reduce the number of larger instructions available).


Others have pointed out adding bits to identify instruction types eats into your instruction length, so let's go stupid big time: what if you had the instructions as described here, without any instruction length being part of the instruction, but have that stored separately? (3 bits would be plenty per word) You might put it 1. as a contiguous bit string somewhere, or you might 2. put it at the bottom of the cache line that holds the instructions (the cache line being 512 bits I assume).

Okay, for 1. you'd have to do two fetches to get stuff into the I-cache (but not if it's part of the same cache line, option 2.) and of course you're going to reduce instruction density because you're using up cache, but there's nothing you can do about that, but at least it would allow n-bit instructions to be genuinely n-bits long which is a big advantage.
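To make option 2 concrete, it might look something like this (purely illustrative; the field sizes are my assumption):

    #include <stdint.h>

    /* hypothetical predecoded I-cache line: 512 bits of instruction parcels
       plus a few layout bits per 64-bit word, stored in the same line */
    struct icache_line {
        uint64_t insn[8];   /* 8 x 64-bit words of instruction bytes        */
        uint32_t layout;    /* 8 x 3-bit codes (24 bits used): how each word
                               is split into instructions                    */
    };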

That this hasn't been done before to my knowledge is proof that it's a rotten idea, but can the experts here please explain why – thanks


> you'd have to do two fetches

I think this is the big downside. You're effectively taking information which will always be needed at the same time, and storing it in two different places.

There is never a need for one piece of information without the other, so why not store it together.


> There is never a need for one piece of information without the other, so why not store it together.

Why not? As I said, so you can have full-length instructions!

And you can store it together in the same fetchable unit - the cache line (my option 2)


Interesting idea. Effectively moving the extra decode stage in front of the Icache, making the Icache a bit like a CISC trace/microOp cache. On a 512b line you would add 32 bits to mark the instruction boundaries. At which point you start to wonder if there is anything else worth adding that simplifies the later decode chain. And if the roughly 5% adder to Icache size (figuring less than 1/16th since a lot of shared overhead) is worth it.


Why not treat it like Unicode does and just have two marker instructions before and after the compressed ones?

Start compressed instructions <size>

Compressed instructions

End compressed instructions <size>


Which Unicode encoding are you talking about? It sounds a bit like you're talking about UTF-16 surrogate pairs, but that's not how those work. It's not how UTF-8 or UTF-32 work. So, which encoding is this?


If I understand you correctly, the guy I'm responding to is proposing allowing the mixing of different sized instructions. Your suggestion effectively says "I'm starting a run of compressed instructions/I'm finishing a run of compressed instructions" which is a different proposition. Just my take though.


> let's go stupid big time

Wouldn't that be to Huffman encode the instructions? Fixed table, but still, would save a lot of bits on the common instructions surely...


> instructions will frequently need to be reordered to align

Can't you pad it with nops up to the alignment boundary?

Even if there's not an explicit nop instruction in 16-bit and 32-bit variants (I don't know) there's surely something you can find that will have no side effects.


Yes, of course - but in general you want to avoid padding with nops because it makes the code larger (which, as well as costing more RAM, also uses more power and time to read the code from RAM, and fits less code into the instruction cache, which makes the power and time cost of reading code from RAM even bigger).

If you can make a compiler fill those NOP slots with useful instructions, then all the better.

It adds complexity for humans writing assembly code by hand, but that is a tiny minority of code now.


For one, I don't see why you would ever pad a 16-bit instruction with a 16-bit noop instead of just using the 32-bit equivalent instruction. That way, you can skip decoding the no-op.


Because you have a long, odd-numbered chain of 16-bit ops. Ex: 15 16-bit ops leaves you with a 16-bit NOP in order to realign for the following 32-bit op.


To be more clear: if a 16-bit instruction is 32-bit aligned, then you might want a 16-bit noop so that the instruction following the noop is also 32-bit aligned. But, in that case, you could just use a 32-bit aligned 32-bit instruction at the same location. No padding, one instruction saved, and the following instruction is still 32-bit aligned.

If the 16-bit instruction isn't 32-bit aligned, then the following instruction will be 32-bit aligned with no padding.

So, equivalently: "I don't know why you'd ever want to add padding after a 16-bit instruction in order to force the next instruction to not be 32-bit aligned." Is there such a use case (other than the obvious use case of checking behavior/performance of the sub-optimal case, or writing a noop-slide for an exploit payload)?


Use 14 16-bit ops instead, and use a regular 32-bit op as the 15th instruction (which is already correctly aligned since 14 is an even number).


Motorola 68k has this kind of aligned variable length instructions (in their case it’s always aligned on 16 bit boundaries). It’s not super difficult to support this in compilers though


> They're also saying that if you move forward with the C extension in RVA23 now there's no real backing out of it.

That doesn't seem correct. I think adding and dropping C for desktop/server workloads would be relatively easy. Most of what will be run on it is either open source (Linux, Apache et al) or Java/Python/Go/.Net. Either way, I'd expect Oracle or somebody to support both with a single installer. This isn't x86, where there are a lot of binaries with no source, or lots of janky code that assumes x86 and needs backward compatibility. (Note: IIRC RVA is the "application" profile, not for embedded where hand-tuned assembly is a real thing; things are much more fragile there.)

That said, just like Linux supported multiple x86 based platforms (PC-98), I'd imagine Debian and others would support non-C processors with distros, so I don't think it would really hurt Qualcomm if it's kept and they don't include it.

> They also say that there's lots of implementations with C in already, so backing out of it now disadvantages those implementations.

Ugh. So we should hold onto something, even if it is a bad idea, just because other people wasted time on it? That seems like a very crab-bucket mentality. Not saying it should be removed, but the decision should be technical, not based on favoring certain players.


> I think adding and dropping C for desktop/server workloads would be relatively easy.

It's not super easy, because standard RISC-V without the C extension balloons code size by about 30%. So Qualcomm is proposing a set of custom extensions on top of that to get the code size back down. It's not clear what the patent situation is on those extensions since they're so obviously AArch64-inspired.


More than 30%. For example, in this 2016 paper, the Linux kernel using compression was 67% of the size of the non-compressed version.

Stated in the reverse, removing compressed instructions would increase the kernel size by 50%.

https://people.eecs.berkeley.edu/~krste/papers/EECS-2016-1.p...


From that document what is interesting is that they explicitly avoided the approach that Qualcomm are suggesting:

p. 51

> each RVC instruction must expand into a single RISC-V instruction. The reasons for this constraint are twofold. Most importantly, it simplifies the implementation and verification of RVC processors: RVC instructions can simply be expanded into their base ISA counterparts during instruction decode, and so the backend of the processor can be largely agnostic to their existence. ... This constraint does, however, preclude some important code size optimizations: notably, load-multiple and store-multiple instructions, a common feature of other compressed RISC ISAs, do not fit this template. ... Given these constraints, the ISA design problem reduces to a simple tradeoff between compression ratio and ease of instruction decode cost. ... The dictionary lookup is costly, offsetting the instruction fetch energy savings. It also adds significant latency to instruction decode, likely reducing performance and further offsetting the energy savings. Finally, the dictionary adds to the architectural state, increasing context switch time and memory usage.


Debian maybe, but then again maybe not. The distros don't like supporting a lot of separate arch revisions because they tend to behave like different arches. That is why most of them have dropped 32-bit Arm support: from a distro perspective it was a completely separate arch despite being able to run on much the same HW as the 64-bit Arm distro. Given most Arm devices made in the past ~decade have been 64-bit it was an obvious choice. People with 32-bit binary apps can run them on the 64-bit distro, and the maintainers don't have to keep building/testing/fixing an entirely separate set of machine images.

So, if someone forks the arch such that two different distros are required based on HW, it's just going to fragment the distros too, because some of them will just pick one or the other profile.


I doubt many people would consider it easy. How many people running their stuff on EC2 would like to hear at some point that to upgrade to the newest instance type you need to remake your VMs/containers?


I mean it's irritating, but we all have CI/CD pipelines now don't we? I'd see it being a project for a single team for a month to get it changed, tuned, and verified. We regularly build both x86 images and ARM images for devs on Macs. It's really not that difficult when you aren't redoing manual steps.


Just to add that Qualcomm have also proposed a new extension that will help keep code size down without using the C extension. It includes new load/store addressing modes, pre/post-increment load/stores and load/store pairs, amongst others.

It would seem to take RISC-V closer to AArch64 in approach?


Yes exactly, the 'existence proof of a competitive architecture using exclusively 32-bit instructions' has often been referenced.

Qualcomm's proposal is that all instructions are aligned to their size. Initially that means everything is a 32-bit instruction, now with a lot more green-field encoding space to play with (so less need to have larger instructions). 64-bit instructions would be introduced (aligned on a 64-bit boundary) when needed, with the expectation that they'd be used for rare operations, and 48-bit instructions wouldn't happen.

The SiFive view (and the original RISC-V architects' view) is that RISC-V is meant to be a variable-length instruction set, and that a mix of 16/32/48 provides better static code size along with better dynamic code size, meaning smaller icaches, smaller buffers in fetch units, etc.

Interesting that the architecture that was meant to be a 'purer' RISC implementation than ARM is pushing towards the more CISC style variable length instructions. In a sense Qualcomm are trying to keep it closer to the RISC ideal!


variable length instructions == CISC is not quite a correct equivalence.

The compressed 'C' extension is designed to re-use a lot of the existing decode infrastructure. On RV32 the C instructions are a strict subset of the full-length instructions, so at least on RV32 it is very lightweight, and it adds barely any logic to a core. It's almost always worth it to turn on C extensions versus making the cache bigger or trying to speed up main memory.
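To illustrate what that 1:1 mapping means in practice, the expansion of one compressed instruction looks roughly like this (c.addi only, my own sketch; a real decoder covers the whole RVC map):

    #include <stdint.h>

    /* expand c.addi rd, imm (quadrant 01, funct3 000) into the equivalent
       32-bit addi rd, rd, imm - the backend never sees the compressed form */
    uint32_t expand_c_addi(uint16_t c) {
        uint32_t rd  = (c >> 7) & 0x1f;                                    /* rd == rs1      */
        int32_t  imm = ((c >> 2) & 0x1f) | (((c >> 12) & 1) ? ~0x1f : 0);  /* sext imm[5:0]  */
        return ((uint32_t)imm & 0xfff) << 20   /* imm[11:0]     */
             | rd << 15                        /* rs1           */
             | (0u << 12)                      /* funct3 = ADDI */
             | rd << 7                         /* rd            */
             | 0x13;                           /* OP-IMM opcode */
    }

For example, c.addi x1, 3 (0x008d) expands to 0x00308093, which is exactly addi x1, x1, 3.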

In my experience I-cache pressure is real, especially on lightweight implementations that don't have multiple levels of cache hierarchy and huge amounts of associativity to reduce the impact of an instruction cache miss.

I have played with both C and non-C variants, and also played with compiler tuning that saves code size versus 'performance' (which includes loop unrolling and thus more I cache misses). Generally smaller code size is better for power and system complexity, while keeping performance at par. Of course if you aren't as restricted on power or complexity (as is the case on a high end CPU), the calculus is different.

This kind of simplicity to me embodies the heart of RISC. If your CPU is hitting cache lines more often, you don't have to speculate as deep, don't have to re-order as much, thus less logic, less complexity, less power, higher clock rates and fewer side channels.

On the other hand, I suppose if you are already committed to deep speculation and out of order, compressed instructions might extract a disproportionate cost, maybe less so in decode and more so in precise exception handling and in tricks like register renaming.


> variable length instructions == CISC is not quite a correct equivalence.

Yeah I'd tend to agree, in particular x86 variable length encoding is a lot more complex than the RISC-V encoding!

What I'm really getting at is CISC and RISC aren't well-defined things and it's interesting seeing how the design of RISC-V is getting pulled in different directions.

> so at least on RV32 it is very light weight, and it adds barely any logic to a core. It's almost always worth it to turn on C extensions versus making the cache bigger or trying to speed up main memory.

Definitely, and the Qualcomm proposals are that things should stay that way for RV32/low end in general. It's high-end RV64 they care about.

> On the other hand, I suppose if you are already committed to deep speculation and out of order, compressed instructions might extract a disproportionate cost, maybe less so in decode and more so in precise exception handling and in tricks like register renaming.

This is the root of it. It's easy enough to do a study demonstrating changes in static code size, also easy enough to build a low-end RISC-V processor and examine the trade-offs. It's all a lot more complex at the higher-end especially as high-end RISC-V cores are far from mature.


Really, by a reasonable definition CISC was dead long before x86 became a thing. We don't even conceive of architectures where every instruction has 6 operands, including double indirection and implicit increments.


> We don't even conceive of architectures where every instruction has 6 operands, including double indirection and implicit increments.

Is this an exaggeration, or were there ever such ISAs?


It's a half-remembered description of VAX, but it's not much of an exaggeration if at all.


For those curious, I found this VAX manual, but haven't found anything truly egregious yet:

https://www.ece.lsu.edu/ee4720/doc/vax.pdf


MOVC6


> Interesting that the architecture that was meant to be a 'purer' RISC implementation than ARM is pushing towards the more CISC style variable length instructions. In a sense Qualcomm are trying to keep it closer to the RISC ideal!

The initial idea of RISC-V was pretty much a variable-length RISC ISA, but sane and easy to decode. That is, not the x86 "we need to add yet another prefix" approach.


Why would you need a 64-bit instruction; what kinds of things is it going to be used for?

What does 'rare' mean here, does it mean rare in execution, or rarely appears in code? (The difference being that something might only appear once in your code but be part of your hot loop so be executed any number of times)

If they are rare in execution, what is their value over composing them of 32-bit instructions, where the (rare) overhead of doing so would typically be amortised away?

(The only thing I can think of that 64 bit instruction seem suited to is some kind of internal CPU management instructions, but context switches etc. are relatively rare & very expensive anyway so... I don't know)


From the RVI thread on 48 bit instructions, 64 bit ones would probably look similar:

> There are several 48-bit instruction possibilities.

> 1. PC-relative long jump

> 2. GP-relative addressing to support large small data area, effectively giving GP-relative access to entire data address space of most programs

> 3. Load upper 32-bits of 64-bit constants or addresses

> 4. Or lower 32-bits of 64-bit constants or addresses

> 5. And with 32-bit mask

> 6. More effective ins/ext of 64-bit bit fields

Another thing that's often discussed is moving the vtype and setvl settings into each vector instruction; I'm not sure if that requires 48- or 64-bit instructions.


I was really asking about 64-bit instructions specifically, but going with what you've put, if you don't mind...

> 1. PC-relative long jump

My understanding is that these are rare

> 2. GP-relative addressing to support large small data area, effectively giving GP-relative access to entire data address space of most programs

What is 'GP' here? But as for "...access to entire data address space of most programs": in that case you are just going to be bouncing all over the address space, substantially missing every level of cache much of the time, surely? Maybe you get a little extra code density, but you aren't going to get any extra speed to speak of.

> 3. Load upper 32-bits of 64-bit constants or addresses

> 4. Or lower 32-bits of 64-bit constants or addresses

> 5. And with 32-bit mask

Well yeah, but how common is this? I understand the Alpha architecture team looked at this and found it uncommon, which is why they were okay with less-than-32-bit constants. If it really sped things up you might build a specific cache to store constants (a kind of larger, stupider register set). That would seem a simpler solution.

I'm not sure what you mean with 6, and I'm not familiar with vtype/setvl


On vtype/setvl: in the RISC-V V extension (aka RVV / Vector (≈SIMD)), due to the 32-bit instruction length, there's a separate instruction that does some configuration (operated-on element size, register group size, masked-off element behavior, target element count), which arith/etc operations afterwards will work by. So e.g. if you wanted to add vectors of int32_t-s, you'd need something like "vsetvli x0,x0,e32,m1,ta,ma; vadd.vv dst,src1,src2"

Often one vsetvl stays valid for multiple/most/all instructions, but sometimes there's a need to toggle it for a single instruction and then toggle it back. With 48-bit or 64-bit instructions, such temporary changes could be encoded in the operation instruction itself.

Additionally, masked instructions always mask by v0, which could be expanded to allow any register (and perhaps built-in negation) by more instruction bits too.


> My understanding is that these are rare

Depends on how many bits you had to start with. On Power ISA they aren't common either, but when they happen you need up to seven instructions (lis, ori, rldicl, oris, ori, then for branches mtctr/b(c)ctr) to specify the new address or larger value. Most other RISCs are similar when full 64-bit values must be specified. This is a significant savings.


Well you can embed longer immediates directly in the opcode.

You could have a lot more registers.

The first example, I'm not sure you'd want a full 64bit encoding space. You still aren't going to be able to load a 64bit immediate directly so I'd rather see an instruction that uses the next instruction as the immediate. But then 50% of the time you're still going to be padding this to 64bit alignment, so it's unclear to me that this is a benefit over 2 lots of the same but with 32bit immediates.

The second option is interesting. But if you've got 256 addressable registers say, what use are the 32 and 16 bit instructions that can only address a tiny proportion of those registers.


How do you even use all those registers? Serious question. I've toyed with a couple of 256-register ISAs, and the moment you hit function calls/parameter passing you realize that to utilize those efficiently, you really need some way to indirectly refer to registers, be it register windows, or MMIX's register slide, or Am29k's IPA/IPB/IPC registers; the only other option seems to be to perform global register allocation but that hardly works in scenarios with separate compilation/dynamic code loading.


Off the top of my head I don't really know. But then if you had asked me 20 years ago if we'd need multi core multi GHz multi GB computers to display a web page I'd probably have said no.

I suppose the OS could reserve registers for itself to save swapping in and out quite so often.

Register windows for applications/functions/threads.

Or maybe something radically different, like get rid of the stack, and treat them conceptually like a list?


> I've toyed with a couple of 256-register ISAs, and the moment you hit function calls/parameter passing you realize that to utilize those efficiently

Very revealing, thanks, this had never occurred to me


The sweet spot for scalar code is about 24 registers, but that leads to weird offset-bits (there's an ISA that does this, but I forget what it's called), so 32 registers is easier to implement and provides a mild improvement in the long tail of atypical functions.

On the flip side, the ability to have more registers is very good for SIMD/GPU applications.


Absolutely, I'm not saying a 64-bit instruction length with 5/6/7/8 bits of register specifiers would be bad per se. In fact I'd be interested to see where it leads.

But if you have a processor that also uses 16-bit instructions, those extra registers become unusable. Thumb can't encode all registers in all instructions, so you have the high registers that are significantly less useful than the low registers.

x86 is the same; I've never really done 64-bit asm so I don't know if they improved that.

So then you may as well just divide up the registers so you've got 16 general-purpose registers and 16 registers for SIMD or whatever.


Power10 added "prefixed" instructions, which are effectively 64-bit instructions in two 32-bit halves (the nominal instruction size). They are primarily used for larger immediates and branch displacements.

https://www.talospace.com/2021/04/prefixed-instructions-and-...


MIPS had load-constant-to-upper-half (plus an OR for the low half). More than 40 years ago the Transputer had shift-and-load 8-bit constants. Lots of ancient precedents for rare big constants.


So does classic PowerPC, SPARC, and many other ISAs. It's the most common way to handle it on RISC. The Power10 prefixed instruction idea just expands on it.


Personally, I like the idea of doubling the instruction length every time -- 16, 32, 64, 128, etc. There's a big use case on the longer instruction end for VLIW/DSP/GPU applications.


AFAIK you want short instructions for VLIW because you want to pack multiple of them into a single word.


If this is such a big problem, why have the other RISC-V high-performance people never made this into a big issue?

This really just seems to be Qualcomm wanting it to be more like ARM so they can use their existing cores. That seems pretty clear from what they are proposing.


They want to have their cake and eat it too, with the same instruction set fitting both "small" devices (a few hundred kB of flash/RAM at most) and "big" ones (Linux-kernel-running devices and up).

IMO it's a futile effort that unnecessarily taxes the big cores.


It would decrease code size, but not to the degree the C extension can.


I think James' comment summarizes the problem quite well, as both sides have significant self-interest/sunk cost in their preferred approach: https://lists.riscv.org/g/tech-profiles/topic/rva23_versus_r...


I'm not sure either side has all that much sunk cost. So far, there isn't much RISC-V code that's distributed in binary form and expected to run on future processors.

So far, nearly all RISC-V is in the embedded space where everything is compiled from scratch, and a change to the ISA wouldn't have a huge impact.

Far more important to get it right for RISC-V phones/laptops/servers, where code will be distributed in binary form and expected to maintain forward and backward CPU compatibility for 10+ years.


The sunk cost here I think refers to the existing CPU designs of the respective camps. Qualcomm's ARM-based cores don't support an equivalent of the C extension and adding it would presumably require major and expensive rework.


>> This is basically what qualcomm proposes, 32 bit instructions and 64 bit aligned 64 bit instructions.

Well that's only sunk cost if they assumed from the start that they were going to change the design to RISC-V AND drop the C extension. In that case, it was a rather risky plan from the start - assuming they can shift the industry like that. I'm guessing RISC-V was a change of direction for them and this would make things easier short term.


Debian has already started compiling it, and even more will have been compiled by the time this new incompatible ISA would hit the shelves.

The time to 'get it right' has already passed IMO. If a hard ISA compatibility break happens at this stage, who is going to trust that it won't happen again?


Can’t Debian just recompile it? I think that was the OP’s point. We’re not at a place yet where _only_ binaries are floating around; we have all the source code for these applications.


Reading through this thread, I think that if the conclusion is to disallow misaligned 32-bit instructions, then not much has to happen. You technically only need to relink the programs to fix that issue.

The question is really about how far do they want to go beyond just disallowing page-crossing / cacheline-crossing instructions.

Personally, I always thought the C extension would have been much easier to implement if it had come with certain alignment rules. Imagine looking at a random location in memory: how can you tell where instructions begin and end? You can't.


So basically RVA is like UTF-8 and the proposed RVH will be like UTF-32?


Hmm. The C extension has played a very important PR role for RISC-V, with compressed instructions + macro-op fusion being the main argument for why the lack of addressing modes is no big deal. It would be interesting to see how big of a difference it actually makes to binary sizes in practice though.


We know, it's around 30% larger binaries. That's why Qualcomm also added a bunch of custom extensions.


This is an important point which I didn't realize after reading only gchadwick's comment. It's a discussion of how best to design a compressed instructions extension, not whether to have a compressed instructions extension.

Does Qualcomm have a concrete proposal for how their version of compressed instructions would work, or is the idea more or less just "the C extensions but 32-bit instructions must be 32-bit aligned"? Have they published details somewhere?


AFAIU Qualcomm's proposal for extra 32-bit instructions is https://lists.riscv.org/g/tech-profiles/attachment/332/0/cod...

It adds new addressing modes, and things like load/store-pair instructions.
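
Roughly the sort of thing AArch64 gets in one instruction today versus base RV64 (a sketch; offsets and registers illustrative, AArch64 form shown in the comment):

    # AArch64 loads a register pair in one instruction:
    #   ldp x0, x1, [sp, #16]
    # base RV64 needs two loads:
    ld a0, 16(sp)
    ld a1, 24(sp)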


Ah I had assumed they proposed their own alternative compressed instructions without the alignment issues, but they're actually proposing more addressing modes and adding instructions which operate directly on memory. That makes sense I guess.


Specifically they're proposing adding the kinds of instructions AArch64 uses to get halfway decent code size.


Wouldn't adding these instructions cause potential patent issues? IIRC, the original designers of RISC-V were very careful to only add instructions which were old enough that they can be assumed to not have any trouble with patents.


Yeah, I'm absolutely concerned about that too. Particularly considering that Apple apparently owns a bunch of the patents around AArch64 and cross-licenses them with ARM, so there's some ownership in there somewhere that lawyers have looked at and think are valid.


Oh, that would be great for Qualcomm. I imagine they wouldn't mind a future where, while anyone can implement the base RISC-V, only the big dogs like Qualcomm can implement the extensions everyone actually targets due to patent issues.

(I would mind that future though.)


>Oh, that would be great for Qualcomm. I imagine they wouldn't mind a future where, while anyone can implement the base RISC-V, only the big dogs like Qualcomm can implement the extensions everyone actually targets due to patent issues.

Not an issue; Qualcomm is a member of RISC-V International, thus it has signed the membership agreement, which has legalese designed to prevent this and entire further categories of legal issues.


The RISC-V community seeing an influx of large interests used to ARM-style architectures might create conflicts, but it is really a sign of RISC-V winning.


The compressed instruction extension was described somewhere as overfit to naive gcc output, which seems plausible. It does have a significant cost in 32-bit opcode space. Getting rid of that looks right to me: have some totally different 16-bit ISA if you must, but don't compromise the 32-bit one for it.


RISC-V architectural purism was never going to survive any major effort to deploy it. Either you make changes like what Qualcomm suggest here or you aren't competitive.

The major question is how well RISC-V will manage disputes over this sort of thing without some group such as Qualcomm deciding to just release their version anyway.


I'm wondering what part you call "architectural purism" here. Spending a whole lot of opcode space on a set of compressed instructions doesn't strike me as an especially purist solution to the code size problem, and if what camel-cdr suggests in https://news.ycombinator.com/item?id=37997077 is correct, then Qualcomm's solution is also pretty much a set of 16-bit compressed instructions, but where the 32-bit instructions must be 32-bit-aligned, which strikes me as neither significantly more nor significantly less pure than the current C extension.

To me, this looks like a reasonable argument over design decisions, where there are clear advantages and disadvantages to either side. It's basically a trade-off between code size and front-end complexity. Can you detail where exactly you see the purism thing being an issue?


I believe Qualcomm are proposing dropping 16 bit instruction support, exactly like Aarch64.


You seem to be right. I had interpreted some other responses in this thread to mean that Qualcomm has their own alternative 16-bit encoding that doesn't have the 32-bit instruction alignment issue, but it seems like they instead have a whole bunch of new 32-bit instructions which have memory operands and a bunch of addressing modes.

I see now what you mean by posing this as a conflict between ISA purists (only provide load/store all other instructions have register or immediate operands, only provide one store and one load instruction, add compressed instructions to combat binary bloat) and ISA pragmatists (add new special-case instructions with memory operands and useful addressing modes).


> such as Qualcomm deciding to just release their version anyway.

Qualcomm must resolve this within RISC-V International somehow. Going its own way would make those products non-conformant to the RV spec: not passing test suites, not allowed to carry the RISC-V logo or claim "RISC-V compatible", etc., with all the software headaches that would result, and 3rd-party vendors avoiding such Qualcomm products.

So this comes down to "convince majority of RISC-V members Qualcomm's proposal is better". Or failing that, just deal with it.

Whatever happens, chances are slim that backward compatibility with existing implementations & software would be broken at this point. So creating some kind of alternative profile seems like the most sane option?


Not having the compressed instruction set extension wouldn't make Qualcomm's CPUs not RISC-V. They just wouldn't be compliant with the RVA23 profile.


>wouldn't make Qualcomm's CPUs not RISC-V.

Only if they use the op space reserved for custom extensions.

If they don't and instead step all over space that belongs to C, then they would indeed not be RISC-V.


Yeah, when I wrote that, I didn't realize they were also arguing for an alternate extension that would add address modes and the like. If they fail to add this to the standard, but go ahead with implementing their extension anyway, then their CPUs would be RISC-V but with a custom instruction set extension. RV64g + their custom stuff.


Only if they use custom extension encoding space exclusively, and do not step over standard encoding space, such as what's reserved for the C extension.

Stepping over that standard space is what I understand they're doing, thus it could not be called RISC-V.


> not allowed to carry RISC-V logo or claim "RISC-V compatible", etc

If Qualcomm’s offering is performant (in its dollars, power and speed mix) and Qualcomm keeps it open enough (I think it is at the moment, as they are using an instruction set that anybody can copy), would Qualcomm’s customers care about that? If so, would Qualcomm?

The largest possible concern I see is that customers would have to be convinced that Qualcomm can deliver good compilers that don’t inadvertently spit out instructions not supported by their somewhat off-beat hardware.


I think customers have a few issues. First of all, one of the biggest reasons customer companies want RISC-V is that they can choose between multiple vendors. Qualcomm would very likely be the only high-performance implementation of their standard, so you are binding yourself to Qualcomm.

Second, the software ecosystem is huge, far more than compilers. And given how everybody today uses open source, making all of that available for Qualcomm's variant seems like a losing effort.

Is Qualcomm going to pay to make Android Qualcomm-RISC-V ready? Are they going to provide advanced verification suites, formal analysis and all that stuff?


To be fair, you have historically been able to bork certain Qualcomm devices running legitimate ARM code due to them not supporting all the instructions they claimed. (And the BSP would do sneaky things to attempt to obfuscate this.)

Qualcomm absolutely have the market power in the Android space to redefine a new open ISA if they want to though.


>Qualcomm absolutely have the market power in the Android space to redefine a new open ISA if they want to though.

Only if Google allows them. Which is unlikely.


Seems to me this statement goes way, way too far.

RISC-V has already been deployed widely, including 64-bit. Various medium-performance cores are out there being used or are being introduced soon.

There are also various companies making high-performance RISC-V designs, and not a single one of them has suggested that the RISC-V design isn't going to work well for their effort. In fact, quite the opposite.

And then Qualcomm shows up making this proposal.

> group such as Qualcomm deciding to just release their version anyway.

They are free to do so. They can even call it 'RISC-V' as long as the base ISA is intact. But it's unlikely to become a standard.

It would be very, very hard for them to make all the distros, compilers, and other tools available for their variant. And Google isn't going to make Android available for Qualcomm specifically unless they get paid a lot.

There is a reason Qualcomm wants this to be the new standard: they know they can't finance all the software work themselves.

The reality here is not that RISC-V can't be competitive, but rather that Qualcomm doesn't want to invest lots of money in changing their designs to be 'RISC-V native', so they simply propose making RISC-V almost exactly like AArch64. This seems to me to be a pretty transparent short-term money-saving effort by Qualcomm that nobody else asked for.


I am not quite convinced about companies making high-performance RISC-V designs. The fastest SiFive cores are at best comparable with mid-performance cores from ARM (or E-cores from Apple). With SiFive likely stopping development of these cores altogether, the only high-performance RISC-V design I am aware of is the upcoming Ascalon from Tenstorrent, which (at least on paper) looks to be comparable to Apple's A12. Not quite cutting-edge performance, at any rate.

Qualcomm's proposal to add complex addressing modes to RISC-V is a design that has been tested by time and is known to work. Apple (and now ARM, with X4) are using this ISA design to deliver enthusiast-level performance in a thermal envelope of a compact handheld device. It is not at all obvious to me that RISC-V, which requires the CPU to perform additional work to bundle operations for efficient execution, is capable of the same feat.
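
Concretely, the kind of gap those addressing modes close (a sketch; the AArch64 form is in the comment, register choices arbitrary):

    # AArch64 scaled register-offset load, one instruction:
    #   ldr x0, [x1, x2, lsl #3]
    # base RV64 today:
    slli t0, a2, 3
    add  t0, t0, a1
    ld   a0, 0(t0)
    # with the Zba extension this becomes two:
    #   sh3add t0, a2, a1
    #   ld     a0, 0(t0)

So part of the argument is about whether fusing such pairs in the front end really recovers the difference.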


x86 is known to work and it has 15 possible lengths instead of just 2 (moreover, those lengths aren't known until decode is already started). Despite this, x86 still holds the crown for fastest overall CPU design.

RISC-V compressed instructions are provably WAY less complex to decode than x86 and only a bit more complex than ARM64. Once you get past slicing apart the instructions, RISC-V decoders are much simpler than ARM64's because the stuff they are decoding is way less complex.


Yes, x86 works, but appears to pay a huge cost in power consumption to squeeze out those last few % of performance. Whether it’s just the property of how the mainstream x86 implementations have grown historically or the ISA itself remains to be seen.

I also agree that RISC-V decoders are simpler but only because the base ISA itself is very limited. Once you add functionality like FP, atomics, Zb extension, vectors etc… there is not that much difference. And the need to do fusion for address computation adds another layer of complexity on top.


RISC-V already has those things and the answer seems pretty clear. The P670 gets around the same performance as an A78 while being 50% smaller (according to SiFive), with vectors, atomics, floats, etc.


>Yes, x86 works, but appears to pay a huge cost in power consumption to squeeze out those last few % of performance. Whether it’s just the property of how the mainstream x86 implementations have grown historically or the ISA itself remains to be seen.

https://chipsandcheese.com/2021/07/13/arm-or-x86-isa-doesnt-...


The chief architects of the two companies who are making RISC-V cores seem to have a lot of experience, if you look up their resumes. Arguably more than the people at Qualcomm. Qualcomm bought Nuvia and is now trying to make money from that design; they never actually designed super-high-performance cores themselves.

Addressing modes have never been mentioned as a limiting factor. It's not clear at all that addressing modes are a game changer for performance.

You can also argue that any other unique feature of ARM or x86 is the 'magical pile' that allows for much higher performance. The more reasonable assumption to me is that it's simply about how much is invested to make it happen. I think that, based on its design, the same investment into RISC-V will lead to a higher-performance core than it would for ARM, because of that complexity.

Qualcomm's motivation here seems pretty clear, and I don't believe pure technical merit is at the heart of their desires.

So should I really believe the company that has clear financial motivation to push their line?


> Either you make changes like what Qualcomm suggest here or you aren't competitive.

What's the evidence that the existing RISC-V approach is not competitive, and thus that Qualcomm's changes are necessary?


I've always disliked the RISC-V instruction encoding. The C extension was an afterthought and IMHO could have been done better if it had been designed in from the start. I'm also a fan of immediate data (after the opcode), which for RISC-V I would have made come in 16-, 32-, and 64-bit sizes. The encoding of constants into the 32-bit instruction word is really ugly and also wastes opcode space.

After all the Vector, AI, and graphics stuff is hashed out I'd like to see a RISC-VI with all the same specs but totally redone instruction encoding. But maybe that's just me.


> The C extension was an after thought and IMHO could have been done better if it were designed in from the start.

Not sure what you mean here: opcode space has to have been reserved for the C extension from the start, that part can’t have been an afterthought. It may have been badly designed still, but if so that must be for other reasons (working from a bad code sample is often cited).

> The encoding of constants into the 32-bit instruction word is really ugly and also wastes opcode space.

It kinda has to be to minimise fanout, and with it propagation delays and energy consumption. As a software guy I recoil in horror, but I can’t argue against faster and more efficient decoders. https://www.youtube.com/watch?v=a7EPIelcckk

> I'm also a fan of immediate data (after the opcode), which for RISC-V I would have made come in 16,32,64 bit sizes.

So was I, before I read the RISC-V specs. One possible disadvantage of separate immediate data is wasting instruction space (many constants are much closer to zero than to 127), making alignment issues even worse, and it could increase decoding latency. I would definitely do this for bytecode for a stack machine meant to be decoded by software, but for a register machine I want to instantiate in an FPGA or ASIC, I would think long and hard before making a different choice than RISC-V did.


> The C extension was an after thought

I understand that "afterthought" can be more of a subjective comment on the design than a concrete claim about the order of events, but still I'll quote directly from the RISC-V Instruction Set Manual:

> Given the code size and energy savings of a compressed format, we wanted to build in support for a compressed format to the ISA encoding scheme rather than adding this as an afterthought


From what I remember, Krste has advocated for compressed instruction sets with macro-op fusion in the uarch front-end for a while, and the design of the RV ISA is heavily inspired by this, so it’s not particularly surprising that SiFive (i.e. Krste’s (and others’) company) is opposing Qualcomm’s proposals. It will be very interesting to see what happens.


What I find particularly interesting is that SiFive has never actually built a high-performance CPU core. Their highest-performing IP offers IPC closer to that of ARM's current efficiency cores. And SiFive has always been very vague about other metrics of their CPU cores (like power consumption).


SiFive announces a new core with significantly increased performance mostly every October(ish), and has been doing so every year(ish) since U74 (the core in the fastest currently shipping SoCs) succeeded U54 in October 2018.

They have never built a core comparable to the fastest current Arm or x86 cores because they haven't been moving up the performance curve for very long. Just five generations at this point.

October 2017: U54, almost A53 competitive despite being single-issue

October 2018: U74 (dual issue), A55 class

October 2019: U84 (OoO), A72 class

June 2021: P550, A76 class

December 2021: P650, A78 class

October 2023: P870, Cortex-X3 class

SiFive can't talk a lot about power consumption because that depends not only on the core design but also on the entire SoC, the process node, the corner of the process node, the physical design and many other things that are under the control of SiFive's customers, not SiFive.


Not going to happen. All the tech working groups in the foundation report up through Yunsup, who's the benevolent tech dictator for life; changing the spec wouldn't go through without his approval. And compressed instructions check a box for the low-end cores SiFive is selling. Code density is pretty crappy otherwise, or at least it's crappy compared to the 8051s they're competing against in some corners. The low-end RV16 or RV32 cores are a little more competitive against Cortex M0/3/4s, but even then only when you have the compressed instruction extension.


Those low end cores do not implement the RVA or RVB profiles.



