We just released a tiny (~3kloc) Python library that implements state-of-the-art inference algorithms on GPU and provides performance similar to vLLM. We believe it's a great learning vehicle for inference techniques and the code is quite easy to hack on!
Hi, thanks for the information, but LLVM still does not provide ABI compatibility. If you reduce the struct to 3 i32, it is passed in edi, esi, and edx on my machine. However, according to the ABI, it should be packed into rdi and rsi.
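To make that concrete, here is a minimal C sketch of the case I mean (names are mine, untested):

    /* A 12-byte struct of three i32. Under the SysV x86-64 ABI it is
       classified as two INTEGER eightbytes, so it should be passed
       packed in rdi (a and b) and rsi (c), not spread over edi/esi/edx. */
    struct s { int a, b, c; };

    int sum(struct s v) { return v.a + v.b + v.c; }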
Check out the QBE transcription and what it will compile to: http://c9x.me/paste/mGOO (there is a bit of register shuffling because hinting in the register allocator is not very mature yet, but note that SSA form is not required for the input!).
I actually decided to take more time to answer your comment more thoroughly than the others. Also, I twice TA'd the class that the article you linked below comes from, so I know about it :).
Great! One of my main reasons for posting here is getting the next generation of high-assurance developers the info they need, plus learning from them in what's not my specialty (esp. formal verification). I just hate missed opportunities, given that I rarely run into people who even know what the phrase means or why it matters. ;)
And also, we are seeing more and more certified C programs: see the DeepSpec NSF expedition grant, the Verified Software Toolchain, and the CertiKOS project for examples. I work with these guys.
You are soooo lucky. DeepSpec has a near dream team of people working on this issue. Appel and Chlipala alone could probably knock out most of the problem given enough time. Add the others, and the great stuff on the publication list is entirely unsurprising. Except in its cleverness. :) Glad you brought it up, as I haven't read the information-flow for C & ASM paper yet.
Btw, how's progress coming on those projects? Specifically, are any of the tools (a) useful for non-experts with a little training from tutorials, etc., and (b) available for download in open-source or binary form yet? Thanks ahead of time.
Then maybe I did not express myself in the best terms. QBE definitely supports stack slots and their registerization! Minic, a small C frontend shipped with QBE, makes use of them.
The difference is that LLVM forces you to use them even when you know your locals cannot escape (e.g., when your source language is Pascal); QBE doesn't.
So, LLVM makes you use stack slots for two independent problems: 1. compiling languages like C, where locals can escape, and 2. avoiding the construction of SSA form in the frontend. In QBE, you use stack slots (alloc4, alloc8) to solve 1, but to solve 2 you can simply emit non-SSA form and QBE will fix things up for you (see the sketch below).
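For instance, here is a hand-written sketch in the IL (take the exact spelling with a grain of salt): %a is assigned on two paths and the frontend emits no phi; QBE's internal SSA construction inserts it.

    function w $abs(w %a) {
    @start
        %neg =w csltw %a, 0       # 1 if %a < 0
        jnz %neg, @flip, @end
    @flip
        %a =w sub 0, %a           # %a redefined: the input is not SSA
        jmp @end
    @end
        ret %a                    # QBE inserts the needed phi itself
    }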
> you can simply emit non-ssa form and QBE will fixup things for you
That's exactly what LLVM does, though, except the "non-SSA form" involves loads/stores to alloca'd values. The difference is that in LLVM the language is always in SSA form, it just has some reads/writes to memory that can be pruned, while QBE alternates between being an SSA language and not being one.
LLVM also doesn't "make" you use stack slots. If you wanted to, you could emit fully pruned programs as LLVM IR. Using allocas for variables is a choice that clang makes.
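For illustration, here is roughly what that choice looks like, in a hand-written LLVM IR sketch (not actual clang output):

    define i32 @inc(i32 %x) {
    entry:
      ; clang-style lowering: the local lives in a stack slot
      %x.addr = alloca i32
      store i32 %x, i32* %x.addr
      %v = load i32, i32* %x.addr
      %r = add i32 %v, 1
      ret i32 %r
    }

Note that even this is structurally SSA: every name is assigned exactly once, and the non-SSA-ness is confined to memory traffic that can later be pruned.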
Also, QBE does not really "alternate" SSA/non-SSA: SSA form is built once at the beginning of the compilation pipeline and preserved afterwards.
I don't understand what you mean by "fully pruned programs"; maybe you mean pruned SSA form. In any case, here is my point: with LLVM, either you build SSA yourself or you use allocas. QBE offers a convenient third option.
Some CFG transforms are actually much easier if you get out of SSA form first, reshuffle the CFG without caring about maintaining your phis, and then simply rebuild SSA form; merging or duplicating blocks, for example, would otherwise mean patching up phi arguments by hand.
Expected to see you here. Hey, send me an email or something so I can send you interesting stuff without hunting through your profile or random comments on HN. Address is in my profile. Here's the one I was saving for you, to complement the ML and Haskell CPUs I linked.
Cool stuff, eh? That they keep it close to a regular RISC processor means optimizations for those should carry over. Unlike the Fifth Generation stuff that tried to go way, way the hell too far with Prolog hardware. ;) Should fit nicely into my concept of general-purpose CPUs with purpose-built coprocessors. I also speculate the techniques might be helpful for ASICs meant for today's big-data apps that use things like Datalog for queries. Ya think?
> I don't understand what you mean by "fully pruned programs". Maybe you want to refer to pruned SSA form.
The mem2reg pass in LLVM is the recommended method of constructing pruned SSA form. It just implements the standard algorithm.
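Concretely, running it over the alloca-style sketch above (e.g. opt -mem2reg -S inc.ll, filename mine) should leave something like:

    define i32 @inc(i32 %x) {
    entry:
      %r = add i32 %x, 1
      ret i32 %r
    }

With branches in the picture, mem2reg inserts phis only where the value is live, which is exactly what "pruned" refers to.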
> QBE does not really "alternate" SSA/non-SSA, SSA form is built once at the beginning of the compilation pipeline and preserved later.
The QBE IL presented to the user is not in SSA form, but internally, within your compiler, an SSA representation is kept. I prefer not to have syntactic sugar in my IL.
We've taken 'LLVM' out of the title above, since experience has shown that discussions about titles tend to be off-topic and/or shallow. We also added 'Show HN' since this is your own work. Good luck!
It's much, much smaller (I think libfirm is over 100kloc; QBE is about 6k).
But the major difference is the IL: I use a human-readable, easily printable text IL. This means that you don't need a graph-viewing tool to read it (it's just text) and that you can modify it between two passes super easily. This simple IL is a blessing when debugging a compiler.
I think QBE also has better support for the x64 ABI.
Finally, it is much less advanced than libfirm (fewer optimizations, less testing) and supports only x64 as a target.
Thank you for your words. It is often called NIH, but eh, I learned a lot! And I think that I made some modest improvements over LLVM; you can check them out in my comparison at http://c9x.me/compile/doc/llvm.html
Looks good! I know your target is 70% of the performance, but is there anything fundamental to QBE that means it couldn't be more? Suppose I ported my compiler from LLVM to QBE (I think I could do so without too much effort): would I, at some point, be able to port some of LLVM's optimization passes to QBE to get my performance up to par, or is there a design decision you made that will get in the way of the last 30%?
Great, I have the same goal for my compiler. I'd love for the compiler to be able to bootstrap its own backend, and as it only compiles C, that would rule out LLVM.
I am not sure how far along I am in actually compiling C; if I had to guess, I'd say around 60%. Hopefully there aren't too many crazy things on the horizon.
I'm not a C programmer myself, so I first implemented the switch statement in a naive way, and then I discovered how they actually work and spent days getting it right.
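For anyone wondering what bit me: a C switch is really a computed jump into a statement, cases fall through by default, and case labels can appear even inside nested statements (Duff's device being the extreme case). A small sketch of the fallthrough part (function name is mine, untested):

    /* For n = 0 or 1 this returns 11, for n = 2 it returns 10; a naive
       "list of independent branches" model gets this wrong. */
    int category(int n) {
        int c = 0;
        switch (n) {
        case 0:
        case 1:
            c = 1;     /* 0 and 1 share a body... */
            /* fall through */
        case 2:
            c += 10;   /* ...and 0, 1, and 2 all execute this */
            break;
        default:
            c = -1;
        }
        return c;
    }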
If I get to the point where I can compile trivial C programs like those from the benchmarks game, I'll research a move to QBE :)