I recently came across this style of interpreter, which V8 and HotSpot use. It works by writing the bytecode op handlers in assembly (or some other low-level language), generating machine code for them, and keeping a dispatch table that maps each bytecode opcode to the machine code that executes it.
The main reason seemed to be that both V8 and HotSpot have an optimizing JIT compiler, and having low-level control over the interpreter's machine code means it can be designed to hop efficiently in and out of JIT'ed code (e.g. for on-stack replacement). For example, V8's template interpreter intentionally shares the same ABI as its JIT'ed code, so hopping into JIT'ed code is a single jmp instruction.
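To make that concrete, the dispatch shape looks very roughly like this, sketched in C with function pointers (a real template interpreter emits the handler bodies as machine code and jumps through the table directly; the opcodes here are made up):

    #include <stdint.h>
    #include <stdio.h>

    typedef struct VM VM;
    typedef void (*Handler)(VM *vm);

    struct VM {
        const uint8_t *pc;   /* bytecode instruction pointer */
        int64_t acc;         /* accumulator register */
    };

    /* Dispatch table: opcode -> code that handles it. */
    static Handler dispatch[256];

    /* Each handler does its work, then dispatches on the next opcode. */
    static void op_load_imm(VM *vm) {
        vm->acc = vm->pc[1];
        vm->pc += 2;
        dispatch[*vm->pc](vm);
    }

    static void op_halt(VM *vm) {
        printf("acc = %lld\n", (long long)vm->acc);
    }

    int main(void) {
        dispatch[0] = op_load_imm;
        dispatch[1] = op_halt;
        const uint8_t code[] = {0, 42, 1};   /* load_imm 42; halt */
        VM vm = {code, 0};
        dispatch[*vm.pc](&vm);
        return 0;
    }

The ABI point above is what makes this design pay off: if the JIT'ed code follows the same conventions as the handlers, entering it is just another jump.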
Anyway, I go into more implementation details and I also built a template interpreter based on HotSpot's design and benchmarked it against other techniques.
> Note that the context pointer came after the “standard” arguments. All things being equal, “extra” arguments should go after standard ones. But don’t sweat it! In the most common calling conventions this allows stub implementations to be merely an unconditional jump.
I didn't know this. Does this optimization have a name? Where can I read more about it?
    #include <stdlib.h>

    void free_with_extra_args(int *a, int *b) {
        free(a);
    }
then a is already in the correct slot (the RDI register) for free's first argument when free_with_extra_args is called. Whatever was passed as b is never touched. If you compile this with gcc -O2 you get

    free_with_extra_args:
            jmp     free
If you make the function call free(b) instead, you'll have to move b into the right place before calling free.
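With gcc -O2 that comes out as something like (x86-64, Intel syntax):

    free_with_extra_args:
            mov     rdi, rsi
            jmp     free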
Wikipedia also has a nice summary of calling conventions on other platforms like ARM. All modern calling conventions are similar: pass the first args in registers and then use the stack as needed (https://en.wikipedia.org/wiki/Calling_convention).
That's only one half of it, though; the other interesting part is the jmp (instead of a call), which hands control to the subroutine without pushing a new return address onto the stack. That works because there's no code left to run after the function call and the calling function doesn't need its own stack frame.
This trick stops working as soon as outer_function needs local variables or does anything other than return the exact return value of inner_function; in that case you need a stack frame.
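For example (a minimal sketch, compiled with gcc or clang at -O2; the second function name is made up):

    int inner_function(int x);

    /* Still a tail call: outer_function just returns inner_function's result,
       so the compiler can emit a single jmp and skip the stack frame. */
    int outer_function(int x) {
        return inner_function(x);
    }

    /* No longer a tail call: the result is adjusted after inner_function
       returns, so the compiler must emit a real call and keep state across it. */
    int outer_function_plus_one(int x) {
        return inner_function(x) + 1;
    }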
Since the times of K&R, most C calling conventions are defined so that you can call a function with more arguments than it expects. While the ISO C standard does not permit such calls in strictly conforming code, support for them is still ubiquitous.
[One very common use of this is that the C libraries on common desktop systems call main with three (or more!) arguments, and this works whether the programmer has declared an int main(void), an int main(int argc, char **argv), or an int main(int argc, char **argv, char **envp). I once again have to say that the third form, with an envp argument, is not sanctioned by either ISO C or POSIX, and that a C implementation could definitely use some special handling for an external-linkage function called “main” to allow the first two to be used.]
In practical terms, a function call goes along the lines of “pack the first few arguments into caller-saved registers[1] (using more or less complicated rules), then the rest on the stack, then do a call (using either the stack or the link register depending on the architecture), then after returning pop off the stack part of the arguments (the register part being assumed clobbered and not requiring any cleanup).” This kind of convention is called “caller cleanup” (in contrast to “callee cleanup” conventions—like non-vararg __stdcall on Win32—which have the callee pop the arguments off). While you could imagine other ways to permit extra arguments, it’s certainly the most common one.
That, then, implies that a C function can tail call (jump to) any function whose argument tuple is a prefix of its own. Good compilers will recognize this (and unlike you giving a function extra arguments, they are within their standard-given rights to do so!). If you want proper tail calls to functions of other types... that might be possible, but it’s much more gnarly. See for example the description of the musttail attribute in the Clang documentation[2].
[1] For each machine register, a calling convention defines whether the callee has to preserve it (“callee-saved”) or whether the caller has to save it itself if it cares about what's in it (“caller-saved”).
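For reference, Clang's musttail attribute (the [2] mentioned above) looks like this in C; the function names here are made up, and caller and callee need compatible prototypes:

    int target(int a, int b);

    int wrapper(int a, int b) {
        /* Clang either compiles this as a guaranteed tail call or rejects it. */
        __attribute__((musttail)) return target(a, b);
    }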
I'm not sure if there's a particular name for it, but you could consider it a special case of tail call optimisation: https://en.wikipedia.org/wiki/Tail_call
I feel like compiling a regex to bytecode and executing it should be faster than dealing with big graph-like structures like NFAs/DFAs.
Also, V8's regex engines are very fast because regexes get JIT-compiled to machine code, so wouldn't bytecode just be one step higher in abstraction, with similar but somewhat worse performance?
Before getting to your actual question, it might help to look at a regex benchmark that compares engines (perhaps JITs are not the fastest in all cases!): https://github.com/BurntSushi/rebar
In particular, the `regex-lite` engine is strictly just the PikeVM without any frills. No prefilters or literal optimizations. No other engines. Just the PikeVM.
As to your question, the PikeVM is, essentially, an NFA simulation. The PikeVM just refers to the layering of capture state on top of the NFA simulation. But you can peel back the capture state and you're still left with a slow NFA simulation. I mention this because you seem to compare the PikeVM with "big graph structures with NFAs/DFAs." But the PikeVM is using a big NFA graph structure.
At a very high level, the time complexity of a Thompson NFA simulation and a DFA hints strongly at the answer to your question: searching with a Thompson NFA has worst case O(m*n) time while a DFA has worst case O(n) time, where m is proportional to the size of the regex and n is proportional to the size of the haystack. That is, for each character of the haystack, the Thompson NFA is potentially doing up to `m` amount of work. And indeed, in practice, it really does need to do some work for each character.
A Thompson NFA simulation needs to keep track of every state it is simultaneously in at any given point. And in order to compute the transition function, you need to compute it for every state you're in. The epsilon transitions that are added as part of the Thompson NFA construction (and are, crucially, what make building a Thompson NFA so fast) exacerbate this. So what happens is that you wind up chasing epsilon transitions over and over for each character.
A DFA pre-computes these epsilon closures during powerset construction. Of course, that takes worst case O(2^m) time, which is why real DFAs aren't really used in general purpose engines. Instead, lazy DFAs are used.
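To make the constant-factor gap concrete, here's a rough sketch of the two inner loops in C. This is not any real engine's code: toy state representation, whole-haystack match only, no captures or match offsets.

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    /* DFA search: one table lookup per haystack byte => O(n). */
    static int dfa_is_match(const int *delta /* nstates x 256 */, const uint8_t *accept,
                            int start, const uint8_t *hay, size_t n) {
        int s = start;
        for (size_t i = 0; i < n; i++)
            s = delta[s * 256 + hay[i]];
        return accept[s];
    }

    /* Thompson NFA simulation: per byte, walk every active state and chase
       its epsilon transitions => up to O(m) work per byte, O(m*n) total. */
    typedef struct {
        int on_byte[256];   /* next state for each byte, or -1 */
        int eps[2];         /* epsilon successors, or -1 */
        int accept;
    } NfaState;

    static void add_state(const NfaState *nfa, uint8_t *set, int s) {
        if (s < 0 || set[s]) return;
        set[s] = 1;
        add_state(nfa, set, nfa[s].eps[0]);   /* epsilon closure */
        add_state(nfa, set, nfa[s].eps[1]);
    }

    static int nfa_is_match(const NfaState *nfa, int nstates, int start,
                            const uint8_t *hay, size_t n) {
        uint8_t *cur = calloc(nstates, 1), *next = calloc(nstates, 1);
        add_state(nfa, cur, start);
        for (size_t i = 0; i < n; i++) {
            memset(next, 0, nstates);
            for (int s = 0; s < nstates; s++)               /* every active state... */
                if (cur[s])
                    add_state(nfa, next, nfa[s].on_byte[hay[i]]);  /* ...plus epsilons */
            uint8_t *tmp = cur; cur = next; next = tmp;
        }
        int matched = 0;
        for (int s = 0; s < nstates; s++)
            if (cur[s] && nfa[s].accept) matched = 1;
        free(cur); free(next);
        return matched;
    }

The DFA loop is one table lookup per byte; the NFA loop is a scan over up to m live states per byte plus epsilon chasing, which is exactly where the O(m*n) worst case comes from.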
As for things like V8, they are backtrackers. They don't need to keep track of every state they're simultaneously in because they don't mind taking a very long time to complete some searches. But in practice, this can make them much faster for some inputs.
Hey HN, I got inspired by a recent hack someone did creating Flappy Bird in the macOS Finder, so I tried to do the same with type-level TypeScript.
If you don't know, TypeScript's type system is Turing-complete, so with some clever type trickery you can compute just about anything.
In order to render graphics from TypeScript's types, I ended up creating a compiler that compiles TypeScript types into bytecode, and a custom VM that can execute that bytecode.
Each frame, the VM takes draw commands from the type-level TypeScript and renders them using the configured graphics backend. It can run in the browser by compiling the VM to Wasm and using the web's Canvas API.
I don't do anything related to data science, but I feel like doing it in Rust would be nice.
You get operator overloading, so you can have ergonomic matrix operations that are also typed. Processing data on the CPU is fast, and crates like https://github.com/EmbarkStudios/rust-gpu make it very ergonomic to leverage the GPU.
I like this library for creating typed coordinate spaces in graphics programming (https://github.com/servo/euclid); I imagine something similar could be done to create refined types for matrices so you can't multiply matrices of incompatible dimensions.
> These "clean code" principles should not, and generally are not, ever used at performance critical code, in particular computer graphics.
I agree that this is mostly true, but maybe not for beginners in the field.
When I was reading "Ray Tracing in One Weekend" (known as _the_ introductory literature on the topic), I was very surprised to see that the author designs the code so objects extend a `Hittable` class and the critical ray-intersection function `hit` is dynamically dispatched through the `virtual` keyword, and thus suffers a huge performance penalty.
This is the hottest code path in the program, and a ray tracer is certainly performance-critical, but the author is instructing students/readers to use this "clean code" principle and it drastically slows down the program.
So I agree most computer graphics programmers aren't writing "clean code", but I think a lot of new programmers are being taught these principles by the introductory literature.
I have always found CRDTs a lot easier to wrap my brain around. Agreed with Joseph that optimizing them is non-trivial. In Teletype for Atom, we indexed fragments in splay trees, and each text fragment was actually a member of two splay trees at the same time: one for the full document, another for the pieces of the original insertion as they'd been split apart. We would find the insertion fragment in the insertion tree, then walk parent pointers in the document tree to locate it. In Zed, we have two trees, but can't do this double embedding because we use persistent data structures where parent pointers are not an option. So we instead have insertion trees that map subslices of inserted text to fragment ordered identifiers, which we use for lookup in the main document B-tree. These fragment identifiers were themselves inspired by another CRDT paper called Logoot, but we don't actually send them over the wire.
Both OT and CRDT approaches have lots of obscure edge cases. In both cases, I can't write a correct algorithm without a fuzzer to save myself. (And I don't trust anyone else to, either).
CRDTs aren't that complex on the surface - this messy file[1] implements 4 different list CRDT algorithms in a few hundred lines of code. And CRDTs are pretty simple for JSON structures, which is not at all true for OT algorithms.
But text CRDTs in particular need a whole lot more thought around optimization because of how locations work. Locations in list/text CRDTs generally work by giving every item a globally-unique ID and referencing that ID with every change. So, inserts work by saying "Insert between ID xxx and yyy". But, if you have a text document which needs to store a GUID for every keystroke which has ever happened, disk and memory usage becomes crazy bad. And if you don't have a clever indexing strategy, looking up IDs will bottleneck your program.
In diamond types (my CRDT), I solve that with a pancake stack of optimizations which all feed into each other. IDs are (agent, sequence number) pairs so I can run-length encode them. Internally those IDs get flattened into integers, and I use a special hand-written B-tree which internally run-length encodes all the items it stores. I've iterated on Yjs's file format to get the file size down to ~0.05 bytes of overhead per inserted character in some real-world data sets, which is wild given that each inserted character needs to store about 8 different fields (insert position, ID (agent + seq), parents, the inserted character, and so on).
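Roughly, the shape of that run-length trick looks like this (a sketch with made-up type names, not diamond types' actual data structures):

    #include <stdint.h>

    /* Naive: one op per keystroke, each with its own globally-unique ID. */
    typedef struct { uint32_t agent; uint32_t seq; } Id;

    typedef struct {
        Id   id;            /* ID of this inserted character */
        Id   origin_left;   /* "insert between xxx and yyy" */
        Id   origin_right;
        char ch;
    } InsertOp;

    /* Run-length encoded: "agent A typed len consecutive characters starting at
       seq_start, right after some ID" collapses a whole typing run into one record. */
    typedef struct {
        uint32_t agent;
        uint32_t seq_start;
        uint32_t len;
        Id       origin_left;
        Id       origin_right;
        /* the characters themselves live in a separate contiguous buffer */
    } InsertRun;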
CRDTs aren't that complex. Making them small and fast is the challenge. Automerge has been around for years and still takes hundreds of megabytes of RAM to process a simple editing trace for a 10-page text document, even in their Rust implementation. Diamond types is orders of magnitude faster and uses ~1MB of RAM for the same test.
The current version of diamond types is many times faster than any OT algorithm I've written in the past thanks to all those optimizations. Now that I've been down this road, I could optimize an OT algorithm in the same way if necessary, but you don't really need to.
Other kinds of CRDTs don't need these optimizations (e.g. the sort you'd use for a normal database). There are two reasons: 1. When editing text, you make an edit with each keystroke, so you have a lot of edits. 2. Merging list items correctly is orders of magnitude more complex than merging a database entry. If you just want CRDTs for editing a database of records, that'd probably only take a few dozen lines.
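For example, a last-writer-wins register (one common per-field database CRDT) really is only a few lines. This is just a sketch with made-up names, not any particular library's API:

    #include <stdint.h>

    /* A last-writer-wins register holding a single field's value. */
    typedef struct {
        uint64_t timestamp;   /* Lamport timestamp (or similar logical clock) */
        uint32_t agent;       /* tie-breaker so concurrent writes merge deterministically */
        char     value[64];
    } LwwRegister;

    /* Merge is commutative, associative and idempotent: keep the "newer" write. */
    static void lww_merge(LwwRegister *dst, const LwwRegister *src) {
        if (src->timestamp > dst->timestamp ||
            (src->timestamp == dst->timestamp && src->agent > dst->agent)) {
            *dst = *src;
        }
    }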
Interesting that the wiki says the performance-enhancing effect comes from a lower heart rate, which is something you can influence with practice.
You can lower your resting heart rate naturally through exercise over time, and with practice, meditation and breathing techniques give you some control over your heart rate in the moment.
I was pretty intrigued. How does it compare to techniques with way less engineering cost, like switch+loop, direct threading, and only using tail calls (https://blog.reverberate.org/2021/04/21/musttail-efficient-i...)?