This is a nice attitude. I think HN is overall pretty nice for geeking out and also hearing other people geek out, but there is still a strain of elitism (not like StackExchange thankfully) and so I'm happy to see comments like this.
Those types can't help themselves, so patterns emerge and usernames become recognizable after a while. There are some people I just don't bother engaging with anymore. Of course, those experiences are my own and may not match others'.
Basically, it just selects all substrings that appear in the corpus up to a certain length (and then possibly trims that set by discarding rare substrings, to reduce the initial size a bit and make things faster).
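If it helps, here is roughly what that looks like in code. This is a minimal sketch of the "enumerate substrings, then trim" idea, not SentencePiece's actual implementation; seed_vocab, max_len, and min_count are made-up names, and the real thing uses a suffix-array-based enumeration instead of this brute-force loop.

    from collections import Counter

    def seed_vocab(corpus, max_len=8, min_count=5):
        # Collect every substring up to max_len characters, then drop
        # the rare ones to shrink the initial candidate set.
        counts = Counter()
        for line in corpus:
            for i in range(len(line)):
                for j in range(i + 1, min(i + max_len, len(line)) + 1):
                    counts[line[i:j]] += 1
        return {s: c for s, c in counts.items() if c >= min_count}

    # seed = seed_vocab(open("corpus.txt").read().splitlines())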
If the library has two vocabulary learners, only one of which does the described thing, then isn't it unambiguous which implementation within the library the question refers to? And wouldn't it be ambiguous to instead say "how does Unigram do it" without referring to any particular implementation?
Anyway, the paper says "Frequent substrings can be enumerated in O(T) time and O(20T) space with the Enhanced Suffix Array algorithm (Nong et al., 2009)", which is hilariously underspecified, at least in part because a suffix array construction algorithm isn't by itself a top-k frequent-substring algorithm.
People may also be interested in Pynini [1], a Python wrapper (plus a lot of additional ease-of-use functionality) around OpenFst [2] (a really great library for transducers).
There are some good tutorials in the form of homework assignments (from like Johns Hopkins and some others) that go through Pynini use cases.
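If you just want a feel for the API, here is a tiny sketch (assuming a standard Pynini install; the "colour" -> "color" rewrite is just a made-up example, and pynini.shortestpath(...).string() is the usual idiom from the docs, if I remember it right):

    import pynini
    from pynini.lib import byte

    # Any sequence of bytes, used as the alphabet for the rewrite rule.
    sigma_star = pynini.closure(byte.BYTE)

    # Context-dependent rewrite: replace "colour" with "color" anywhere
    # in the input (empty left and right contexts).
    rule = pynini.cdrewrite(pynini.cross("colour", "color"), "", "", sigma_star)

    def rewrite(text: str) -> str:
        # Compose the input with the rule and read off the best path.
        return pynini.shortestpath(text @ rule).string()

    print(rewrite("my favourite colour"))  # -> "my favourite color"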
You might be interested in the Recurse Center (https://www.recurse.com/) and the experiences of people who have gone through it (they heavily encourage blogging about your time there so there is lots to read).
Note: I am not affiliated with the Recurse Center, just a big fan.
1) the sequence length increases too much. Idk what the average token length is for Llama, but imagine it's like 5+ bytes. Using individual bytes as tokens immediately makes the context ~5x longer, which is super bad for inference speed and memory requirements, since attention inference is quadratic in the length of the sequence (see the quick sketch after these two points).
2) individual bytes have essentially no meaning, so byte embeddings are harder to learn. Subword tokens aren't a perfect solution, but they definitely often have some standalone meaning, so learning embeddings for them makes sense.
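To make the first point concrete, a quick comparison using OpenAI's tiktoken library purely as an example subword tokenizer (the exact ratio obviously depends on the tokenizer and the text):

    import tiktoken  # pip install tiktoken

    text = "Subword tokenizers compress ordinary English text quite a bit."
    enc = tiktoken.get_encoding("cl100k_base")

    n_bytes = len(text.encode("utf-8"))
    n_subwords = len(enc.encode(text))

    print(n_bytes, "byte-level tokens")
    print(n_subwords, "subword tokens")
    print(f"~{n_bytes / n_subwords:.1f}x longer at the byte level")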
I'll give another example from a recent paper that tries to eliminate tokenizers (this is a popular research direction) [1].
Figure 4 is a really good example of why byte-level models waste computation. Once part of a word is generated, most of the remaining bytes are assigned basically probability 1, but a byte-level model still has to spend a decoding step on each of them. With a subword-level model, most of these easy-to-decode bytes would be packed together into a single token, so you don't have to decode them individually.
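A toy version of that point, with invented numbers rather than the paper's actual figure:

    # After a model has produced "tokeniz", the remaining bytes of
    # "tokenization" are nearly forced (probabilities invented here).
    next_byte_probs = {"a": 0.99, "t": 0.99, "i": 0.99, "o": 0.99, "n": 0.99}

    byte_level_steps = len(next_byte_probs)  # one forward pass per byte: 5
    subword_steps = 1                        # "ation" emitted as a single token

    print(byte_level_steps, "decoding steps at the byte level vs", subword_steps)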
When model APIs bill by the token, this is an important consideration.
Hi, I'm Cognetta from the above Cognetta et al. I can't answer all of your questions (and I can't speak for the authors of this paper ofc), but I will try to answer some.
> Is a tokenizer that maximizes the compression of text (e.g. by identifying longer tokens that tend to be used whole) necessarily a better tokenizer, in terms of overall model performance? Compression might be a useful property for an objective function to consider... but then again maybe not, if it makes the problem NP-hard.
Compression isn't necessarily the best metric for language modeling quality [1][2][3], though some papers do find a correlation between the two [4], and it has one important practical benefit: it makes input sequences shorter and therefore reduces inference time (this is particularly important for transformers, because the runtime is quadratic in the sequence length).
If you imagine that, with enough data, basically any reasonable tokenization algorithm would be OK (I think this is mostly true; there are definitely bad and "better" tokenizers, and you see that very clearly in small-data settings, but once you get into the trillions-of-tokens, tens-of-billions-of-parameters regime, other things matter more), then optimizing the tokenizer for compression is a good choice, since it provides a tangible practical benefit in the form of reduced inference time.
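For reference, the compression metric people usually look at is just something like average bytes per token on held-out text. A rough sketch, where the encode callable stands for whatever tokenizer you're measuring (e.g., tiktoken's Encoding.encode or SentencePiece's encode):

    def bytes_per_token(corpus, encode):
        # Average UTF-8 bytes per token; higher means better compression,
        # i.e. shorter sequences for the same text.
        total_bytes = sum(len(line.encode("utf-8")) for line in corpus)
        total_tokens = sum(len(encode(line)) for line in corpus)
        return total_bytes / total_tokens

    # import tiktoken
    # enc = tiktoken.get_encoding("cl100k_base")
    # print(bytes_per_token(open("heldout.txt").read().splitlines(), enc.encode))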
> I'm also not sure how realistic the limitation to "at most δ symbols" is. [...] But why not just keep adding tokens as needed, rather than imposing any preordained limit?
This is a pretty realistic limitation imo. Of course you can arbitrarily increase the vocabulary size, but there is a tradeoff between modeling quality, parameter count, and inference time. If you increase the vocabulary a bunch, your inference speed will probably improve (although now you have a much larger softmax at the end of your model, which isn't usually a bottleneck anymore, but still not great), parameter count will increase (due to the larger embedding table), and your modeling quality will go down (in that you have tokens which are so rare in the corpus that they are massively undertrained; this can cause big problems [5]).
So by constraining it to δ, you are basically setting a parameter budget for the vocabulary, and this is a pretty reasonable thing to do.
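Back-of-the-envelope version of that parameter budget (d_model = 4096 and the vocabulary sizes are just illustrative numbers):

    def vocab_params(vocab_size, d_model, tied=True):
        # Parameters spent on the vocabulary: the embedding table, plus the
        # output softmax matrix if it isn't tied to the embeddings.
        table = vocab_size * d_model
        return table if tied else 2 * table

    print(f"{vocab_params(32_000, 4096):,}")   # 131,072,000   (~131M)
    print(f"{vocab_params(256_000, 4096):,}")  # 1,048,576,000 (~1.05B)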
> IIRC OpenAI's tokenizer has a vocabulary of around 52k subword strings.
Yeah, the size of the vocabulary varies a lot across models, but it isn't unusual to see significantly larger vocabularies these days (e.g., Gemma has ~256k). However, these are still finite and very small compared to the corpus size.
> How could you possibly choose a meaningful δ from first principles?
This is a really great question, and something that we don't know how to answer. A lot of work has tried to answer it [6][7], but it is very much an open question.
Sorry, I actually misread part of your comment in relation to the paper and confused δ with another parameter, K.
To clarify, δ is the number of tokens in the tokenized corpus and K is the size of the vocabulary.
So, if you are asking why they would limit _K_, then my answer above still applies (after swapping δ for K). But if you really do mean "why do they pick some arbitrary δ as the limit on the size of the tokenized corpus", then I think the answer is just "because that makes it a decision problem".
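For concreteness, this is how I read the decision version, in my own paraphrase rather than the paper's exact wording (C is the corpus, tok_V(C) is C tokenized with vocabulary V):

    % Given a corpus C, a vocabulary budget K, and a token budget \delta:
    \exists\, V \ \text{with}\ |V| \le K \ \text{such that}\ |\mathrm{tok}_V(C)| \le \delta\,?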
Our paper [1] is kind of a goofy adversarial thing where we thought "here's this cool metric, how can we break it?". The tokenizers we propose are definitely not tokenizers you should use in practice.
The original paper that proposes the metric is, imo, much more interesting theoretically [2].