1) the sequence length blows up. I don't know the exact average token length for Llama, but suppose it's around 5 bytes. Using individual bytes as tokens immediately makes the context ~5x longer, which is really bad for inference speed and memory requirements (attention cost grows quadratically with sequence length). See the sketch after this list for rough numbers.
2) individual bytes have essentially no meaning, so byte embeddings are harder to learn. Subword tokens aren't a perfect solution, but they often carry some standalone meaning, so their embeddings make sense to learn.
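To put rough numbers on point 1, here's a quick sketch. I'm using tiktoken's cl100k_base encoding as a stand-in (it's not Llama's tokenizer, and the ~5 bytes/token figure is just my assumption), since the ratio is what matters, not the exact counts:

```python
# Rough sketch: sequence-length blowup from byte-level vs. subword tokenization.
# cl100k_base is a stand-in tokenizer (not Llama's); the ratio is the point.
import tiktoken

text = "Byte-level models make the sequence much longer than subword models."

enc = tiktoken.get_encoding("cl100k_base")
num_subword_tokens = len(enc.encode(text))
num_bytes = len(text.encode("utf-8"))

ratio = num_bytes / num_subword_tokens
print(f"subword tokens: {num_subword_tokens}")
print(f"bytes:          {num_bytes}")
print(f"length ratio:   {ratio:.1f}x")

# Attention compute scales roughly with sequence length squared, so the
# blowup in attention cost is on the order of ratio**2.
print(f"approx attention cost blowup: {ratio**2:.1f}x")
```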
I'll give another example from a recent paper that tries to eliminate tokenizers (this is a popular research direction) [1].
Figure 4 is a really good example of why byte-level models waste computation. Once part of a word is generated, most of the remaining bytes are assigned essentially probability 1, but a byte-level model still has to spend a full decode step on each of them. With a subword-level model, most of these easy-to-predict bytes would be packed together into a single token, so you don't have to decode them individually.
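To make that concrete, here's a toy sketch (the vocabulary and greedy longest-match tokenizer are made up for illustration; real BPE merges differ) comparing how many autoregressive decode steps each approach spends on the same word:

```python
# Toy illustration: decode steps for byte-level vs. subword generation.
# The vocabulary and greedy longest-match tokenizer are invented for
# illustration; the step-count gap is the point.

def greedy_subword_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match tokenization against a fixed vocabulary."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible match first, fall back to a single char.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

text = "tokenization"
vocab = {"token", "ization", "tokenization"}

byte_steps = len(text.encode("utf-8"))  # one decode step per byte
subword_steps = len(greedy_subword_tokenize(text, vocab))

print(f"byte-level decode steps:    {byte_steps}")     # 12
print(f"subword-level decode steps: {subword_steps}")  # 1 with this toy vocab

# The byte-level model spends 12 forward passes on a word whose tail bytes
# are nearly deterministic; the subword model emits it in a single step.
```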
When model APIs bill by the token, this is an important consideration too: a byte-level model emits several times more tokens for the same output.