1) the sequence length blows up. I don't know the exact average token length for Llama, but suppose it's around 5 bytes. Using individual bytes as tokens immediately makes the context ~5x longer, which is really bad for inference speed and memory requirements (attention cost grows quadratically with sequence length). See the sketch after this list for rough numbers.
2) individual bytes have essentially no meaning, so byte embeddings are harder to learn. Subword tokens aren't a perfect solution, but they often carry some standalone meaning, so their embeddings make sense to learn.
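To put rough numbers on point 1, here's a quick sketch. I'm using tiktoken's cl100k_base encoding as a stand-in (it's not Llama's tokenizer, and the ~5 bytes/token figure is just my assumption), since the ratio is what matters, not the exact counts:

```python
# Rough sketch: sequence-length blowup from byte-level vs. subword tokenization.
# cl100k_base is a stand-in tokenizer (not Llama's); the ratio is the point.
import tiktoken

text = "Byte-level models make the sequence much longer than subword models."

enc = tiktoken.get_encoding("cl100k_base")
num_subword_tokens = len(enc.encode(text))
num_bytes = len(text.encode("utf-8"))

ratio = num_bytes / num_subword_tokens
print(f"subword tokens: {num_subword_tokens}")
print(f"bytes:          {num_bytes}")
print(f"length ratio:   {ratio:.1f}x")

# Attention compute scales roughly with sequence length squared, so the
# blowup in attention cost is on the order of ratio**2.
print(f"approx attention cost blowup: {ratio**2:.1f}x")
```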
I'll give another example from a recent paper that tries to eliminate tokenizers (this is a popular research direction) [1].
Figure 4 is a really good example of why byte-level models waste computation. Once part of a word is generated, most of the remaining bytes are assigned essentially probability 1, but a byte-level model still has to spend a full decode step on each of them. With a subword-level model, most of these easy-to-predict bytes would be packed together into a single token, so you don't have to decode them individually.
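To make that concrete, here's a toy sketch (the vocabulary and greedy longest-match tokenizer are made up for illustration; real BPE merges differ) comparing how many autoregressive decode steps each approach spends on the same word:

```python
# Toy illustration: decode steps for byte-level vs. subword generation.
# The vocabulary and greedy longest-match tokenizer are invented for
# illustration; the step-count gap is the point.

def greedy_subword_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match tokenization against a fixed vocabulary."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible match first, fall back to a single char.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

text = "tokenization"
vocab = {"token", "ization", "tokenization"}

byte_steps = len(text.encode("utf-8"))  # one decode step per byte
subword_steps = len(greedy_subword_tokenize(text, vocab))

print(f"byte-level decode steps:    {byte_steps}")     # 12
print(f"subword-level decode steps: {subword_steps}")  # 1 with this toy vocab

# The byte-level model spends 12 forward passes on a word whose tail bytes
# are nearly deterministic; the subword model emits it in a single step.
```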
When model APIs bill by the token, this is an important consideration too: a byte-level model emits several times more tokens for the same output.