I think you meant to say that tokenization is usually done on UTF-8 bytes, and a single Japanese character generally takes 3 or more code units (i.e. bytes). Unicode itself is not the culprit: in fact, even with UTF-16 tokenization, most Japanese characters would fit in a single code unit, and the ones that wouldn't are exceedingly rare.
I have to admit I have not encountered significant mistokenization issues in Japanese, but I'm not using LLMs in Japanese on a daily basis. I'm somewhat doubtful this can be a major issue, since frontier LLMs are absolutely in love with emoji, and emoji require at least 4 UTF-8 bytes, while most Japanese characters are happy with just 3 bytes.
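
For reference, here's a quick Python check of the byte and code-unit counts being discussed (the specific characters are just illustrative examples I picked):

    # Compare encoded lengths of a hiragana, a kanji, and an emoji.
    samples = {"あ": "hiragana", "漢": "kanji", "😀": "emoji"}

    for ch, label in samples.items():
        utf8_bytes = len(ch.encode("utf-8"))             # bytes in UTF-8
        utf16_units = len(ch.encode("utf-16-le")) // 2    # 16-bit code units in UTF-16
        print(f"{label}: {utf8_bytes} UTF-8 bytes, {utf16_units} UTF-16 code unit(s)")

    # Output:
    #   hiragana: 3 UTF-8 bytes, 1 UTF-16 code unit(s)
    #   kanji: 3 UTF-8 bytes, 1 UTF-16 code unit(s)
    #   emoji: 4 UTF-8 bytes, 2 UTF-16 code unit(s)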