What cost do they incur while tokenizing highly mistyped text? Woof. And then the cost of later deciding whether it's genuine garbage or just a typ0.
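As a rough illustration of that cost (a sketch assuming the `tiktoken` package and the `cl100k_base` encoding, neither of which the comment names), mistyped words tend to fragment into more tokens than their clean counterparts:

```python
# Sketch: compare token counts for clean vs. mistyped text.
# Assumes `tiktoken` is installed (pip install tiktoken); any BPE
# tokenizer would show a similar fragmentation effect.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

pairs = [
    ("the quick brown fox jumps over the lazy dog",
     "teh qiuck brwon fox jmups ovre teh lzay dog"),
    ("vernacular English has changed",
     "venacular Engilsh haz chanegd"),
]

for clean, typo in pairs:
    print(f"{len(enc.encode(clean)):>3} tokens  clean: {clean}")
    print(f"{len(enc.encode(typo)):>3} tokens  typo : {typo}")
```

The mistyped versions encode into noticeably more tokens, which is one concrete cost before the model even gets to deciding what the text means.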
I'm trying to remember the article that tested small inline weirdness in the prompt to get surprising output. That was the inspiration for the up up down down left right left right B A approach.
There are multiple people claiming this in this thread, but with no more than an "it doesn't work, stop." It would be great to hear some concrete information.
I think OP is claiming that if enough people use these obfuscators, the training data will be poisoned. The LLM being able to translate it right now is not proof that this won't work, since it currently has enough "clean" data to compare against.
If enough people are doing that, then vernacular English has simply changed to be like that.
And it still isn't a problem for LLMs. There is sufficient history for them to learn from, and in any case low-resource language learning shows they are better than humans at picking up language patterns.
If it follows an approximate grammar, then an LLM will learn from it.
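As a toy illustration of why that is (a hypothetical leet-speak-style mapping, not anything from the thread): an obfuscation that follows a consistent rule is just an invertible substitution, so the pattern is exactly the kind of regularity a model can pick up from examples.

```python
# Toy sketch: a rule-based "obfuscator" with a consistent pattern.
# Because the mapping is regular, it is trivially invertible, which is
# why patterned obfuscation is learnable rather than protective.
SUBS = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5"}
INVERSE = {v: k for k, v in SUBS.items()}

def obfuscate(text: str) -> str:
    return "".join(SUBS.get(ch, ch) for ch in text.lower())

def deobfuscate(text: str) -> str:
    return "".join(INVERSE.get(ch, ch) for ch in text)

msg = "poisoning the training data"
scrambled = obfuscate(msg)
print(scrambled)               # p0150n1ng th3 tr41n1ng d4t4
print(deobfuscate(scrambled))  # poisoning the training data
```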