> Bidi override control characters are clearly not among them, whichever language you choose.
Not sure how you would write a comment in an RTL human language in the middle of LTR code without it. Lots of people learn to write RTL languages well before writing any code.
What compilers can do is to process those characters and assign them semantic value that makes the code equivalent to what is expected to be rendered.
Now, bidi overrides in identifier names are a nightmare I’d prefer to avoid.
You do not actually need the bidi override control character to put a comment in an RTL language in the middle of LTR code.
You only need it if you are doing this and the default Unicode algorithm for guessing LTR/RTL boundaries gets it wrong, so that you have to correct it with an explicit bidi override control. I'm not even sure how feasible it is to enter one in the editors/IDEs that developers with this use case are likely to use.
I am genuinely curious how often these sorts of situations come up in actual development.
> What compilers can do is to process those characters and assign them semantic value that makes the code equivalent to what is expected to be rendered.
I don't understand what you mean or how that's even possible, for the kinds of attacks discussed in OP.
Btw here's proof. Here is ltr text and rtl עִברִית text عربي
interspersed with no bidi override control characters to be found.
Unicode can handle this; it has a heuristic algorithm for it. Note how, if you try to select the text character by character, your selection does funny things at the rtl to ltr boundaries, because the order the characters are stored in doesn't match the order on the screen. It really is handling the directionality changes: the letters were entered in "order" across the changes, with no funny entry or ordering going on. This is plain old normal Unicode handling interspersed directionality changes just fine, with no bidi overrides.
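If you want to check that claim rather than take my word for it, here is a rough sketch in Python. The helper name `find_bidi_controls` and the exact set of code points are my own choices for illustration (the explicit embedding, override, and isolate characters), not something any particular compiler or linter is known to use. The sample string spells out the Hebrew and Arabic words with escapes so the source itself stays unambiguous.

```python
import unicodedata

# Explicit directional formatting characters: embeddings (LRE, RLE),
# overrides (LRO, RLO), their terminator (PDF), and the isolates
# (LRI, RLI, FSI, PDI). This set is an assumption chosen for the sketch.
BIDI_CONTROLS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
    "\u2066", "\u2067", "\u2068", "\u2069",
}

def find_bidi_controls(text):
    """Return (index, character name) for each explicit bidi control in text."""
    return [
        (i, unicodedata.name(ch))
        for i, ch in enumerate(text)
        if ch in BIDI_CONTROLS
    ]

# Mixed LTR/RTL sample: English, a Hebrew word, English, an Arabic word.
sample = (
    "Here is ltr text and rtl "
    "\u05e2\u05b4\u05d1\u05e8\u05b4\u05d9\u05ea"  # Hebrew word, written with escapes
    " text "
    "\u0639\u0631\u0628\u064a"                    # Arabic word, written with escapes
)

print(find_bidi_controls(sample))            # []
print(find_bidi_controls("abc\u202edef"))    # [(3, 'RIGHT-TO-LEFT OVERRIDE')]
```

The mixed-direction sample comes back empty, while a string with an embedded U+202E gets flagged at its index.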
It just sometimes guesses wrong relative to what the author intended, especially when the characters at the boundaries are themselves not strongly associated with either rtl or ltr (like ordinary "western arabic numerals" or punctuation). That's what the bidi override control char is for.
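To make "not strongly associated" concrete, Python's standard `unicodedata.bidirectional` reports the bidi class Unicode assigns to each character. This is only a small sketch (the helper `bidi_classes` is a made-up name): letters get a strong class (`L`, `R`, `AL`), while digits, spaces, and most punctuation get weak or neutral ones, which is why their direction has to be inferred from context.

```python
import unicodedata

def bidi_classes(text):
    """Return the Unicode bidi class of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]

# Latin letter, Hebrew letter, Arabic letter, digit, space, exclamation mark
print(bidi_classes("a\u05e2\u06391 !"))
# ['L', 'R', 'AL', 'EN', 'WS', 'ON']
```

Here `EN` (European number) is a weak class and `WS`/`ON` are neutral, so a digit or a punctuation mark sitting between an rtl run and an ltr run is exactly where the guess can go either way.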
The same way as you write a comment in an LTR human language in the middle of RTL code: you don't. You stick to either LTR or RTL. This is code, not prose.