A long while back, the first time someone decided to add content filtering to the corporate proxy, every local news site became inaccessible on Fridays, because in Portuguese “Sexta-feira” is abbreviated as “Sex.”
You’d be surprised how many variations of this I’ve come across over the years.
Surely, that is a solved problem now that Google has "advance[d] the state of the art in natural language technologies and buil[t] systems that learn to understand and use language in context"? [1]
Or, I guess, one can just buy more NVIDIA GPUs, and simply solve the problem using Torch, following their online guide on "understanding Natural Language with Deep Neural Networks Using Torch" [2].
In general, I hear that many language understanding tasks are finally conclusively conquered, as shown by OpenAI with a combination of supervised and unsupervised learning that has led to an ImageNet-class increase in the state-of-the-art results on language understanding datasets [3].
Why, these days, we can even build an "AI System [that] Understands Music Like Humans Do"! (NVIDIA, again; [4])
The inhabitants of Scunthorpe and all the Dicks, Pennys and Dikshits of the world can finally rejoice! Machine learning is on the job!
Pffft, wake me up when these cowabunging AI witches can draw inferences from phono-semantic improvisation on the part of us brain-endowed motherslammers.
(Gotta say, it's way easier to ad-lib this stuff in inflection-heavy languages, or at least one of them.)
Not to be too cliche, but a sufficiently advanced AI would be able to solve this problem, because it would have the ability to recognize context as humans might. The more interesting question will be what such an AI would do with words that native speakers of a language do not agree on, e.g. English words like "crap" or "damn." I think the future problem will be AIs that are either too conservative or too liberal when it comes to censoring "mild" vulgarities (and then I will probably say, "a sufficiently advanced AI would adjust the censorship level by the audience to avoid that problem," etc.).
I am not sure it is "magic" so much as "still beyond the state of the art." I would be surprised if we never get to the point where AI can handle that kind of nuance, especially given how much progress has been made just in my lifetime. Maybe it will take longer than some expect, but it is not an impossibly hard task.
My wife and I noticed this very clearly when trying to use Dutch in the text chat in Elder Scrolls Online. A lot of words were censored because they were close to blacklisted words in English. The most frequent one was "kunt", second person present tense of "can".
Classics like that, which "everyone knows about", are still new to many thousands of engineers every year and definitely worth bringing out for an airing every now and again!
Afterwards gently point them towards the familiar examples of passwords (“Passwords can't contain spaces/be longer than 20 characters/must …”), email address validation (“I found this really complete regex that …”), names (“No, surnames are always at least 3 characters long, and …”), and sex/gender (“Haha, yeah I know gender isn't a binary option, but this is the sex field, it just records what's in the user's passport, and THAT is either male or female!”¹).
Speaking of which. About 8 years ago I casually mentioned the Diet Coke and Mentos thing to one of my closest friends. He had no idea what I was talking about. I got really excited and told him that we were going to go buy some immediately. He seemed really confused and sceptical for the whole trip there and back. In the end we had a wonderful time.
I'm from Scunthorpe! Never thought I'd see the day it made it to Hacker News...
To add something relevant to this post - I'm too young to remember this "problem" regarding the AOL filter, and my parents' first ISP was NTL, which is now Virgin Media. Of course the use of the town name with emphasis on the aforementioned profanity was common amongst the youth.
A boy in my class at school was from Scunthorpe. There were a couple of times the IT teacher came to investigate what he was doing, which would usually include the "hometown" on a forum profile.
I moved to Sheffield for University and then stayed for a couple years after for work. Have actually ended up in Leicester recently to live with my girlfriend. My job is quite forgiving in terms of location as I'm on the road quite often, or can work from home!
I'm not the person you were replying to, but I went to university in Manchester, worked there, then Oxford, Cambridge and then to the US. Currently in Seattle.
(I did do a stint in the IT department at North Lindsey College, which was... different.)
When I see discussions of these filter/namespace problems I am reminded of when a new coworker caused the administrators to reconsider their account naming policy because his last name is “Root”.
One of the most popular FIFA (as in the soccer video game) sites has the most naive filtering, so terms like "assist" and "passing" appear as "...ist" and "p...ing" which makes the latter look more like "pissing" :-D
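For anyone wondering how a filter ends up producing "...ist", here is a minimal Python sketch of what is presumably going on: a case-insensitive substring replace, shown next to a word-boundary variant that leaves "assist" and "passing" alone. The one-entry blocklist, the "..." mask and both function names are made up for illustration; what that site actually runs is unknown.

```python
import re

# Hypothetical one-entry blocklist, just enough to reproduce the symptom.
BLOCKLIST = ["ass"]
MASK = "..."

def naive_censor(text, blocklist=BLOCKLIST):
    """Mask every occurrence of a blocked string, even inside longer words."""
    for word in blocklist:
        text = re.sub(re.escape(word), MASK, text, flags=re.IGNORECASE)
    return text

def word_boundary_censor(text, blocklist=BLOCKLIST):
    """Mask a blocked string only when it stands alone as a whole word."""
    for word in blocklist:
        pattern = r"\b" + re.escape(word) + r"\b"
        text = re.sub(pattern, MASK, text, flags=re.IGNORECASE)
    return text

print(naive_censor("assist"))           # ...ist
print(naive_censor("passing"))          # p...ing
print(word_boundary_censor("passing"))  # passing
```

Word boundaries only remove the most embarrassing false positives. They still miss obfuscated spellings and still flag words that are perfectly innocent in another language.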
That reminds me of the clothing brand "Lonsdale", which contains "NSDA". So if you wear a jacket over it, you can hide the other letters. It was worn a lot in the German neo-nazi scene because Hitler's party was called "NSDAP".
While it is an accident with "Lonsdale", and they actively did a campaign against it in the early 2000s, someone created a brand named "Consdaple", which contains the full "NSDAP".
I remember seeing someone on a night out who had removed the L, the S and the D. Actually, I'd never noticed that those removed letters have a meaning too so it was possibly a much deeper message.
One of the best examples of this was with the Pokemon games, where the creators had tried to block offensive words in nicknames without realising that some of their own characters had names that triggered the filters by default.
So for a while, you had a situation where players couldn't trade Cofagrigus, Froslass or Marshtomp on the GTS (Global Trade System) because the default species name set off the swear filter.
Hell, in one case, a Pokemon you received in an in-game trade couldn't then be traded online, since the nickname the NPC used would trigger the filters and wasn't usable by players.
It also shows the folly of trying to do this with multiple languages at once, since the filters block anything with a name that's offensive in any language the games are released in. So in some cases, you see Japanese players blocked from using a name because some characters match a swear word that's apparently a thing in France or what not.
Makes you feel sorry for the folks at Game Freak having to name these characters in the first place, or the translators trying to make them work in other languages. Imagine trying to name 100 new characters per game in a way that doesn't contain any unintentionally offensive words in any of the 9 or so languages the games are released in...
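One cheap guard against the Froslass situation is to run your own catalogue of known-good names through the filter before shipping. A rough Python sketch, assuming a plain substring filter; the single blocklist entry and the function names are hypothetical, not whatever the games actually use:

```python
def blocked_by(name, blocklist):
    """Return the blocklist entries found as substrings of name (case-insensitive)."""
    lowered = name.lower()
    return [entry for entry in blocklist if entry.lower() in lowered]

def audit_known_good(names, blocklist):
    """Map each known-good name the filter would reject to the entries it hit."""
    report = {}
    for name in names:
        hits = blocked_by(name, blocklist)
        if hits:
            report[name] = hits
    return report

# Hypothetical data: one blocklist entry, two official species names.
print(audit_known_good(["Froslass", "Pikachu"], ["ass"]))
# {'Froslass': ['ass']}
```

A non-empty report means the filter rejects names you yourself hand out, which is worth knowing before players find out.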
Back in my days working in Infosec at a bank, we had a naive filter on Internet bound mail looking for "bad words".
I had to review the blocked mails if there was a query about why it was blocked.
One time I came across a mail and couldn't for the life of me work out the reason, so I ended up writing a script to match against the wordlist to find the problem.
Turns out the troublesome phrase was "Don't be too hard on yourself".
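Not the original script, but a minimal Python sketch of that kind of lookup: report every wordlist entry that occurs in a message, and where. The wordlist below is hypothetical, and "hard on" as the entry that matched is an assumption, since the real list isn't given.

```python
import re

def find_triggers(message, wordlist):
    """Return (entry, offset, matched_text) for each wordlist hit in message."""
    hits = []
    for entry in wordlist:
        # Case-insensitive substring search, like a naive mail filter would do.
        for match in re.finditer(re.escape(entry), message, flags=re.IGNORECASE):
            hits.append((entry, match.start(), match.group(0)))
    return hits

# Hypothetical wordlist; "hard on" is assumed to be the entry that matched.
wordlist = ["hard on", "damn"]
for entry, offset, text in find_triggers("Don't be too hard on yourself", wordlist):
    print(f"{entry!r} matched {text!r} at offset {offset}")
```

Printing the offending entry next to the blocked mail saves the reviewer from reading the whole wordlist by eye.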
When I was at University of Cincinnati they had a "recreational computing" mailing list. One day some CS people decided to build a cocktail table video game console using MAME or something like that. This rapidly became the MALE GENITALIAtail table (yes, all caps) due to some kind of obscenity search and replace on the list. The name stuck and there was a MALE GENITALIAtail table in the student union for a while.
I remember getting an autoban on a Twitch chat for using a word that happened to start with the letters "paki." That word was "Pakistani." So apparently just being from Pakistan is the part that's offensive.
On the Wikipedia page for Scunthorpe (not the problem) the etymology of the name is given. Seems that we wouldn't be having the Scunthorpe Problem if the people of Scunthorpe could be bothered to spell their town's name correctly:
The town appears in the Domesday Book (1086) as Escumesthorpe, which is Old Norse for "Skuma's homestead", a site which is believed to be in the town centre close to where the present-day Market Hill is located.
If they changed the 'c' for 'k' then they wouldn't get modded down in the AOLs of this world and nobody outside of Lincolnshire, England would have any reason to even know of the place.
Or we could not unilaterally force name changes on account of technical limitations and software implementation faults. Only a right cunt would do that.
TL;DR: enumerating badness doesn't work, and they're Holding It Wrong.
Right. "If everyone would just bend over backwards, my current self-inflicted problem would be solved."
Now go and tell your thing to all those towns east of 14E named "Horny $X". Never mind that the names are correct, that it means "Upper $X": there's an entitled programmer on HN whose naive filter doesn't match reality - so of course the solution is to bend reality to these arbitrary demands: GET TO WORK LAZY SLOBS!
(After you're done, there's even more travel for you: https://mashable.com/2016/11/22/world-map-rude-place-names/?... . Be warned though: residents of Fucking, Innviertel, Austria have had it up to here with all those English and Merkins being righteously offended about a name that's completely innocuous in the local language.)
My point was more to do with thinking differently, not knee-jerk reacting to some click bait about rude town names.
We all know of the 'Scunthorpe Problem' (at least on HN) but nobody cares to then think why Scunthorpe is called Scunthorpe. Even the Wikipedia page on the 'Scunthorpe Problem' doesn't link to the other Wikipedia page on Scunthorpe that divulges the etymology of the name.
Had the people there not changed their town's name then they would have been fine. None of the Anglo-Saxon four letter words are new; they could have avoided the problem people have with the town name.
Before the internet there was witty graffiti to be found in places such as public conveniences. Graffiti wasn't just inane tags in those days, people had something to say, including 'who put the [redacted] in Scunthorpe?', a 'quote' I remembered because it was funny. So the 'problem' is not a new one, it has been around for a long time, pre-dating the internet and iPhones by many, many decades.
Clearly there is more to the etymology than what was handed down with the Domesday Book: there are centuries of language evolution, plus the pride a town has in its name.
So there is no trite 'holding it wrong' point here and there certainly isn't any egotistical programmer telling the world to name their towns in American English friendly Unicode, with all humour and culture removed. Just a not-mentioned point on how Scunthorpe used to be spelt differently, and, had this original spelling stayed the course, the town would not be the poster child for badly named towns.