Homoglyphs are DISGUSTING
This is not the brexit I voted for.
What's old is new again as infosec bods are sounding the alarm over a fresh wave of homoglyph characters being used to lure victims to malicious fake websites. Researchers at Soluble today said they worked with Verisign to thwart the registration of domain names that use homoglyphs – non-Latin characters that look just like …
Have some sort of rule that if it looks the same, it is the same
For example, if I were to visit TheRegister.co.uk, the user agent would convert it to theregister.co.uk before doing the DNS lookup.
Likewise, it should be possible for the user agent to convert thеregister.co.uk to theregister.co.uk before looking it up. (The difference in my example is that the first e is taken from the Cyrillic alphabet.)
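To make the lookalike concrete, here is a rough Python sketch of what actually goes over the wire for the two names. It leans on Python's built-in "idna" codec (which implements the older IDNA 2003 rules); the helper name wire_form is made up for this example, and a real browser does rather more than this.

```python
# Two names that render almost identically; the second uses U+0435
# (CYRILLIC SMALL LETTER IE) in place of the first Latin "e".
latin = "TheRegister.co.uk"
spoof = "th\u0435register.co.uk"

def wire_form(name: str) -> bytes:
    """Lower-case the name, then apply Python's built-in IDNA codec."""
    return name.lower().encode("idna")

print(wire_form(latin))  # plain ASCII, changed only by the lower-casing
print(wire_form(spoof))  # first label becomes an "xn--" Punycode label
```

The case difference disappears, but the Cyrillic letter does not: the spoofed name is a different name as far as DNS is concerned, which is exactly why it can be registered separately.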
Right - OCR "homonyms" should all be translated to the appropriate charset before name lookups happen, or at least before registrars accept them as non-duplicates.
And doing periodic name cleanup might be a good idea, requiring takedowns of any domain that's a lookalike (assuming it's being used for fraud).
So basically construct a map of Unicode characters to their ISO 8859-1 lookalikes, then run every domain name through that map and see if duplicates show up.
I assume languages other than English might need something similar.
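A minimal sketch of that map-and-compare idea in Python, for what it's worth. The LOOKALIKES table is a tiny hand-picked assumption covering a few Cyrillic letters, and the names skeleton and find_collisions are invented for this example; a real table would be far larger.

```python
from collections import defaultdict

# Tiny illustrative lookalike table (an assumption, nowhere near complete).
LOOKALIKES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0445": "x",  # CYRILLIC SMALL LETTER HA
    "\u0456": "i",  # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
}

def skeleton(name):
    """Lower-case the name and swap each known lookalike for its Latin twin."""
    return "".join(LOOKALIKES.get(ch, ch) for ch in name.lower())

def find_collisions(names):
    """Group names by skeleton; keep only groups with more than one member."""
    groups = defaultdict(set)
    for name in names:
        groups[skeleton(name)].add(name)
    return {key: members for key, members in groups.items() if len(members) > 1}

registered = ["theregister.co.uk", "th\u0435register.co.uk", "example.com"]
for key, members in find_collisions(registered).items():
    print(key, "<-", sorted(members))
```

Any group with more than one member is a set of names that look the same under this table, which is the list a registrar would want to review.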
Not really. For the font to be usable, it has to make glyphs look recognisable to a native user of the script. Scripts, in turn, have a habit of containing (for example) a letter that looks like a small circle. You can't make that look different from another small circle without making at least one of them look wrong.
On the other hand, a mixture of scripts within the same part of a domain name is almost certainly dodgy, so there does appear to be an easy way for browsers to detect the fraudsters.
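As a rough illustration of the mixed-script check, here is a Python sketch. Python's standard library has no direct script lookup, so this takes the first word of each character's Unicode name (for example "LATIN" or "CYRILLIC") as a crude stand-in for its script; that shortcut, and the function names, are assumptions of this sketch rather than how any browser does it.

```python
import unicodedata

def scripts_in_label(label):
    """Return the set of (approximate) scripts used in one domain label."""
    scripts = set()
    for ch in label:
        if ch.isdigit() or ch == "-":
            continue  # digits and hyphens are shared across scripts
        # First word of the character name, e.g. "LATIN SMALL LETTER E".
        scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def looks_mixed(domain):
    """True if any single label of the domain mixes two or more scripts."""
    return any(len(scripts_in_label(label)) > 1 for label in domain.split("."))

print(looks_mixed("theregister.co.uk"))       # all-Latin labels
print(looks_mixed("th\u0435register.co.uk"))  # Latin plus one Cyrillic letter
```

This only catches mixtures. A label written entirely in lookalike Cyrillic letters would pass, and the name-prefix shortcut would wrongly flag legitimate combinations such as Japanese kanji with hiragana.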
there does appear to be an easy way for browsers to detect the fraudsters
I'm not so sure about that. Back in the days of "code pages" numeric values were interpreted in a cultural context to determine the glyph they represented. These days, Unicode code points just represent glyphs, there isn't any real concept of which "script" they represent - related characters are grouped into blocks, but there may be multiple blocks associated with a particular language group - and more can be added over time. Some glyphs are omitted from certain blocks because they already exist elsewhere. And there's no reason to restrict a domain to the glyphs common in one particular culture - it might legitimately combine, say, Chinese characters with "Arabic" numerals. There have also been complaints that Unicode fails adequately to distinguish superficially similar glyphs from different languages - particularly in Asia. Simple rules won't help you work out which Unicode strings are likely to be deceptive in general: they'll only tell you which ones aren't ASCII.
The Unicode standard tries very hard to make characters that look the same map to the same Unicode character across many scripts, e.g. many Han (CJK) characters belong to both the Japanese and the Chinese character sets and have the same Unicode code point.
In fact, most scripts use the same numeral system.
However, as this article points out, it is still possible to have two very similar-looking characters with different codes. Either it just slips through, or it just happens that something that looks like an 'o' also exists in another script. Pulling the 'o' from one script into another can create a pockmarked character set, making many string operations difficult: is 'o' < 'p' if the 'o' is pulled in from another script?
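A quick Python illustration of that ordering question, using plain code point comparison (which is what Python's < does on strings). The variable names are mine, and real collation is locale-dependent and more involved than this.

```python
latin_o = "o"          # U+006F LATIN SMALL LETTER O
cyrillic_o = "\u043e"  # U+043E CYRILLIC SMALL LETTER O

# Both render as a small circle, but they sort very differently
# relative to Latin "p" (U+0070).
print(latin_o < "p")          # True: U+006F comes just before U+0070
print(cyrillic_o < "p")       # False: U+043E is far above U+0070
print(latin_o == cyrillic_o)  # False: different code points
```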
There are also political ramifications.
Pulling in characters from other, similar scripts can cause a sudden rise in the temperature of the injured party. However, if the characters look similar, it is probably because the two cultures fought a few battles, leading to an exchange of ideas and knowledge.