Beware of Greeks bearing spammy small omicrons, says Google

David Cantrell

The Unicode Consortium's permitted combinations of scripts are a bit odd. Their "highly restrictive" level will ban identifiers that mix Latin and Arabic or Latin and Hebrew or ... well, you get my drift, while explicitly permitting Latin mixed with Far Eastern scripts. That's just weird. It's fairly common to see text in all sorts of scripts with Western digits (0-9) embedded in it.

And in the "moderately restrictive" level they single out the combinations of Latin + Cyrillic and Latin + Greek, which would ban things like Αθήνα2004 or Со́чи2014.
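For anyone who wants to poke at strings like these, here is a rough Python sketch that lists what kinds of characters an identifier contains. It is only an illustration, not the consortium's algorithm: the function name `rough_scripts` is made up, and it assumes the first word of each character's Unicode name (LATIN, GREEK, CYRILLIC, ...) is a good enough stand-in for the real Script property, which Python's standard library doesn't expose.

```python
import unicodedata

def rough_scripts(text):
    """Return a sorted list of rough script labels found in text.

    Assumption: the first word of a letter's Unicode name is used as a
    stand-in for its script. Decimal digits are labelled DIGIT; combining
    marks, punctuation and everything else are skipped.
    """
    labels = set()
    for ch in text:
        category = unicodedata.category(ch)
        if category.startswith("L"):
            # e.g. "GREEK CAPITAL LETTER ALPHA" -> "GREEK"
            labels.add(unicodedata.name(ch, "UNKNOWN").split()[0])
        elif category == "Nd":
            labels.add("DIGIT")
    return sorted(labels)

print(rough_scripts("Αθήνα2004"))
print(rough_scripts("Со́чи2014"))
print(rough_scripts("Athens2004"))
```

This should report GREEK plus DIGIT for the first string, CYRILLIC plus DIGIT for the second, and LATIN plus DIGIT for the third. One caveat: in the Unicode data the digits 0-9 carry the script value Common rather than Latin, so how a real checker treats these examples depends on how it handles Common characters.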

I know why they're doing it, but they're still going to end up hitting an awful lot of legitimate addresses and domains with this.

