Hey, AI software developers, you are taking Unicode into account, right ... right?

Computer scientists have detailed ways in which AI language systems – including some in production – can be hoodwinked into making bad decisions by text containing unseen Unicode characters. Account numbers can be switched around, recipients of transactions changed, and comment moderation bypassed by special hidden characters …

  1. MarkET


    Reminds me of an old 'bit toggling' trick in image files for sending messages. Specific bit offsets would constitute a message without affecting the visual representation of the image. Easy to get a 1K message into a hi-res photo. Even better if those bits contain a new offset for the next image...
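The bit-toggling trick described above can be sketched in a few lines of Python. This is only an illustration of the general least-significant-bit idea, not the specific scheme the commenter used: the function names `embed_bits` and `extract_bits` are made up here, and the "image" is assumed to be a raw buffer of pixel bytes rather than a real file format.

```python
def embed_bits(pixels: bytes, message: bytes) -> bytearray:
    """Hide message in the lowest bit of successive pixel bytes.

    Each message byte consumes eight pixel bytes, most significant bit first.
    """
    if len(message) * 8 > len(pixels):
        raise ValueError("message too long for this pixel buffer")
    out = bytearray(pixels)
    for i, byte in enumerate(message):
        for bit in range(8):
            value = (byte >> (7 - bit)) & 1
            idx = i * 8 + bit
            # Clear the lowest bit, then set it to the message bit
            out[idx] = (out[idx] & 0xFE) | value
    return out


def extract_bits(pixels: bytes, length: int) -> bytes:
    """Recover length message bytes from the lowest bits of the pixel buffer."""
    result = bytearray()
    for i in range(length):
        byte = 0
        for bit in range(8):
            byte = (byte << 1) | (pixels[i * 8 + bit] & 1)
        result.append(byte)
    return bytes(result)


cover = bytes(range(256)) * 4          # stand-in for raw pixel data
stego = embed_bits(cover, b"hello")
print(extract_bits(stego, 5))
```

No pixel byte changes by more than one, which is why the picture looks the same. On those assumptions, a 1K message needs 8,192 pixel bytes, a tiny fraction of a hi-res photo.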

    1. codejunky Silver badge

      Re: Manipulation


      Ahh, steganography. Great fun to play with back in uni.

      1. codejunky Silver badge

        Re: Manipulation

        How on earth have I upset 4 idiots with my comment? Stupid bots

  2. logicalextreme

    I've seen (on Usenet, I think) .exe files disguised as .jpg or .txt files through clever use of tricks like these. I only figured out what was going on when I used Python to take a look at the actual filenames, because they fooled both Windows Explorer and cmd. I was pretty impressed!
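A rough sketch of the kind of Python inspection described above, assuming the trick relied on invisible format characters such as U+202E (RIGHT-TO-LEFT OVERRIDE). The helper name `hidden_format_chars` and the sample filename are invented for illustration.

```python
import unicodedata


def hidden_format_chars(name: str) -> list[tuple[int, str]]:
    """Return (index, Unicode name) for every format-category (Cf) character."""
    return [
        (i, unicodedata.name(ch, f"U+{ord(ch):04X}"))
        for i, ch in enumerate(name)
        if unicodedata.category(ch) == "Cf"
    ]


# Hypothetical filename: a bidi-aware display renders it as "photoexe.jpg"
suspicious = "photo\u202egpj.exe"

print(ascii(suspicious))                # escapes make the hidden character visible
print(hidden_format_chars(suspicious))
```

`ascii()` (or `repr()` on bytes) is usually enough to spot the problem by eye. The category check is only a convenience for scanning many names at once.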

  3. Irongut

    You don't need invisible Unicode characters: just use Google Translate normally and it will mangle numbers, translate people's names (sometimes, but not all the time), change the results of sporting competitions, change the tense and even change the meaning of some sentences.

    It also tries really, really hard to infer an American context for all things, which can change the meaning of a sentence, when it doesn't just render it completely incomprehensible.

    I use Google Translate daily and every time I do it proves we don't need to worry about the AI uprising any time soon.

    1. katrinab Silver badge

      And my favourite, every time Google Translate comes up.

      An Italian document that included a list of countries. One of those countries was "Macedonia" before it changed its name to North Macedonia. Google translated it as "Fruit Salad".

      You will find tins of macedonia for sale in your local Italian supermarket, but humans would understand the context and realise it was referring to the country rather than the food item.

    2. find users who cut cat tail

      > we don't need to worry about the AI uprising any time soon

      AI uprising has never been the problem. We will destroy ourselves using dumb machine learning long before getting close to developing a general AI.

      The way it is going someone soon nukes a country while trying to erase an old Johnny Cash collection…

    3. Dan 55 Silver badge

      Tried DeepL? The output is usually better than Google Translate's.

  4. Jonathan Richards 1 Silver badge

    Nothing new under the sun

    Good heavens! It's like these people never saw posters writing control codes into their Usenet contributions. Click up arrow now^W^W^W^W

  5. Citizen of Nowhere

    Back when I did tech support, a lot of the bugs we came across were related to text encodings and how they were (mis)handled going between different applications and systems. Once got a box of beer sent to me from Norway after I managed to identify a bug with the handling of the thorn character used in Norwegian and get it squashed. I guess I don't have the right mindset (cybersecurity researcher/cybercriminal) because I would never have figured out how to systematically exploit them.

    1. Paul Herber Silver badge

      Thorn (Þ, þ) and Eth (Ð, ð) are in Icelandic (and Eth is in Faroese), not Norwegian. (sorry to be a pedant).

      1. Jan 0

        Never mind, they have some crackin' beer in Iceland too.

    2. Rtbcomp

      I once wrote a Z80 machine code program to run on a CP/M machine using only the instructions that corresponded to ASCII characters. Typed it in as a .TXT file, renamed it .EXE and ran it.

      1. MarkET


        Pedantic, but it needs to be a .COM to run on CP/M.

        Good old Mark Zbikowski defined the EXE format for Microsoft. They all still carry his initials, MZ, in the first block.

      2. It's just me

        The EICAR AV test file is an ASCII string that can be renamed with .com and is a valid DOS program.

  6. katrinab Silver badge

    Spammers have been doing this for years, and spam detectors seem to be generally able to deal with this, probably by treating any email that contains such characters as spam.

    1. find users who cut cat tail

      > treating any email that contains such characters as spam

      Easy if you only speak English.

      If your native language is RTL on the other hand…

    2. DS999 Silver badge

      You can treat some of it as spam

      There's no legitimate reason to have that code that turns 1234 into 4321 in an email, or the code that backspaces (U+0008 is just ASCII 8, or Ctrl-H). I don't think it would be unreasonable for a mail client to refuse to send such an email, or a mail server to bounce it back to you as potential spam (so long as it tells you what its issue is).

      For stuff like the many, many variations of the letter 'a' in all the charsets, the spam classifier can just turn any 'a' (possibly including ones with umlauts etc.) into a standard 'a' as far as matching goes. That way, if you are trying to match the word "spam", you will catch it no matter what 'a' it is written with.

      Not sure why that code that reverses the direction of text is a part of Unicode - wouldn't you just write that way if you're using a right to left language? That should be the role of the document layout language, i.e. HTML, PDF, TeX, etc. not the character set! No wonder Unicode is such a disaster for trying to enforce security, prevent spam etc.
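The "turn any 'a' into a standard 'a'" idea above might look something like the following sketch. It uses Unicode compatibility decomposition (NFKD) from Python's standard `unicodedata` module and then drops combining marks. The function name `fold_for_matching` is hypothetical, and this is one possible approach rather than how any particular spam filter works.

```python
import unicodedata


def fold_for_matching(text: str) -> str:
    """Fold accented and compatibility variants to a plain form for matching only."""
    # NFKD splits "ä" into "a" + combining diaeresis and maps fullwidth
    # or styled letters to their ordinary counterparts
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks left behind by the decomposition
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


print(fold_for_matching("SPÄM"))                       # accented capital
print(fold_for_matching("\uff53\uff50\uff41\uff4d"))   # fullwidth letters
```

Normalization alone does not cover lookalikes from other scripts: a Cyrillic "а" (U+0430) has no decomposition to the Latin letter, so catching it would need a separate confusables mapping.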

      1. damiandixon

        Re: You can treat some of it as spam

        The code to reverse the direction is used when you mix left and right written languages.

        For example Hebrew with numbers or actually names of streets in Jerusalem...

        1. Yet Another Anonymous coward Silver badge

          Re: You can treat some of it as spam

          >For example Hebrew with numbers or actually names of streets in Jerusalem...

          Probably easier to just rename all the streets in Jerusalem than sort out the Unicode.

          Unicode committee can get really political

        2. DS999 Silver badge

          Re: You can treat some of it as spam

          That still seems like something that should be handled by the layout. i.e. if you are typing an email in Hebrew, and you click the "this bit here should be shown left to right" option it will force the email to be formatted in HTML and use HTML's layout options to handle it.

          While there may have been a role for it when Unicode was created (when people might be passing raw text around) it could easily be banned by email clients/servers today without negatively impacting anyone.

          1. Anonymous Coward

            Re: You can treat some of it as spam

            I've implemented this algorithm (UAX9), and know it very well. Broadly yes, it is usually handled by layout, with the HTML BDO etc tags to modify the layout algorithm where required. But the algorithm is still based on control characters, and those tags are effectively just a shorthand way to produce those characters.

            Control characters are required because not all text on the web is marked up. If you want to mix English and Hebrew in a text box, or in any section of non-marked-up plain text - say a plain text email - then they're required.

            Does the layout algorithm need to allow this sort of manual override? Yes. I can't do better than the guide at to explain why.
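For readers who have not met UAX #9, a small sketch of its raw inputs may help. Every character carries a bidirectional class, and the explicit control characters discussed above are simply characters with their own classes. Python's standard `unicodedata.bidirectional` exposes these classes. The helper name `bidi_classes` is made up here, and this shows only the per-character classification, not the layout algorithm itself.

```python
import unicodedata


def bidi_classes(text: str) -> list[tuple[str, str]]:
    """Pair each character's code point with its Unicode bidirectional class."""
    return [(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch)) for ch in text]


# Latin letter, Hebrew alef, a European digit, and RIGHT-TO-LEFT OVERRIDE
sample = "a\u05d01\u202e"
for code_point, bidi_class in bidi_classes(sample):
    print(code_point, bidi_class)
```

The Latin letter is class `L`, the Hebrew letter `R`, the digit `EN` (European number), and the override character `RLO`. Plain text has no markup to carry that last instruction, which is why it has to exist as a character.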

          2. John Jennings

            Re: You can treat some of it as spam

            That sounds kludgey, and like wishful thinking from a practical standpoint.

            That particular Unicode character was used as an example - there are others that do interesting things to accommodate other formatting which isn't US/English specific. It would be difficult, if not impossible, to consider all the possibilities for hijinks.

            Even with this particular one: for example, you might write an email in English, then add an address in Hebrew.

            A (legitimate) user wouldn't be typing the Unicode in manually - they would use some keyboard shortcut or select a keyboard in the OS.

  7. Metro-Gnome

    Long time to find

    I had a Unicode issue when code written, compiled and running smoothly on US, UK and French machines would not run on machines in Japan. It turns out the U+300A and U+300B some nice person had used in their comments were converted to a local 《 and 》, which parsed completely differently. Caused a lot of global head scratching and the universally popular "it works on my machine".
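A check along these lines could have shortened the hunt. The sketch below reports every non-ASCII character in a piece of source text with its position and Unicode name. The function name `non_ascii_chars` and the sample snippet are invented for illustration, and it assumes the source has already been decoded correctly.

```python
import unicodedata


def non_ascii_chars(source: str) -> list[tuple[int, int, str, str]]:
    """Return (line, column, code point, name) for each non-ASCII character."""
    findings = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 0x7F:
                findings.append(
                    (line_no, col, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
                )
    return findings


snippet = "x = 1  # \u300anote\u300b\ny = 2\n"   # hypothetical source file contents
for finding in non_ascii_chars(snippet):
    print(finding)
```

Run over a source tree as a pre-commit or build step, this makes such characters visible before they reach a machine with a different default encoding.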

  8. Robert Grant Silver badge

    When I was in junior school we would spell out "s h 1 t" rather than say a rude word. This feels similar.

    1. A.P. Veening Silver badge

      > When I was in junior school we would spell out "s h 1 t" rather than say a rude word. This feels similar.

      It is more like the IT comment about an ID-ten-T error.

    2. Paul Hovnanian Silver badge

      That was just to keep from getting caught by the P.C. N@zis.

  9. NetBlackOps

    Not a new consideration here. Input sanitisation has always been a core consideration in everything I do. I adopted it due to my work in statistics and artificial intelligence. GIGO.

  10. nautica Bronze badge

    ♫"That Old Black Magic Has Me In Its Spell..."♫

    Two points--

    One has to wonder: exactly WHY the primary emphasis here is on "AI", and "neural networks"?

    The only obvious--and also, at the same time, very subtle--reason is "pandering"; pandering to the fact that "Artificial Intelligence" and "neural networks" are subsets of that very large set of current psychological 'trigger words' which are, for the most part, meaningless and totally misunderstood by the majority, even of the very technically and scientifically astute, but which generate, almost automatically, a large number of "clicks"--and hence, revenue--for the publication.

    [Other "click-bait-generator (CBG)" words/phrases include, but are not limited to, "quantum computing" (the REAL biggie right now, even though its supposed experts cannot even define it); "machine learning"; "cloud computing"; "room-temperature superconducting"; "super-string theory"...well, you get the idea. And yes: sadly, when some 'author' encounters 'writer's block' simultaneously with a looming deadline, (s)he will trot out that oldie but goodie, sure-fire, CBG: cold fusion. ]

    The second point is similar to the first--the use of Unicode as the 'perpetrator' of the indignities highlighted here. This is simply one more use of a 'buzzword' to generate page-clicks. As many of the comments here have pointed out, all forms of trickery were performed with existing code-sets long before Unicode became a 'thing'.

    "Extended ASCII Character Set" simply doesn't have the same ring to it as "Unicode". Nor the CBG potential.

  11. The Dark Side Of The Mind (TDSOTM)

    Of all the names...

    "Nicolas Papernot, co-author of the paper and an AI security researcher"

    Paper NOT Nicolas?

  12. ShadowSystems Silver badge

    I want an emoji filter.

    They're pictures. I'm totally blind. Thus they are utterly worthless. No professional business should ever use one in what is supposed to be plain text email. I just want a filter that sends all emoji-infested emails to my junk mail folder as the unwanted garbage they are.

  13. matthewdjb

    People have been using such tricks to try to bypass moderation on the forum I moderate for years.

    Automatic temporary ban. To start with.

    The trouble with trying it on the Times automotive automoderation (for example) is that some beggar is going to report it to the humans.

  14. Anonymous Coward

    Obligatory XKCD

    Reminds me of Little Bobby Tables. But that's for a database.

  15. Paul Hovnanian Silver badge

    Poor error handling

    If the AI or display app doesn't know how to handle a character, it shouldn't ignore it. There are "unknown encoding" symbols (question mark in a diamond or rectangle with hex bytes) made just for this purpose. AI can be written to kick stuff out that it can't parse rather than sending people to the wrong Arabic street number.

    Google translate is sh*t because it doesn't highlight single words that it fails to translate when leaving them in the source language.
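Both behaviours suggested above, showing a visible "unknown" symbol or kicking the input out, are available in Python's standard decoder. The sketch below assumes the input is meant to be UTF-8, and the function names `decode_visibly` and `decode_or_reject` are made up for illustration.

```python
def decode_visibly(raw: bytes) -> str:
    """Decode as UTF-8, marking undecodable bytes with U+FFFD instead of dropping them."""
    return raw.decode("utf-8", errors="replace")


def decode_or_reject(raw: bytes) -> str:
    """Decode as UTF-8, refusing the whole input if any byte cannot be parsed."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ValueError(f"rejecting input: invalid byte at offset {exc.start}") from exc


raw = b"caf\xe9 12"   # Latin-1 bytes, which are not valid UTF-8
print(ascii(decode_visibly(raw)))
```

U+FFFD is the "question mark in a diamond" replacement character. The option to avoid in this context is `errors="ignore"`, which silently discards the bad bytes.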

  16. LDS Silver badge

    The dark shadows of C and Unix...

    .... where text was not regarded as an important part of programming needing specific support, and programmers thought it could be handled like just an "array of bytes" and comparison could be simple byte-by-byte matches. Without understanding it could work only for a very small subset of languages - "ASCII English" only.

    It looks like this very narrow mindset is still alive today - I would think people interested in language processing would have known more about how to properly "normalize" text before processing it - but once again it looks like they know and understand English only.

    That's also why any translation that doesn't use English has a far bigger chance of being even worse than those from or to English.
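A short illustration of why byte-by-byte matching falls short and what normalizing first buys. The same visible word can be encoded in more than one way, and only normalization makes the two forms compare equal. The helper name `same_text` is hypothetical, and NFC is just one of the standard normalization forms.

```python
import unicodedata


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


composed = "caf\u00e9"      # "é" as a single code point
decomposed = "cafe\u0301"   # "e" followed by a combining acute accent

print(composed.encode("utf-8") == decomposed.encode("utf-8"))   # byte-by-byte comparison
print(same_text(composed, decomposed))                          # comparison after NFC
```

Both strings render as "café", yet the byte comparison fails and the normalized comparison succeeds.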

    1. Peter Gathercole Silver badge

      Re: The dark shadows of C and Unix...

      Unix and C pre-date almost all internationalisation in computing.

      You could just as easily say the same about CP/M, DOS/360, RSX-11, and RSTS, and pretty much any other OS from the 1960's and 1970's, but none of these have survived.

      The closest thing to internationalised text were the code pages which altered various codepoints to specific national characters. Because all I/O devices like terminals and printers were pure hardware (rather than soft-defined with pixel addressable displays and printers that we have now), using anything other than US ASCII was a challenge involving setting switches to define which characters were where, even when the individual changes were minimal, like UK currency.

      At this time, characters and bytes could be treated the same, so C never really needed to handle NLS in any special way.

      Unfortunately, despite the Open Systems movement, UNIX became proprietary, and different vendors implemented NLS in different ways. There was a period of time where non-ASCII character set handling was a free-for-all, with ISO, Microsoft, IBM and others all having their own idea of how NLS character sets should be implemented. It was only when multi-byte character sets and UTF-8 became the lingua franca of interoperability that things started to settle down.

  17. FlamingDeath Silver badge

    Intelligence is hard, even humans have not mastered it yet. It’s why we’re desperately searching in outta space

  18. Tron

    Not new.

    That's just a posh version of what most of us have been doing for years to dodge the c®et1n0u5 filters on social media.

    Academics get paid for this? Damn. I'm such a fool. All these years I've been working for a living.

  19. Man inna barrel Bronze badge

    Foreigners up to no good

    That proves it. Unicode is a devious plot for Cyrillic and Chinese to take over our computers, whereas all decent folks use 7-bit ASCII, as God intended.

  20. Henry Wertz 1 Gold badge

    input validation

    to me this is an input validation problem. i'd make sure "change direction" and "delete" type characters are handled before the data gets to the ai system, that unicode characters that look like others are handled (sticking a cyrillic or greek or chinese "letter that looks like e" in the middle of otherwise-roman characters, switch it to an e) and so on.
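The pre-processing described above could be sketched roughly as follows. Everything here is an assumption made for illustration: the function name `sanitize`, the tiny `LOOKALIKES` table (a real confusables list is far larger), and the choice to simply drop direction-changing and delete-type characters rather than interpret them.

```python
import unicodedata

# Tiny illustrative table: Cyrillic е, а, о and Greek omicron mapped to Latin letters
LOOKALIKES = {"\u0435": "e", "\u0430": "a", "\u043e": "o", "\u03bf": "o"}


def sanitize(text: str) -> str:
    """Strip invisible control/format characters and fix lookalikes in Latin words."""
    # Drop format characters (bidi overrides, zero-width characters) and
    # control characters such as backspace, keeping newlines and tabs
    cleaned = "".join(
        ch
        for ch in text
        if unicodedata.category(ch) != "Cf"
        and (unicodedata.category(ch) != "Cc" or ch in "\n\t")
    )
    words = []
    for word in cleaned.split(" "):
        # Only rewrite lookalikes inside words that also contain ASCII letters,
        # so genuine Cyrillic or Greek words are left alone
        if any(ch.isascii() and ch.isalpha() for ch in word):
            word = "".join(LOOKALIKES.get(ch, ch) for ch in word)
        words.append(word)
    return " ".join(words)


print(sanitize("p\u0430y 12\u202e34\x08 now"))
```

Dropping every format character is a blunt choice: it would also remove the direction marks that legitimate mixed-direction text relies on, so a production filter would need a more selective policy.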

Biting the hand that feeds IT © 1998–2022