Trojan Source attack: Code that says one thing to humans tells your compiler something very different, warn academics

The way Unicode's UTF-8 text encoding handles different languages could be misused to write malicious code that says one thing to humans and another to compilers, academics are warning. "What if it were possible to trick compilers into emitting binaries that did not match the logic visible in source code?" ask Cambridge …

  1. This post has been deleted by its author

  2. Gene Cash Silver badge

    No examples

    It's interesting that there's absolutely no downloadable examples of bidirectional text anywhere.

    There's not one text file I could download and look at with (for example) emacs or Android Studio.

    So I have no way of determining if this has an impact on me or not in my usual coding environments.

    1. Anonymous Coward
      Anonymous Coward

      Re: No examples

      Let me email you bidirectional_text.PDF ... it will give you two illustrations:

      1) Typical bidirectional text illustrated in multiple languages.

      2) Instructions to email me 40 BTC to recover your system.

      Hmm, I guess I ought to post this with the Joke icon to get you to open it.

    2. gazthejourno (Written by Reg staff)

      Re: No examples

      If you open the PDF paper linked in the article and scroll to the end, there's a bunch of examples there. They're C&Pable.

    3. Tim 11

      Re: No examples

      Just paste some Arabic/Hebrew text and some English text into the same text file with an editor that supports it, then dump it out as hex and you'll see it

      1. Kristian Walsh

        Re: No examples

        That isn’t the vulnerability: Arabic and Hebrew text is automatically laid out right-to-left because that is the default line ordering for those codes - Latin text (including the keywords for programming languages) within larger blocks of Arabic and Hebrew is always laid out left-to-right by default. However, there are codepoints in Unicode which tell the text renderer to override that native ordering and instead render LTR as RTL and vice versa. This is the mechanism by which the vulnerability allows human reviewers to be fooled.

        The example in the paper is this line of Python:

        ''' Subtract funds from account then [RLI]''' ;return

        Which is rendered as:

        ''' Subtract funds from account then ⁧''' ;return

        That second line is the exact sequence of codepoints they talk about in the paper, with the Right-to-Left Isolate code (U+2067) in place.

        As you can see, the codes themselves are invisible, but they affect the display of the text following them, and that is how this vulnerability works, by making compilable code appear to be within a comment. But if you drag your selection cursor over the second line, you’ll see something isn’t quite what it seems...

        Most syntax-highlighters will catch this (I just checked, and VS Code does), and you’d see a “code-coloured” word inside the comment, but it can be subtle - especially in editors that style the text within comments (e.g., for doc-strings). If you use a terminal, your results are compounded further by how the terminal deals with bidirectional text.
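The invisible controls described above can be flagged mechanically. A minimal Python sketch (the codepoint list here is assembled from Unicode's explicit bidi control characters; the function name is my own):

```python
# Flag Unicode bidirectional control characters in source text.
# Covers the explicit bidi controls: LRE/RLE/PDF/LRO/RLO,
# LRI/RLI/FSI/PDI, plus the LRM/RLM/ALM marks.
import unicodedata

BIDI_CONTROLS = {
    '\u202A', '\u202B', '\u202C', '\u202D', '\u202E',  # LRE RLE PDF LRO RLO
    '\u2066', '\u2067', '\u2068', '\u2069',            # LRI RLI FSI PDI
    '\u200E', '\u200F', '\u061C',                      # LRM RLM ALM
}

def find_bidi_controls(text):
    """Return (line, column, codepoint name) for each bidi control found."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), 1):
        for col, ch in enumerate(line, 1):
            if ch in BIDI_CONTROLS:
                hits.append((lineno, col, unicodedata.name(ch)))
    return hits

# The paper's Python example, with the invisible RLI (U+2067) in place:
trojan = "''' Subtract funds from account then \u2067''' ;return\n"
print(find_bidi_controls(trojan))  # [(1, 38, 'RIGHT-TO-LEFT ISOLATE')]
```

A check like this in CI would catch the paper's examples regardless of how any particular editor renders them.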

        1. TRT Silver badge

          Re: No examples

          Am I just being thick here, or should [RLI] reverse the CHARACTER order and not, as the illustration shows, the WORD order?

          1. badflorist

            Re: No examples

            The "return" keyword will resolve. Pretend it resolves to \0; then the line becomes: \0; ''' Notice that the ';' comes BEFORE return, because if it came afterwards it wouldn't compile, as the line wouldn't be terminated (or it might, depending on the language).

            However, the problem really seems to be with how docstrings are handled, because allowing such keywords in docstrings is kind of crazy (is eval() allowed too?). Either way, maybe simply remove all unicode from anything other than std/gui print statements? Then if someone hacks that, well, all the rest is probably hacked too since they have write access.

          2. Draco
            Windows

            Re: No examples

            I'm too tired to read the Unicode documentation on the algorithm in detail.

            You can find it here:

            https://unicode.org/reports/tr9/

            But ... reversing ''' ; would give you ; '''

            The text that follows return is considered not to be part of the RLI chunk.

          3. Kristian Walsh

            Re: No examples

            No. The RLI code changes the layout default of the following characters to right-to-left, but it doesn’t override the behaviour of characters that want to associate left-to-right. There’s a different code (Right-to-left Override U+202E) which would force the renderer to ignore the character’s native bidi mode and treat it as Right-to-left, regardless.

            The idea of line layout defaults is tricky to get your head around if you don’t already know the rules for typesetting text in a mix of right-to-left and left-to-right scripts. Basically, an English word within Arabic will be shown in its proper order: letters of the word running from left to right, so you see “mouse mat”, not “tam esuom”, for example.

            To properly display text, the text renderer needs to figure out the line direction, but it only has a stream of characters to work with.

            To help, Unicode assigns each character a bidirectional class. Latin letters are “Strong Left-to-Right”, but punctuation is not strongly ordered: it follows the rule of the surrounding text, so an exclamation mark in Arabic will be to the left of the word it follows, while in English it will be to the right, yet both are coded as U+0021. To make this work, Unicode gives such characters the bidirectional classification “neutral”, which means the character follows the already established line ordering. This is what this paper exploits.

            In the example, the RLI control code changes the current layout intent of the line to “right-to-left” from that point onward in the code stream. Thus, the following characters, which are punctuation, adopt right-to-left ordering because they are all in the “neutral” bidi category and the renderer has been told that the layout is now right-to-left dominant. However, the strong left-to-right characters 'r e t u r n' are still rendered as left-to-right, because that’s how they naturally associate even in text that is predominantly right-to-left.

            https://en.wikipedia.org/wiki/Bidirectional_text

            And here’s the documentation for the Unicode Bidi Algorithm.

            http://www.unicode.org/reports/tr9/
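The classes described above can be inspected directly; Python's unicodedata module exposes each character's bidi class abbreviation (a small illustrative sketch):

```python
# Print the Unicode bidirectional class of a few characters:
# 'L'  = strong left-to-right, 'AL' = strong Arabic right-to-left,
# 'ON' = other neutral (follows surrounding direction),
# 'RLI' = the Right-to-Left Isolate control itself.
import unicodedata

samples = [
    ('r', 'Latin letter'),
    (';', 'semicolon'),
    ('!', 'exclamation mark'),
    ('\u0627', 'Arabic alef'),
    ('\u2067', 'RLI control'),
]
for ch, label in samples:
    print(f"{label}: {unicodedata.bidirectional(ch)}")
```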

  3. Doctor Syntax Silver badge

    "We reserve the right to arbitrarily rename the next security discovery FLAMINGHELLDEATHPWNAGE."

    I think the headline is a good start. Something along the lines of Trojan $LanguageOrProduct Source. It's going to need a lot of domains and logos.

    1. Anonymous Coward
      Anonymous Coward

      Back-to-the Future-Pain?

  4. Loyal Commenter Silver badge

    This reminds me of the prank...

    ...where one could replace random semicolons in someone's source code with the Greek question-mark character (which is a homoglyph in most fonts) and then watch them pull their hair out trying to work out why their code no longer compiles...
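For the curious, the prank character is U+037E GREEK QUESTION MARK, and Unicode defines a canonical mapping from it to the ordinary semicolon, so normalization undoes the joke (a small Python sketch):

```python
# The Greek question mark (U+037E) is a homoglyph of ';' (U+003B) in most
# fonts, but a distinct codepoint, so the pranked line fails to compile.
# Its canonical decomposition is U+003B, so NFC normalization repairs it.
import unicodedata

prank = "x = 1\u037e"                       # ends with the Greek question mark
print(unicodedata.name(prank[-1]))          # GREEK QUESTION MARK
fixed = unicodedata.normalize('NFC', prank)
print(fixed == "x = 1;")                    # True: back to a real semicolon
```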

    1. A random security guy

      Re: This reminds me of the prank...

      The thing is, Unicode was created to prevent this sort of thing from occurring: characters that look the same should encode to the same value irrespective of the language. This philosophy created real problems when CJK characters from Japan and China were encoded to the same values.

      1. Irony Deficient Silver badge

        Unicode was created to prevent this thing from occurring:

        characters that look the same should encode to the same value irrespective of the language.

        The unification of similar looking characters only happened when their functions were similar and when backwards compatibility with other character encodings was not a significant issue. For example, German „Ö“ and Swedish ”Ö” were unified into a single code point (U+00D6), despite being located in different places in their respective alphabets. Unification of thousands of CJKV characters happened because their meanings were common across languages, despite (in some cases) minor glyph differences. However, Latin “H” (U+0048), Greek «Η» (U+0397), Cyrillic «Н» (U+041D), and Cherokee “Ꮋ” (U+13BB) were not unified, primarily because of the backwards compatibility issue. (Never mind that the Greek letter transliterates as “Ē”, the Cyrillic letter transliterates as “N”, and the Cherokee letter is pronounced /mi/ or /miː/, with tonal variations.)

        1. A random security guy

          Re: Unicode was created to prevent this thing from occurring:

          Thanks!!! For CJK, I remember that unification created more "political problems" than technical problems, I believe. It has been 25 years since I worked on it (when it was first introduced), so thanks for the clarification and the updates. I am behind the times. Have to look up why they added Vietnamese to the set...

          1. Irony Deficient Silver badge

            why they added Vietnamese to the set

            Historically, Vietnamese was written using chữ Nôm (Han ideographs adapted for Vietnamese). The Latin Vietnamese alphabet, created by 17th century Jesuit missionaries, replaced everyday usage of chữ Nôm after World War I, during the French colonial era.

            1. Jonathan Richards 1 Silver badge

              Re: why they added Vietnamese to the set

              This was a fate somewhat narrowly avoided by Japan, too. The US Education Mission to Japan (1946) recommended that kanji, the characters adapted from Chinese writing, be replaced by romaji, the orthography of Japanese written with a Latin alphabet. This recommendation was based on little and outdated knowledge by the Mission members, ignored their terms of reference, and would have "invited the Japanese people to commit cultural suicide".

              Ref.: The First United States Education Mission to Japan [pdf]

              1. Ian Johnston Silver badge

                Re: why they added Vietnamese to the set

                I've done some work for a Japanese school, and the founder - a professor of education - says that not dropping kanji was a terrible mistake, because using them makes Japanese orthography ridiculously complicated.

                Remember that Japanese doesn't just use kanji with the Chinese meaning; it also uses them for completely different Japanese words which just happen to sound a bit similar. Hence, for example, the symbol for "tree" also means "Thursday" in Japan, but not in China. It's a complete mess.

        2. Loyal Commenter Silver badge

          Re: Unicode was created to prevent this thing from occurring:

          Indeed, the Greek letter that looks like the capital "H" in the Roman alphabet is actually the capital letter eta (the lower case eta looks like the letter 'n' with a tail, 'η'). Similarly, the capital letter rho looks like a 'P', and the lower case rho is similar to the lower case 'p' ('ρ'). If one were to start using the same unicode encoding for these letters, it would cause no end of confusion to have an encoding for lower case eta but not for uppercase (use H instead), and maybe have a separate encoding for lower case rho, depending on how it is represented in the font you are using.

          Pretty obviously these are all separate letters, and are, quite sensibly, encoded as such in unicode. How they are represented is a concern for the font in which they are rendered, not the encoding. After all, long gone are the days of typewriters with no key for the number 1, because a lower-case 'l' is perfectly good.

      2. Loyal Commenter Silver badge

        Re: This reminds me of the prank...

        My understanding of Japanese 'kanji' is that they are logographs "borrowed" from Chinese script (the name meaning literally 'Han character'), but have different meanings, and certainly different pronunciations (Han Chinese is a tonal language, and Japanese is not), so from a character point of view they are technically the same characters.

        This is in much the same way that the word 'a' is spelt exactly the same in English and in French, but has a completely different meaning (the indefinite article in English, and the third-person singular present tense of the verb avoir, to have, in French).

        It makes perfect sense for things that actually are the same character to be encoded in the same way. After all, we don't have an entirely separate alphabet encoding for every Western language that uses the Roman alphabet, even if some have extra characters that the others don't use, such as Eszett, 'ß', in German, or Eth, 'ð', and Thorn, 'þ', in Icelandic.

        1. Ian Johnston Silver badge

          Re: This reminds me of the prank...

          My understanding of Japanese 'kanji' is that they are logographs "borrowed" from Chinese script (the name meaning literally 'Han character'), but have different meanings

          Yup. They generally retain their Chinese meaning but can have five or even more additional Japanese meanings based on the sounds being vaguely similar.

  5. Vestas

    Sounds the same as "Reflections on Trusting Trust" - the Turing Award lecture from the 1980s, IIRC?

    You can't trust a compiler any more than you can trust third-party code you haven't analysed/tested.

    1. Jonathan Knight

      Yep - that was my first thought on reading this.

      https://www.cs.cmu.edu/~rdriley/487/papers/Thompson_1984_ReflectionsonTrustingTrust.pdf

      Ken realised the problem 38 years ago.

      1. Vestas

        Appropriate that it was 1984 really.

        I thought it was later.

  6. Bitsminer Bronze badge

    Left-click to compile

    Sounds like a good argument for requiring 100% code coverage testing......oh.

    1. sabroni Silver badge
      Boffin

      Re: 100% code coverage testing

      You put tests around the lines that do something that you want to happen. You take out the lines that don't need tests around them.

      Why wouldn't you have 100% code coverage? How many lines have you put in that don't need testing but have to be there? And what are they doing that's important enough to warrant inclusion but not important enough to verify with a test?

      1. Kristian Walsh

        Re: 100% code coverage testing

        “Why wouldn’t you have 100% code-coverage?” While you’re at it, why wouldn’t we have world peace either?

        Developers are allocated X hours to produce the software, but writing code-coverage testing requires a percentage of that time, and given that the original X hours was already insufficient to deliver the required features, it is very common for code to ship without 100% coverage. I’m not saying this is right, but it’s true.

        Good tests can take as long to write as the code – often longer if you need to expose error-recovery paths – but as an industry we don’t really care about quality, just new features...

        1. Bitsminer Bronze badge

          Re: 100% code coverage testing

          Code coverage testing = (# lines of code tested) / (# lines of code present) x 100%.

          But the bidi "bug" or transmutation of code can change the number of lines by (a) commenting them out invisibly to the developer and/or (b) rendering them unreachable, also invisibly to the developer.

          Hence my "...oh". The metric is...ummm....not reliable to detect the bidi bug(s), as the "code" might not be. And if you use a different tool than the compiler to measure it, well, then you have two problems.

          (I had previously submitted a lengthier reply, but once it got past El Reg's Perly gates, it seems to have fallen onto the floor of the second level of Hell, only to be lost and near forgotten.)

  7. Missing Semicolon Silver badge

    Old skool

    So whilst the language should support unicode text, the compiler should really barf on anything but 7-bit ASCII.

    1. captain veg Silver badge

      Re: Old skool

      Upvoted, but...

      The cultural imperialism shouldn't (ideally) extend to comments and string literals.

      -A.

      1. Yet Another Anonymous coward Silver badge

        Re: Old skool

        You can't enter these characters on a Fortran punched card so I don't see how this is a problem

        1. Arthur the cat Silver badge

          Re: Old skool

          You can't enter these characters on a Fortran punched card so I don't see how this is a problem

          Had Unicode turned up in the 70s I have no doubt IBM would have introduced EBCDIC-UNICODE, using the many unassigned codes in EBCDIC and an encoding inspired by, but totally different from and incompatible with, UTF-8.

    2. 2+2=5 Silver badge

      Re: Old skool

      > So whilst the language should support unicode text, the compiler should really barf on anything but 7-bit ASCII.

      No, because it's not the compiler that's at fault. It's the editor that's at fault, because it displays text that is actually inside a quoted string constant as if it were outside the string.

      I note that Vim doesn't make this mistake. Other 'old skool' editors may also not make this mistake. Newfangled ones seem to. :-)

    3. LDS Silver badge
      Thumb Down

      Re: Old skool

      Sorry. The A in ASCII stands for American and we are not Americans.

    4. Man inna barrel Bronze badge

      Re: Old skool

      One of the exploits used homoglyphs to disguise function or variable names. Their example substituted a Cyrillic letter H for a Latin letter H. I was surprised that this worked with any compiler, but apparently it did in many cases. As far as I am concerned, identifiers should be sequences of ASCII printable characters.

      I thought the C standard specified the lexical form of identifiers as beginning with an ASCII letter or underscore, then followed by ASCII alphanumeric characters or underscore. Maybe the standard has changed to widen the definition of "letter", to include Unicode, but I can't see the point of that. For example, if someone wants to write their identifiers in Cyrillic, they will still have to use Latin letters for language keywords, and to call common library functions.

      Guarding against homoglyphs outside of string literals and comments is not that difficult. However, the various exploits that insert or comment out code using Unicode in strings and comments are worrying. I happen to be working on a parser for a simple data representation language, so I will see how that behaves. It allows UTF-8 in strings and comments, rather than mandating ASCII-only code, and clumsy escapes for non-Latin text.

      By the way, a data representation language (e.g. XML or JSON) is a potential security hole on a typical Linux system, because such languages are often used to write system configuration files, and it is quite common practice to insert bits of config posted on forums. This is probably a good deal easier than trying to execute malicious code directly. You could alter permissions and file names, and so on.
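A rough version of the homoglyph guard described above fits in a few lines of Python. Here the script of each letter is inferred from the first word of its Unicode character name, which is a crude heuristic rather than the full Unicode Script property, and the function name is my own:

```python
# Flag identifiers that mix writing systems, e.g. a Cyrillic "Н" (U+041D)
# hiding among Latin letters in what looks like "Hello".
import unicodedata

def scripts_used(identifier):
    """Return the set of scripts (LATIN, CYRILLIC, ...) used by the letters."""
    scripts = set()
    for ch in identifier:
        if ch.isalpha():
            # e.g. 'CYRILLIC CAPITAL LETTER EN' -> 'CYRILLIC'
            scripts.add(unicodedata.name(ch).split()[0])
    return scripts

print(sorted(scripts_used("Hello")))       # ['LATIN']
print(sorted(scripts_used("\u041dello")))  # ['CYRILLIC', 'LATIN']
```

Flagging any identifier whose letters span more than one script would have caught the Cyrillic-H example from the paper.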

  8. claimed
    Trollface

    Wontfix

    Clearly not an issue for real developers, as it's not like they would copy and paste code off Stack Overflow, right?

    1. Yet Another Anonymous coward Silver badge

      Re: Wontfix

      Real developers use emacs butterfly mode

  9. Robert Carnegie Silver badge

    So is this hack's name "Trojan Source" or does it have a specific name, or is it looking for one? There come to mind:

    MADDOG

    ARDNASSAC but that is probably a tiny village in Perthshire or Provence, I haven't decided which.

  10. aerogems Bronze badge

    Hasn't this idea been around for literally years, if not decades? I swear I've heard this idea floated multiple times before. It seems like an idea that should be almost as old as compilers themselves, or at least as old as the moment the idea of writing malicious programs popped into someone's head and someone put it into practice.

    1. 2+2=5 Silver badge

      Yes it has been around for years (so no idea why your down-voter feels otherwise). Here's a documented example from 2017:

      https://github.com/golang/go/issues/20209

  11. YetAnotherJoeBlow Bronze badge
    Pint

    Missed that...

    Irrespective of compilers used, my environment would have exposed that trick - except for Eclipse... I need to check my settings.

    I would have missed that I think. Thanks for the heads-up.

    edit: fix sentence

  12. Ciaran McHale

    The example given seems to be incorrect

    It seems to me there is an error in the example in the paper (and reproduced in the article) claiming to show how what appears to be just a Python comment is really a comment followed by a "return" statement.

    I had a look at the paper, and it explains that the "RLI" Unicode character (right-to-left isolate) will "Force treating following text as right-to-left without affecting adjacent text" until this mode is cancelled by another command or (in the case of the example code) a newline character. This right-to-left display happens not at the level of words, but rather at the level of individual characters. Thus, the line:

    ''' Subtract funds from bank account then RLI''' ;return

    should appear in a text editor as:

    ''' Subtract funds from bank account then nruter; '''

    1. Peter X

      Re: The example given seems to be incorrect

      I was wondering that... also, surely everything after RLI would need to be reversed? And then, isn't it possible to detect shenanigans when RTL/LTR codes are not (1) balanced, and (2) contained within comments or string literals?

      1. DialTone

        Re: The example given seems to be incorrect

        Unless I'm mistaken, I believe that the bulk of Latin characters are considered to be "strongly typed" as LTR and so are always rendered in that direction (which is why they're not showing reversed in the example). The ordering of the words in each paragraph however is affected by the bidi direction. The handling of punctuation is somewhat more complex.

        For example rendering the following source string in RTL mode: "print this word" would produce the rendered output string "word this print". Any characters which are strongly-typed as RTL will indeed be rendered as RTL in the order in which I described.

        A second example - imagine the word "arabic" were included (using arabic script - I've used latin to make the explanation obvious), then the source string "print this arabic word" would be rendered as "word cibara this print"

  13. Anonymous Coward
    Anonymous Coward

    making code do other stuff is way old.

    Used to do something like this in 6502(6510 c64) assembler code.

    The 6510 in the Commodore64 had some extra undocumented instructions that dis-assemblers and debuggers didn't decode.

    Using careful placement you could make the code do very unexpected stuff. (Some games loaders trying to prevent copying used this, not very successfully; an NMI with a memory dump was an easy way around it.)

    Cracker frontends also used it to confuse other crackers from changing brag screens, and some demo writers used it too.

  14. phy445

    C/P from PDFs can be interesting...

    On a data analysis course I teach on, we had several students that copy/pasted example python code from the notes to find that it would not run. It looked OK and the original code had been checked so it was a bit of a mystery.

    It turned out that typing over the problem lines with seemingly identical text made the problems go away. My conclusion was that the PDF rendering had (presumably Unicode) characters that PyCharm (the students' development environment of choice) did not display but the Python system could see and took exception to.

    1. Yet Another Anonymous coward Silver badge

      Re: C/P from PDFs can be interesting...

      So your solution to the space/tab war is to introduce a em-space/en-space war?

    2. Anonymous Coward
      Anonymous Coward

      Re: C/P from PDFs can be interesting...

      Actually PDF predates Unicode so while many modern PDFs are constructed in a way that maps glyphs to Unicode, some don't. Even if done properly, cut/paste is a bit of an issue. The main problem is that there's no such thing as a word in PDF, just a glyph at a particular location. So while most documents create their text in a predictable order, it gets trickier with columns, tables, callouts and so on.

      Other problem areas are things like bullet symbols, whitespace, and ambiguous characters - think Ohm and Omega, space and nbsp, hyphen and m-dash and so on. Things get worse for simple RTL like Hebrew, worse again for Arabic, and by the time you get to Hindi, Bengali etc. you're pretty much f*cked in terms of text extraction unless the software that created it has thought of this and done things properly. I'll save you checking; it almost certainly hasn't.

      The solution when constructing PDFs, as always, is to make sure they're PDF/A-3a and/or PDF/UA compatible.

  15. Tom 38 Silver badge

    Various tools out there already can prevent these examples

    For example, the python one would be caught by linting - you shouldn't have multiple statements (the doc string + the return) on a single line. Code auto-formatting, which is common in python projects these days, would also want to rewrite that on to multiple lines for the same reason.

    Therefore, if your CI pipeline has either of those checks in them, a change like this would not sneak past.
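The same point can be made with Python's own tokenizer: whatever the editor renders, the token stream still contains the ';' and 'return' that the trojaned line hides, which is exactly what a linter sees (a minimal sketch):

```python
# Tokenize the paper's trojaned line: the display may show ";return" as part
# of the docstring, but the tokenizer sees a STRING, an OP ';' and a NAME 'return'.
import io
import tokenize

src = "''' Subtract funds from account then \u2067''' ;return\n"
tokens = [(tokenize.tok_name[tok.type], tok.string)
          for tok in tokenize.generate_tokens(io.StringIO(src).readline)]
print(tokens[:3])
```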

  16. Chairo
    Windows

    Back to the future

    Reminds me of the times of basic interpreters. It was possible to create a line of code and then add a rem with lots of delete characters and some other text.

  17. Anonymous Coward
    Anonymous Coward

    vim singled out for praise.

    Whenever I work with short sections of non-latin bidi text I find it easier to use hexdump than vim - no exaggeration. vim is utterly, utterly useless at any sort of RTL editing. Praise it if you must, but any resistance to this problem is certainly by accident rather than design.

    1. Michael

      Re: vim singled out for praise.

      Well, as I do all code reviews in vim, I'd catch this issue, so no problems for me. The joy of being too lazy to use the latest new tools. Sometimes the old ones just work well enough.

      1. Draco
        Windows

        Re: vim singled out for praise.

        Vim doesn't catch homoglyph attacks.

        It also didn't display a codepoint for the Python comment attack and, instead, displayed the disguised version of the code - mind you, odd cursor movement through the code was a tip-off.

        It did display codepoints for other bidi attacks, but it seems that certain bidi codes - like RLI (and perhaps a few others) - are rendered by Vim instead of being displayed as codepoints.

        I am using Vim 8.1 with patches 1-2269 on Ubuntu.

  18. Anonymous Coward
    Anonymous Coward

    *Why* does this work?

    IANAProgrammer, and I suppose I should RTFPDF, but *why* does this work? Do compilers understand anything other than 7 bit ASCII? I suppose they do, so you can use your RTL human language inside your program but damn, that seems like a huge oversight.

    Just think of how many pads of paper and pens we could buy if we stopped using computers.

    1. 2+2=5 Silver badge

      Re: *Why* does this work?

      It works because the code editor is buggy and displays something different to what is actually there i.e. what the programmer sees is not what the compiler sees.

      The cause of the behaviour is probably because it is using a standard text editing class and this bit of the behaviour is designed for a word processor and it hasn't been blocked or modified.

      An analogy might be if a code editor were to allow white-on-white formatting, like a word processor, and that were used to sneak code into a program in the guise of a few blank lines.

    2. Man inna barrel Bronze badge

      Re: *Why* does this work?

      It works because string literals and comments are meant for human reading, and it is therefore useful to accept Unicode in those parts of a program, even if the rest is in 7-bit ASCII. As far as I know, all Cyrillic characters require numeric escape sequences in order to be represented in a pure ASCII string literal. This would make Russian text illegible. The same applies to comments, which are intended for human reading, and not interpreted by a compiler.

      As others have said, forcing an unaccented Latin character set on non-English users is cultural imperialism. However, it is OK to have keywords and identifiers only in ASCII, because these are not actually words in any human language, but Computerish words. If you think you are talking English to your computer, you are in a state of sin.

  19. Binraider Silver badge

    While I’m biased, being an English speaker, for programming purposes I’ve never understood the need for Unicode in your source. A simpler and more predictable character set used to be common for programming tasks, and would not be subject to this vector. Do I want Unicode functionality in your program output downstream? Yes. But source doesn’t need it.

    That boat’s already sailed, though, so there’s not much point complaining.
