Trojan Source attack: Code that says one thing to humans tells your compiler something very different, warn academics

The way Unicode's UTF-8 text encoding handles different languages could be misused to write malicious code that says one thing to humans and another to compilers, academics are warning. "What if it were possible to trick compilers into emitting binaries that did not match the logic visible in source code?" ask Cambridge …

COMMENTS

Monday 1st November 2021 17:45 GMT Gene Cash

No examples

It's interesting that there are absolutely no downloadable examples of bidirectional text anywhere.

There's not one text file I could download and look at with (for example) emacs or Android Studio.

So I have no way of determining if this has an impact on me or not in my usual coding environments.

2 6 Reply
1. Monday 1st November 2021 19:03 GMT Anonymous Coward
  
  Re: No examples
  
  Let me email you bidirectional_text.PDF ... it will give you two illustrations:
  
  1) Typical bidirectional text illustrated in multiple languages.
  
  2) Instructions to email me 40 BTC to recover your system.
  
  Hmm, I guess I ought to post this with the Joke icon to get you to open it.
  
  4 0 Reply
2. Tuesday 2nd November 2021 09:25 GMT gazthejourno
  
  Re: No examples
  
  If you open the PDF paper linked in the article and scroll to the end, there's a bunch of examples there. They're C&Pable.
  
  2 0 Reply
3. Tuesday 2nd November 2021 12:25 GMT Tim 11
  
  Re: No examples
  
  Just find and paste some Arabic/Hebrew text and some English text into the same text file with an editor that supports it, then dump it out as hex and you'll see it
  
  1 0 Reply
  1. Tuesday 2nd November 2021 15:03 GMT Kristian Walsh
    Re: No examples
    
    That isn’t the vulnerability: Arabic and Hebrew text is automatically laid out right-to-left because that is the default line ordering for those codes - Latin text (including the keywords for programming languages) within larger blocks of Arabic and Hebrew is always laid out left-to-right by default. However, there are codepoints in Unicode which tell the text renderer to override that native ordering and instead render LTR as RTL and vice versa. This is the mechanism by which the vulnerability allows human reviewers to be fooled.
    
    The example in the paper is this line of Python:
    
    ''' Subtract funds from account then [RLI]''' ;return
    
    Which is rendered as:
    
    ''' Subtract funds from account then ⁧''' ;return
    
    That second line is the exact sequence of codepoints they talk about in the paper, with the Right-to-Left Isolate code (U+2067) in place.
    
    As you can see, the codes themselves are invisible, but they affect the display of the text following them, and that is how this vulnerability works, by making compilable code appear to be within a comment. But if you drag your selection cursor over the second line, you’ll see something isn’t quite what it seems...
    
    Most syntax-highlighters will catch this (I just checked, and VS Code does), and you’d see a “code-coloured” word inside the comment, but it can be subtle - especially in editors that style the text within comments (e.g., for doc-strings). If you use a terminal, your results are compounded further by how the terminal deals with bidirectional text.
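
    A minimal sketch of how one might look for these codes without relying on the editor's rendering. The function name `find_bidi_controls` and the sample string are illustrative assumptions, not anything from the paper; the code points are the explicit bidirectional formatting characters Unicode defines (U+202A to U+202E and U+2066 to U+2069).

```python
# Explicit bidirectional formatting characters defined by Unicode:
# embeddings/overrides U+202A..U+202E and isolates U+2066..U+2069.
BIDI_CONTROLS = {
    "\u202a": "LRE", "\u202b": "RLE", "\u202c": "PDF",
    "\u202d": "LRO", "\u202e": "RLO",
    "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI", "\u2069": "PDI",
}


def find_bidi_controls(text):
    """Return (line_number, column, abbreviation) for each bidi control found."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append((line_no, col, BIDI_CONTROLS[ch]))
    return hits


# The example line, with the invisible RLI written as an escape (hypothetical sample).
sample = "''' Subtract funds from account then \u2067''' ;return"
print(find_bidi_controls(sample))
```

    Because the check works on code points rather than on what is drawn on screen, it should behave the same regardless of how a given editor or terminal handles bidirectional text.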
    
    1 1 Reply
    1. Tuesday 2nd November 2021 15:31 GMT TRT
      
      Re: No examples
      
      Am I just being thick here, or should [RLI] reverse the CHARACTER order and not, as the illustration shows, the WORD order?
      
      0 0 Reply
      1. Tuesday 2nd November 2021 16:27 GMT Anonymous Coward
        
        Re: No examples
        
        The "return" keyword will resolve. Pretending it resolves to \0, then it becomes: \0; ''' Notice that the ';' comes BEFORE return, because if it came afterwards it wouldn't compile as the line wouldn't be terminated (or might, depending on the language).
        
        However, the problem really seems to be with how docstrings are handled, because allowing such keywords in docstrings is kind of crazy (is eval() allowed too?). Either way, maybe simply remove all unicode from anything other than std/gui print statements? Then if someone hacks that, well, all the rest is probably hacked too since they have write access.
        
        0 0 Reply
      2. Tuesday 2nd November 2021 16:31 GMT Draco
        
        Re: No examples
        
        I'm too tired to read the Unicode documentation on the algorithm in detail.
        
        You can find it here:
        
        https://unicode.org/reports/tr9/
        
        But ... reversing ''' ; would give you ; '''
        
        The text that follows return is considered not to be part of the RLI chunk.
        
        0 0 Reply
      3. Wednesday 3rd November 2021 11:30 GMT Kristian Walsh
        
        Re: No examples
        
        No. The RLI code changes the layout default of the following characters to right-to-left, but it doesn’t override the behaviour of characters that want to associate left-to-right. There’s a different code (Right-to-left Override U+202E) which would force the renderer to ignore the character’s native bidi mode and treat it as Right-to-left, regardless.
        
        The idea of line layout defaults is tricky to get your head around if you don’t already know the rules for typesetting text in a mix of right-to-left and left-to-right scripts. Basically, an English word within Arabic will be shown in its proper order: letters of the word running from left to right, so you see “mouse mat” not “tam esuom”, for example.
        
        To properly display text, the text renderer needs to figure out the line direction, but it only has a stream of characters to work with.
        
        To help, Unicode assigns each character a bidirectional class. Latin letters are “Strong Left-to-Right”, but punctuation is not strongly ordered: it follows the rule of surrounding text, so an exclamation-mark in Arabic would be left of the word it followed, in English it will be to the right, but both symbols are coded as U+0021. To allow this to work right, Unicode includes a bidirectional classification of “neutral” - which means that the character follows the already established line ordering. This is what this paper exploits.
        
        In the example, the RLI control code changes the current layout intent of the line to “right-to-left” from that point onward in the code stream. Thus, the following characters, which are punctuation, adopt right-to-left ordering because they are all in the “neutral” bidi category and the renderer has been told that the layout is now right-to-left dominant. However, the strong left-to-right characters 'r e t u r n' are still rendered as left-to-right, because that’s how they naturally associate even in text that is predominantly right-to-left.
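
        As a rough illustration of those classes, Python's standard `unicodedata.bidirectional()` reports the class assigned to each character. The helper name `bidi_classes` and the `tail` string are assumptions made for this sketch; in the output, `L` is strong left-to-right, `ON` is "other neutral", `WS` is whitespace and `RLI` is the isolate control itself.

```python
import unicodedata


def bidi_classes(text):
    """Pair each character with its Unicode bidirectional class."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]


# The tail of the example line: RLI, the quotes, a space, the semicolon, then 'return'.
tail = "\u2067''' ;return"
for ch, cls in bidi_classes(tail):
    print(f"U+{ord(ch):04X} {cls}")
```

        The quotes and the semicolon come out as neutrals, while every letter of 'return' is strong left-to-right, which matches the behaviour described above.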
        
        https://en.wikipedia.org/wiki/Bidirectional_text
        
        And here’s the documentation for the Unicode Bidi Algorithm.
        
        http://www.unicode.org/reports/tr9/
        
        3 0 Reply
Monday 1st November 2021 17:54 GMT Doctor Syntax

"We reserve the right to arbitrarily rename the next security discovery FLAMINGHELLDEATHPWNAGE."

I think the headline is a good start. Something along the lines of Trojan $LanguageOrProduct Source. It's going to need a lot of domains and logos.

1 0 Reply
1. Monday 1st November 2021 18:25 GMT Anonymous Coward
  
  Back-to-the Future-Pain?
  
  0 0 Reply
Monday 1st November 2021 18:02 GMT Loyal Commenter

This reminds me of the prank...

...where one could replace random semicolons in someone's source code with the Greek question-mark character (which is a homoglyph in most fonts) and then watch them pull their hair out trying to work out why their code no longer compiles...
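
For anyone on the receiving end of that prank, a small sketch of one way to find the culprit: list every non-ASCII character with its code point and Unicode name. The function name `non_ascii_report` and the sample source are made up for illustration.

```python
import unicodedata


def non_ascii_report(source):
    """List (line, column, code point, Unicode name) for every non-ASCII character."""
    report = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                name = unicodedata.name(ch, "<unnamed>")
                report.append((line_no, col, f"U+{ord(ch):04X}", name))
    return report


# Hypothetical victim file: the first "semicolon" is really U+037E.
pranked = "int x = 1\u037e\nint y = 2;\n"
print(non_ascii_report(pranked))
```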

18 0 Reply
1. Monday 1st November 2021 20:16 GMT A random security guy
  
  Re: This reminds me of the prank...
  
  The thing is, Unicode was created to prevent this thing from occurring: characters that look the same should encode to the same value irrespective of the language. This philosophy created real problems for CJK characters from Japan and China getting encoded to the same values.
  
  3 1 Reply
  1. Monday 1st November 2021 22:34 GMT Irony Deficient
    
    Unicode was created to prevent this thing from occurring:
    
    characters that look the same should encode to the same value irrespective of the language.
    
    The unification of similar looking characters only happened when their functions were similar and when backwards compatibility with other character encodings was not a significant issue. For example, German „Ö“ and Swedish ”Ö” were unified into a single code point (U+00D6), despite being located in different places in their respective alphabets. Unification of thousands of CJKV characters happened because their meanings were common across languages, despite (in some cases) minor glyph differences. However, Latin “H” (U+0048), Greek «Η» (U+0397), Cyrillic «Н» (U+041D), and Cherokee “Ꮋ” (U+13BB) were not unified, primarily because of the backwards compatibility issue. (Never mind that the Greek letter transliterates as “Ē”, the Cyrillic letter transliterates as “N”, and the Cherokee letter is pronounced /mi/ or /miː/, with tonal variations.)
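
    A quick way to see that the four look-alikes really are separate code points is to ask Python's `unicodedata` module for their names. This is only an illustrative sketch; the variable names are arbitrary.

```python
import unicodedata

# Latin H, Greek Eta, Cyrillic En and Cherokee Mi, written as escapes
# so the listing itself stays unambiguous.
lookalikes = ["\u0048", "\u0397", "\u041d", "\u13bb"]
names = {f"U+{ord(ch):04X}": unicodedata.name(ch) for ch in lookalikes}

for code_point, name in names.items():
    print(code_point, name)
```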
    
    14 0 Reply
    1. Monday 1st November 2021 23:25 GMT A random security guy
      
      Re: Unicode was created to prevent this thing from occurring:
      
      Thanks!!! For CJK, I remember that unification created more "political problems", not technical problems, I believe. It has been 25 years since I worked on it (when it was first introduced) so thanks for the clarification and the updates. I am behind times. Have to look up why they added Vietnamese to the set...
      
      2 0 Reply
      1. Tuesday 2nd November 2021 02:49 GMT Irony Deficient
        
        why they added Vietnamese to the set
        
        Historically, Vietnamese was written using chữ Nôm (Han ideographs adapted for Vietnamese). The Latin Vietnamese alphabet, created by 17th century Jesuit missionaries, replaced everyday usage of chữ Nôm after World War I, during the French colonial era.
        
        4 0 Reply
        
        Tuesday 2nd November 2021 10:24 GMT Jonathan Richards 1
        
        Re: why they added Vietnamese to the set
        
        This was a fate somewhat narrowly avoided by Japan, too. The US Education Mission to Japan (1946) recommended that kanji, the characters adapted from Chinese writing, be replaced by romaji, the orthography of Japanese written with a Latin alphabet. This recommendation was based on little and outdated knowledge by the Mission members, ignored their terms of reference, and would have "invited the Japanese people to commit cultural suicide".
        
        Ref.: The First United States Education Mission to Japan [pdf]
        
        2 0 Reply
        
        Tuesday 2nd November 2021 14:00 GMT Ian Johnston
        
        Re: why they added Vietnamese to the set
        
        I've done some work for a Japanese school, and the founder - a professor of education - says that not dropping kanji was a terrible mistake, because using them makes Japanese orthography ridiculously complicated.
        
        Remember that Japanese doesn't just use kanji with the Chinese meaning; it also uses them for completely different Japanese words which just happen to sound a bit similar. Hence, for example, the symbol for "tree" also means "Thursday" in Japan, but not in China. It's a complete mess.
        
        2 0 Reply
    2. Tuesday 2nd November 2021 12:24 GMT Loyal Commenter
      
      Re: Unicode was created to prevent this thing from occurring:
      
      Indeed, the Greek letter that looks like capital "H" in the Roman alphabet is actually the capital letter eta (the lower case eta looks like the letter 'n' with a tail, 'η'). Similarly, the capital letter rho looks like a 'P', and lower case rho is similar to the lower case 'p' ('ρ'). If one were to start using the same unicode encoding for these letters, it would cause no end of confusion to have an encoding for lower case eta, but not for uppercase (use H instead), and maybe have a separate encoding for lower case rho, depending on how it is represented in the font you are using.
      
      Pretty obviously these are all separate letters, and are, quite sensibly, encoded as such in unicode. How they are represented is a concern for the font in which they are rendered, not the encoding. After all, long gone are the days of typewriters with no key for the number 1, because a lower-case 'l' is perfectly good.
      
      0 0 Reply
  2. Tuesday 2nd November 2021 12:47 GMT Loyal Commenter
    
    Re: This reminds me of the prank...
    
    My understanding of Japanese 'kanji' is that they are logographs "borrowed" from Chinese script (the name meaning literally 'Han character'), but have different meanings, and certainly different pronunciations (Han Chinese is a tonal language, and Japanese is not), so from a character point of view they are technically the same characters.
    
    This is in much the same way that the word 'a' is spelt exactly the same in English and French, but has a completely different meaning (the indefinite article in English, and the third-person singular present tense of the verb avoir, to have, in French).
    
    It makes perfect sense for things that actually are the same character to be encoded in the same way. After all, we don't have an entirely separate alphabet encoding for every Western language that uses the Roman alphabet, even if some have extra characters that the others don't use, such as Eszett, 'ß', in German, or Eth, 'ð', and Thorn, 'þ', in Icelandic.
    
    2 0 Reply
    1. Tuesday 2nd November 2021 14:03 GMT Ian Johnston
      
      Re: This reminds me of the prank...
      
      My understanding of Japanese 'kanji' is that they are logographs "borrowed" from Chinese script (the name meaning literally 'Han character'), but have different meanings
      
      Yup. They generally retain their Chinese meaning but can have five or even more additional Japanese meanings based on the sounds being vaguely similar.
      
      2 0 Reply
Monday 1st November 2021 19:40 GMT Vestas

Sounds the same as "Reflections on Trusting Trust?" - a Turing Award lecture from the 1980s IIRC?

You can't trust a compiler any more than you can trust third-party code you haven't analysed/tested.

3 0 Reply
1. Tuesday 2nd November 2021 10:54 GMT Jonathan Knight
  
  Yep - that was my first thought on reading this.
  
  https://www.cs.cmu.edu/~rdriley/487/papers/Thompson_1984_ReflectionsonTrustingTrust.pdf
  
  Ken realised the problem 38 years ago.
  
  2 0 Reply
  1. Tuesday 2nd November 2021 17:05 GMT Vestas
    
    Appropriate that it was 1984 really.
    
    I thought it was later.
    
    1 0 Reply
Monday 1st November 2021 22:33 GMT Bitsminer

Left-click to compile

Sounds like a good argument for requiring 100% code coverage testing......oh.

0 0 Reply
1. Tuesday 2nd November 2021 08:05 GMT sabroni
  
  Re: 100% code coverage testing
  
  You put tests around the lines that do something that you want to happen. You take out the lines that don't need tests around them.
  
  Why wouldn't you have 100% code coverage? How many lines have you put in that don't need testing but have to be there? And what are they doing that's important enough to warrant inclusion but not important enough to verify with a test?
  
  3 1 Reply
  1. Wednesday 3rd November 2021 11:40 GMT Kristian Walsh
    
    Re: 100% code coverage testing
    
    “Why wouldn’t you have 100% code-coverage?” While you’re at it, why wouldn’t we have world peace either?
    
    Developers are allocated X hours to produce the software, but writing code-coverage testing requires a percentage of that time, and given that the original X hours was already insufficient to deliver the required features, it is very common for code to ship without 100% coverage. I’m not saying this is right, but it’s true.
    
    Good tests can take as long to write as the code – often longer if you need to expose error-recovery paths – but as an industry we don’t really care about quality, just new features...
    
    2 0 Reply
    1. Wednesday 3rd November 2021 13:13 GMT Bitsminer
      
      Re: 100% code coverage testing
      
      Code coverage testing = (# lines of code tested) / (# lines of code present) x 100%.
      
      But the bidi "bug" or transmutation of code can change the number of lines by (a) commenting them out invisibly to the developer and/or (b) rendering them unreachable, also invisibly to the developer.
      
      Hence my "...oh". The metric is...ummm....not reliable to detect the bidi bug(s), as the "code" might not be. And if you use a different tool than the compiler to measure it, well, then you have two problems.
      
      (I had previously submitted a lengthier reply, but once it got past El Reg's Perly gates, it seems to have fallen onto the floor of the second level of Hell, only to be lost and near forgotten.)
      
      1 0 Reply
    2. Thursday 30th March 2023 11:16 GMT nintendoeats
      
      Re: 100% code coverage testing
      
      Never mind code coverage...you can test 100% of the lines, but you may not be testing all differentiated scenarios, or verifying all expected behaviour. Chasing the dream of "actually testing everything" is a significant endeavour, requiring you to foresee all permutations and expected outputs.
      
      Deciding how much time to devote to tests is always a balancing act.
      
      0 0 Reply
Monday 1st November 2021 23:07 GMT Missing Semicolon

Old skool

So whilst the language should support unicode text, the compiler should really barf on anything but 7-bit ASCII.

19 1 Reply
1. Tuesday 2nd November 2021 11:36 GMT captain veg
  
  Re: Old skool
  
  Upvoted, but...
  
  The cultural imperialism shouldn't (ideally) extend to comments and string literals.
  
  -A.
  
  1 0 Reply
  1. Tuesday 2nd November 2021 12:42 GMT Yet Another Anonymous coward
    
    Re: Old skool
    
    You can't enter these characters on a Fortran punched card so I don't see how this is a problem
    
    9 0 Reply
    1. Tuesday 2nd November 2021 14:43 GMT Arthur the cat
      
      Re: Old skool
      
      You can't enter these characters on a Fortran punched card so I don't see how this is a problem
      
      Had Unicode turned up in the 70s I have no doubt IBM would have introduced EBCDIC-UNICODE, using the many unassigned codes in EBCDIC and an encoding inspired by, but totally different from and incompatible with, UTF-8.
      
      1 0 Reply
  2. Thursday 30th March 2023 11:18 GMT nintendoeats
    
    Re: Old skool
    
    Except that AFAICT, the example in the article has the offending code in a string literal.
    
    0 0 Reply
2. Tuesday 2nd November 2021 18:00 GMT Anonymous Coward
  
  Re: Old skool
  
  > So whilst the language should support unicode text, the compiler should really barf on anything but 7-bit ASCII.
  
  No because it's not the compiler that's at fault. It's the editor that's at fault because it displays text that is actually inside a quoted string constant as outside the string.
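
  To make the "what the compiler sees" side concrete, here is a small sketch using Python's standard `tokenize` module on the Python example line from the paper. The helper name `significant_tokens` is an assumption for illustration, and the RLI is written as an escape so it stays visible here.

```python
import io
import tokenize


def significant_tokens(source):
    """Return (token type name, text) for strings, names and operators."""
    wanted = {tokenize.STRING, tokenize.NAME, tokenize.OP}
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [(tokenize.tok_name[t.type], t.string) for t in tokens if t.type in wanted]


# The example line: a triple-quoted string ending in RLI, then ';' and 'return'.
line = "''' Subtract funds from account then \u2067''' ;return\n"
for kind, text in significant_tokens(line):
    print(kind, ascii(text))
```

  The tokenizer reports a string, a semicolon and then the name 'return', whatever order an editor chooses to draw them in.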
  
  I note that Vim doesn't make this mistake. Other 'old skool' editors may also not make this mistake. New fangled ones seem to. :-)
  
  1 0 Reply
3. Tuesday 2nd November 2021 18:01 GMT Anonymous Coward
  
  Re: Old skool
  
  Sorry. The A in ASCII stands for American and we are not Americans.
  
  2 0 Reply
4. Wednesday 3rd November 2021 11:55 GMT Man inna barrel
  
  Re: Old skool
  
  One of the exploits used homoglyphs to disguise function or variable names. Their example substituted a Cyrillic letter H for a Latin letter H. I was surprised that this worked with any compiler, but apparently it did in many cases. As far as I am concerned, identifiers should be sequences of ASCII printable characters.
  
  I thought the C standard specified the lexical form of identifiers as beginning with an ASCII letter or underscore, then followed by ASCII alphanumeric characters or underscore. Maybe the standard has changed to widen the definition of "letter", to include Unicode, but I can't see the point of that. For example, if someone wants to write their identifiers in Cyrillic, they will still have to use Latin letters for language keywords, and to call common library functions.
  
  Guarding against homoglyphs outside of string literals and comments is not that difficult. However, the various exploits that insert or comment out code using Unicode in strings and comments are worrying. I happen to be working on a parser for a simple data representation language, so I will see how that behaves. It allows UTF-8 in strings and comments, rather than mandating ASCII-only code, and clumsy escapes for non-Latin text.
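
  As a sketch of the "not that difficult" part, here is one way to flag non-ASCII identifiers while leaving strings and comments alone, shown for Python source using its standard `tokenize` module. The function name `non_ascii_identifiers` and the sample source are assumptions for illustration; a C front end would need its own lexer.

```python
import io
import tokenize


def non_ascii_identifiers(source):
    """Return (line_number, identifier) for every name containing non-ASCII characters."""
    names = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # NAME tokens cover identifiers and keywords; strings and comments are other types.
        if tok.type == tokenize.NAME and not tok.string.isascii():
            names.append((tok.start[0], tok.string))
    return names


# Hypothetical sample: the first function name contains Cyrillic En (U+041D), not Latin H.
source = (
    "def say\u041dello():\n"
    "    pass  # comment with \u00e9 is ignored\n"
    "\n"
    "def sayHello():\n"
    "    pass\n"
)
print(non_ascii_identifiers(source))
```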
  
  By the way, a data representation language (e.g. XML or JSON) is a potential security hole on a typical Linux system, because such languages are often used to write system configuration files, and it is quite common practice to insert bits of config posted on forums. This is probably a good deal easier than trying to execute malicious code directly. You could alter permissions and file names, and so on.
  
  1 0 Reply
Monday 1st November 2021 23:44 GMT claimed

Wontfix

Clearly not an issue for real developers as it's not like they would copy and paste code off stackoverflow, right?

16 0 Reply
1. Tuesday 2nd November 2021 13:34 GMT Yet Another Anonymous coward
  
  Re: Wontfix
  
  Real developers use emacs butterfly mode
  
  7 0 Reply
Tuesday 2nd November 2021 01:04 GMT Robert Carnegie

So is this hack's name "Trojan Source" or does it have a specific name, or is it looking for one? There come to mind:

MADDOG

ARDNASSAC but that is probably a tiny village in Perthshire or Provence, I haven't decided which.

4 0 Reply
Tuesday 2nd November 2021 02:07 GMT aerogems

Hasn't this idea been around for literally years, if not decades? I swear I've heard this idea floated before multiple times. Seems like an idea that should be almost as old as compilers themselves, or at least after the idea that you could write malicious programs popped into someone's head and they put it into practice.

2 2 Reply
1. Tuesday 2nd November 2021 17:53 GMT Anonymous Coward
  
  Yes it has been around for years (so no idea why your down-voter feels otherwise). Here's a documented example from 2017:
  
  https://github.com/golang/go/issues/20209
  
  2 0 Reply
Tuesday 2nd November 2021 05:49 GMT YetAnotherJoeBlow

Missed that...

Irrespective of compilers used, my environment would have exposed that trick - except for Eclipse... I need to check my settings.

I would have missed that I think. Thanks for the heads-up.

edit: fix sentence

0 0 Reply
Tuesday 2nd November 2021 11:00 GMT Ciaran McHale

The example given seems to be incorrect

It seems to me there is an error in the example in the paper (and reproduced in the article) claiming to show how what appears to be just a Python comment is really a comment followed by a "return" statement.

I had a look at the paper, and it explains that the "RLI" Unicode character (right-to-left isolate) will "Force treating following text as right-to-left without affecting adjacent text" until this mode is cancelled by another command or (in the case of the example code) a newline character. This right-to-left display happens not at the level of words, but rather at the level of individual characters. Thus, the line:

''' Subtract funds from bank account then RLI''' ;return

should appear in a text editor as:

''' Subtract funds from bank account then nruter; '''

3 1 Reply
1. Tuesday 2nd November 2021 12:20 GMT Peter X
  
  Re: The example given seems to be incorrect
  
  I was wondering that... also, surely everything after RLI would need to be reversed? And then, is it not possible to detect shenanigans when RL/LR codes are not (1) balanced, and (2) contained within comments or string literals?
  
  2 0 Reply
  1. Tuesday 2nd November 2021 13:11 GMT DialTone
    
    Re: The example given seems to be incorrect
    
    Unless I'm mistaken, I believe that the bulk of Latin characters are considered to be "strongly typed" as LTR and so are always rendered in that direction (which is why they're not showing reversed in the example). The ordering of the words in each paragraph however is affected by the bidi direction. The handling of punctuation is somewhat more complex.
    
    For example rendering the following source string in RTL mode: "print this word" would produce the rendered output string "word this print". Any characters which are strongly-typed as RTL will indeed be rendered as RTL in the order in which I described.
    
    A second example - imagine the word "arabic" were included (using arabic script - I've used latin to make the explanation obvious), then the source string "print this arabic word" would be rendered as "word cibara this print"
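
    The two examples above can be mimicked with a deliberately crude toy model. This is not the Unicode Bidirectional Algorithm, just word-order reversal with the letters of RTL-script words flipped; the function name `toy_rtl_layout` and the `rtl_words` parameter are invented for this sketch, and punctuation is ignored entirely.

```python
def toy_rtl_layout(words, rtl_words=frozenset()):
    """Toy model only: lay words out right-to-left.

    Words listed in rtl_words stand in for RTL-script words, so their letters
    are reversed too; every other word keeps its left-to-right letter order.
    """
    laid_out = [w[::-1] if w in rtl_words else w for w in words]
    return " ".join(reversed(laid_out))


print(toy_rtl_layout("print this word".split()))
print(toy_rtl_layout("print this arabic word".split(), rtl_words={"arabic"}))
```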
    
    2 0 Reply
Tuesday 2nd November 2021 11:22 GMT Anonymous Coward

making code do other stuff is way old.

Used to do something like this in 6502(6510 c64) assembler code.

The 6510 in the Commodore64 had some extra undocumented instructions that dis-assemblers and debuggers didn't decode.

Using careful placement you could make the code do very unexpected stuff. (Some games loaders trying to prevent copying used this, not very successfully; NMI with a memdump was an easy way around it.)

Cracker frontends also used it to confuse other crackers from changing brag screens, and some demo writers used it too.

3 0 Reply
Tuesday 2nd November 2021 13:58 GMT phy445

C/P from PDFs can be interesting...

On a data analysis course I teach on, we had several students that copy/pasted example python code from the notes to find that it would not run. It looked OK and the original code had been checked so it was a bit of a mystery.

It turned out that typing over the problem lines with seemingly identical text made the problems go away. My conclusion was that the PDF rendering had (presumably unicode) characters that pyCharm (students' development environment of choice) did not display but the python system could see and took exception to.
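
A sketch of how such characters could be made visible, on the assumption that the culprits were format characters or non-ASCII spaces (the post does not say which characters they actually were). The function name `reveal_hidden` and the sample line are hypothetical.

```python
import unicodedata


def reveal_hidden(line):
    """Replace format (Cf) and non-ASCII space (Zs) characters with visible <U+XXXX> tags."""
    out = []
    for ch in line:
        cat = unicodedata.category(ch)
        if cat == "Cf" or (cat == "Zs" and ch != " "):
            out.append(f"<U+{ord(ch):04X}>")
        else:
            out.append(ch)
    return "".join(out)


# Hypothetical pasted line containing a no-break space and a zero-width space.
pasted = "x\u00a0=\u200b 1"
print(reveal_hidden(pasted))
```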

0 0 Reply
1. Tuesday 2nd November 2021 16:48 GMT Yet Another Anonymous coward
  
  Re: C/P from PDFs can be interesting...
  
  So your solution to the space/tab war is to introduce an em-space/en-space war?
  
  0 0 Reply
2. Tuesday 2nd November 2021 18:00 GMT Anonymous Coward
  
  Re: C/P from PDFs can be interesting...
  
  Actually PDF predates Unicode so while many modern PDFs are constructed in a way that maps glyphs to Unicode, some don't. Even if done properly, cut/paste is a bit of an issue. The main problem is that there's no such thing as a word in PDF, just a glyph at a particular location. So while most documents create their text in a predictable order, it gets trickier with columns, tables, callouts and so on.
  
  Other problem areas are things like bullet symbols, whitespace, and ambiguous characters - think Ohm and Omega, space and nbsp, hyphen and em-dash and so on. Things get worse for simple RTL like Hebrew, worse again for Arabic and by the time you get to Hindi, Bengali etc you're pretty much f*cked in terms of text extraction unless the software that created it has thought of this and done things properly. I'll save you checking; it almost certainly hasn't.
  
  The solution when constructing PDFs, as always, is to make sure they're PDF/A-3a and/or PDF/UA compatible.
  
  0 0 Reply
Tuesday 2nd November 2021 15:03 GMT Tom 38

Various tools out there already can prevent these examples

For example, the python one would be caught by linting - you shouldn't have multiple statements (the doc string + the return) on a single line. Code auto-formatting, which is common in python projects these days, would also want to rewrite that on to multiple lines for the same reason.

Therefore, if your CI pipeline has either of those checks in them, a change like this would not sneak past.
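
As a rough sketch of the kind of check a linter performs here (this is not the implementation of any particular linter), Python's standard `ast` module can report lines that hold more than one statement. The function name and the sample source are assumptions for illustration.

```python
import ast
from collections import Counter


def lines_with_multiple_statements(source):
    """Return the line numbers on which more than one statement starts."""
    tree = ast.parse(source)
    counts = Counter(
        node.lineno for node in ast.walk(tree) if isinstance(node, ast.stmt)
    )
    return sorted(line for line, n in counts.items() if n > 1)


# Hypothetical function using the trick: docstring, RLI, then ';return' on one line.
source = (
    "def withdraw(bank):\n"
    "    ''' Subtract funds then \u2067''' ;return\n"
    "    bank['alice'] -= 100\n"
)
print(lines_with_multiple_statements(source))
```

Note that this flags the structure (two statements sharing a line), not the bidi character itself, so it would also fire on innocent one-liners such as `if x: y()`.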

1 0 Reply
Tuesday 2nd November 2021 17:37 GMT Chairo

Back to the future

Reminds me of the times of basic interpreters. It was possible to create a line of code and then add a rem with lots of delete characters and some other text.

1 0 Reply
Tuesday 2nd November 2021 18:07 GMT Anonymous Coward

vim singled out for praise.

Whenever I work with short sections of non-latin bidi text I find it easier to use hexdump than vim - no exaggeration. vim is utterly, utterly useless at any sort of RTL editing. Praise it if you must, but any resistance to this problem is certainly by accident rather than design.

0 0 Reply
1. Tuesday 2nd November 2021 19:20 GMT Michael
  
  Re: vim singled out for praise.
  
  Well, as I do all code reviews in vim, I'd catch this issue so no problems for me. The joy of being too lazy to use the latest new tools. Sometimes the old ones just work well enough.
  
  0 0 Reply
  1. Wednesday 3rd November 2021 13:40 GMT Draco
    
    Re: vim singled out for praise.
    
    Vim doesn't catch homoglyph attacks.
    
    It also didn't display a codepoint for the Python comment attack and, instead, displayed the disguised version of the code - mind you, odd cursor movement through the code was a tip off.
    
    It did display codepoints for other bidi attacks, but it seems that certain bidi codes - like RLI (and perhaps a few others) - are rendered by Vim instead of displayed as codepoints
    
    I am using Vim 8.1 with patches 1-2269 on Ubuntu.
    
    0 0 Reply
Tuesday 2nd November 2021 20:21 GMT Anonymous Coward

*Why* does this work?

IANAProgrammer, and I suppose I should RTFPDF, but *why* does this work? Do compilers understand anything other than 7 bit ASCII? I suppose they do, so you can use your RTL human language inside your program but damn, that seems like a huge oversight.

Just think of how many pads of paper and pens we could buy if we stopped using computers.

2 0 Reply
1. Wednesday 3rd November 2021 11:58 GMT Anonymous Coward
  
  Re: *Why* does this work?
  
  It works because the code editor is buggy and displays something different to what is actually there i.e. what the programmer sees is not what the compiler sees.
  
  The cause of the behaviour is probably because it is using a standard text editing class and this bit of the behaviour is designed for a word processor and it hasn't been blocked or modified.
  
  An analogy might be if a code editor were to allow white-on-white formatting, like a word processor, and that were used to sneak code into a program in the guise of a few blank lines.
  
  0 0 Reply
2. Wednesday 3rd November 2021 12:19 GMT Man inna barrel
  
  Re: *Why* does this work?
  
  It works because string literals and comments are meant for human reading, and it is therefore useful to accept Unicode in those parts of a program, even if the rest is in 7 bit ASCII. As far as I know, all Cyrillic characters require numeric escape sequences in order to be represented in a pure ASCII string literal. This would make Russian text illegible. The same applies to comments, which are intended for human reading, and not interpreted by a compiler.
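
  To illustrate the legibility point, here is what a short Russian word looks like once it is forced into ASCII-only escapes, using Python's built-in `unicode_escape` codec. The word chosen is just an example.

```python
text = "Привет"  # "Hello" in Russian

# Convert to a pure-ASCII representation using \uXXXX escapes.
escaped = text.encode("unicode_escape").decode("ascii")
print(escaped)

# Decoding the escapes recovers the original text.
restored = escaped.encode("ascii").decode("unicode_escape")
print(restored == text)
```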
  
  As others have said, forcing an unaccented Latin character set on non-English users is cultural imperialism. However, it is OK to have keywords and identifiers only in ASCII, because these are not actually words in any human language, but Computerish words. If you think you are talking English to your computer, you are in a state of sin.
  
  0 0 Reply
Thursday 4th November 2021 08:17 GMT Binraider

While I’m biased being an English speaker, for programming purposes I’ve never understood the need for Unicode in your source. A simpler and more predictable char set used to be common for programming tasks, and would not be subject to this vector. Do I want unicode functionality in your program output downstream? Yes. But source doesn’t need it.

Boat's already sailed though, so not much point complaining.

0 1 Reply
