Forget anonymity, we can remember you wholesale with machine intel, hackers warned

Anonymous programmers, from malware writers to copyright infringers and those baiting governments with censorship-foiling software, may all be unveiled using stylistic programming traits which survive into the compiled binaries – regardless of common obfuscation methods. The work, titled De-anonymizing …


  1. DropBear
    Pint

    Happy new year, and here's a pint for you guys for the reference to the original book form of "Total Recall"...!

    EDIT: ...also, I suppose these good folks just proved beyond a shadow of a doubt that programming IS indeed a form of art.

    1. JLV

      Given the subject, this brings to mind a short story by PKD (can't recall the title) in which a robot investigates a murder scene and gradually narrows down the possible culprits by trawling a database of characteristics for the population vs evidence it finds at the scene. I believe it turns out that the whole evidence set was being gamed from the start.

      1. VinceH

        That rang a bell, so I just took a look at a list of PKD short stories and novellas to see if the right title would jump out at me.

        It didn't - but I did spot The Variable Man; I was thinking about that very story quite recently, but couldn't remember what it was called or who it was by!

  2. ZenCoder

    These detection methods don't scale.

    With statistical detection methods the number of false positives and false negatives increases geometrically with sample size.

    Increase the sample size to 1000, then 10,000, and you will see it's pointless except to conjure up some grant money.

    1. Anonymous Coward

      Re: These detection methods don't scale.

      They don't scale, no. But if there are 50 coders in a company and a hacker's style matches one of them that person can expect a more thorough investigation.

      1. P. Lee

        Re: These detection methods don't scale.

        The question is, how thorough can your investigation be?

        With a dev environment and code on a tiny removable storage device, only the incompetent are going to be caught. Perhaps that is enough though.

        This is not handwriting. I can't see why those statistical markers cannot themselves be reverse-engineered and used to obfuscate the author's code.

        1. Michael Wojcik Silver badge

          Re: These detection methods don't scale.

          I can't see why those statistical markers cannot themselves be reverse-engineered and used to obfuscate the author's code.

          Oh, they can. On a sufficiently-fast system, you could even do this in an IDE or toolchain in nearly real time: as the programmer makes changes, the system could show what fingerprints the model identifies in it, and even suggest changes that would alter the fingerprint.

          There are a few obstacles:

          - It's some effort to implement something like this. Most people are too lazy, even if they're capable, or simply don't care - it's not a prominent aspect of their threat model.

          - It's resource-intensive. Modern IDEs already soak up an idiotic amount of CPU time, I/O bandwidth, etc. Will programmers (of whatever motivation) feel like applying resources to the problem?

          - Your adversaries may be using different models. That could mean anything from sufficiently-different training data, to different feature sets, to entirely different classifiers.

      2. Anonymous Coward

        Re: These detection methods don't scale.

        >They don't scale, no. But if there are 50 coders in a company and a hacker's style matches one of them that person can expect a more thorough investigation.

        Yeah, but you're dealing with the hacker mentality - they'd be emulating a colleague's ever-so-slightly flawed attempt to copy another colleague's style.

        1. Michael H.F. Wilkinson

          Re: These detection methods don't scale.

          Not only do these methods not necessarily scale, they need an ever increasing ground truth of identified code for training. This is not trivial to obtain. Besides, as more and more coders are added, you have to worry about the number of degrees of freedom in coding anything, i.e. are there enough different coding styles to distinguish the millions of coders on this planet? You also have to deal with code developed by teams (which is the normal situation), which will either show a mixture of styles, or predominantly show the style of the loudest mouth in the team, with a small admixture of the other members. Similarly, what happens when a new coder refactors old stuff? I know I have seriously refactored a program written by some students to adapt it to new use cases. It is still not really like my

          You could of course show that a certain style is consistent with a known sample of some hacker's work, but even then people might slowly change their coding style. Having had a look at some of my earlier efforts, I know I have changed style a great deal (thank goodness ;-)), if only by incorporating OO techniques.

          1. Michael Wojcik Silver badge

            Re: These detection methods don't scale.

            Not only do these methods not necessarily scale, they need an ever increasing ground truth of identified code for training. This is not trivial to obtain.

            No. Unsupervised learning by kernel extension with noisy input is a well-researched and broadly successful area.

            And additional input is extremely easy to obtain.

    2. StaudN

      Re: These detection methods don't scale.

      I agree that the current numbers are insufficient, but surely it is worth trying to develop such identification mechanisms? It would be a very powerful network defense tool to have a signature intercept capable of picking up code by known malware authors...

      1. SoaG

        Re: code by known malware authors

        Perhaps it could be made to work the other way too. Within a secure network, regardless of credentials of the user/admin trying to run something, refuse to run any code other than by whitelisted authors.

        1. Michael Wojcik Silver badge

          Re: code by known malware authors

          Within a secure network, regardless of credentials of the user/admin trying to run something, refuse to run any code other than by whitelisted authors.

          For that application, code signing is far more reliable, simple, and scalable.

    3. Hargrove

      Re: These detection methods don't scale.

      The best and most accessible discussion of the problem of data classification is in a couple of papers by Tom Fawcett. These deal with something called ROC curves. ROC originally stood for "receiver operating characteristic", referring to the ability of a receiver to classify targets in noise. An analogous phenomenon occurs in pattern matching in digital data, where the term "relative operating characteristic" is used. The following link is a good starting point.

      http://www.hpl.hp.com/techreports/2003/HPL-2003-4.pdf

      The problem boils down to one of true detection and false alarm rates. You can have an arbitrarily high true detection rate if you can live with an arbitrarily high rate of false alarms. You can reduce the number of false alarms to an arbitrarily low level. But, only at the cost of missing an arbitrarily large percentage of true targets.
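
      A minimal sketch of that trade-off, using made-up match scores and labels (nothing here comes from the paper or from Fawcett's report; `roc_points`, `scores` and `labels` are purely illustrative names): sweep a decision threshold and watch the detection rate and the false alarm rate rise together.

```python
from typing import List, Tuple


def roc_points(scores: List[float], labels: List[int]) -> List[Tuple[float, float, float]]:
    """Return (threshold, detection_rate, false_alarm_rate) for each distinct score.

    A sample is flagged as a match when its score is >= the threshold.
    labels: 1 = really written by the suspect, 0 = written by someone else.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    # Walk from the strictest threshold to the most permissive one.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((threshold, tp / positives, fp / negatives))
    return points


# Hypothetical similarity scores for eight code samples (assumed data).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

for threshold, detection, false_alarm in roc_points(scores, labels):
    print(f"threshold={threshold:.2f}  detection={detection:.2f}  false_alarm={false_alarm:.2f}")
```

      Lowering the threshold never lowers either rate, so with this toy data catching the last true match means accepting more false alarms as well.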

      The phrase "No such thing as a free lunch" is occasionally used in the literature to describe this, and ZenCoder's comment is right on the mark.

      1. Michael Wojcik Silver badge

        Re: These detection methods don't scale.

        The problem boils down to one of true detection and false alarm rates

        I'm amazed that so many Reg readers have time to post comments like this, given the hours y'all must devote to explaining the finer principles of egg-sucking to your grandmothers.

        No one credible does machine-learning research without being well-versed in basic concepts like precision and recall rates. That's, like, week 2 of your Introduction to Machine Learning class.

        Admittedly a more-sophisticated understanding of statistics is not universal among ML researchers and implementers; folks like Vincent Granville bang on about that incessantly, and they have a point. But that doesn't mean their work is automatically useless, as is trivially demonstrated by the fact that it's very often put to use. Google Translate, say, may be rubbish compared to human translators, but that hasn't stopped people from using it.

        Even a 66% success rate is useful in some applications.

  3. Robin Bradshaw

    Ctrl+C, Ctrl+V

    stackoverflow is going to end up getting blamed for everything :)

    1. Anonymous Coward

      Re: Ctrl+C, Ctrl+V

      I'm amazed at where my code manages to turn up after posting to stackoverflow.

      Worrying too, as many are minimal 'batteries not included' examples, or 'making this good is left as an exercise for the reader'.

  4. a_yank_lurker

    Somewhat of a yawn

    Personal coding styles will vary much like personal writing styles vary. However, with coding there are conventions used both by corporations and by the language to make for more understandable code. Also, I would like to see a comparison with writing styles.

  5. David Roberts
    Holmes

    Hmmm.....

    ...that one started life as a Fortran programmer.....the COBOL is strong in this one......

    1. Ole Juul

      Re: Hmmm.....

      He codes with an accent.

    2. Chemist

      Re: Hmmm.....

      "...that one started life as a Fortran programmer..."

      May the FORTH be with you

      1. Madeye

        Re: Hmmm.....

        FORTH would be a terrible language to hack in

        The language is fundamentally mutable and relatively obscure which means each author would most likely leave a clearly identifiable fingerprint.

        1. Pirate Dave Silver badge
          Pirate

          Re: Hmmm.....

          "The language is fundamentally mutable and relatively obscure which means each author would most likely leave a clearly identifiable fingerprint."

          Or they could just question all 12 of the FORTH programmers in the world...

  6. This post has been deleted by its author

  7. Winkypop Silver badge
    Joke

    May contain traces of pepperoni

    They can detect pizza preferences from coder styles?

  8. Anonymous Coward

    Lesson Learned

    If you walk the Dark Side, don't contribute to open source. Don't leave digital fingerprints.

    Paper worthy of Boffinhood, especially as it does discuss the limitations of their method.

    1. Sureo

      Re: Lesson Learned

      Have they tried it on stuxnet?

      1. Michael Wojcik Silver badge

        Re: Lesson Learned

        Have they tried it on stuxnet?

        They'd need something to compare it to.

        This isn't a magical oracle that maps object code to arbitrary authors. It's a classifier. It tells you what part of its training corpus a candidate most closely matches.
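
        A toy illustration of that point, assuming a nearest-centroid classifier over invented two-number style vectors (the paper's actual features and classifier are not described in this thread; `classify`, `corpus` and the author names are hypothetical): whoever really wrote the candidate, the answer is always somebody in the training corpus.

```python
import math
from typing import Dict, List


def centroid(vectors: List[List[float]]) -> List[float]:
    """Component-wise mean of a list of equal-length feature vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def classify(candidate: List[float], corpus: Dict[str, List[List[float]]]) -> str:
    """Return the corpus author whose mean feature vector is closest to the candidate."""
    centroids = {author: centroid(samples) for author, samples in corpus.items()}
    return min(centroids, key=lambda author: math.dist(candidate, centroids[author]))


# Hypothetical training corpus: two known authors, two samples each (assumed data).
corpus = {
    "alice": [[0.9, 0.1], [0.8, 0.2]],
    "bob": [[0.1, 0.9], [0.2, 0.8]],
}

# A sample that resembles alice's known work.
print(classify([0.7, 0.3], corpus))

# A sample from an author who is not in the corpus at all: the classifier
# still names the nearest known author, because it has no other answer to give.
print(classify([3.0, 0.0], corpus))
```

        Without known samples from the real authors in `corpus`, the output is only "closest of the people we have", which is why there would need to be something to compare against.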

  9. This post has been deleted by its author

    1. Mike Bell

      Thank God I don't have to support your code! Indentations show in a very clear way how blocks of code relate to each other.

      1. This post has been deleted by its author

        1. cambsukguy

          I will stick to plenty of comments and suitable indenting, I think; they don't seem to be mutually exclusive.

        2. Anonymous Coward

          If you see behaviour that you think is a bug, you really don't want to put your whole trust in the comments. After all, the comments were likely written by the person who wrote the code (and the bug, if it exists).

          But then again, both the code and the comments may well have been updated by "freestylers".

          1. Doctor Syntax Silver badge

            "After all, the comments were likely written by the person who wrote the code"

            True, but they may not say the same thing.

        3. Jeffrey Nonken

          1970s coder here

          I too detest uncommented code. And I too come from a FORTRAN background, ultimately. But I not only comment the crap out of my code, I also indent and structure the crap out of it. Maybe I'm just not as smart as you; I want to be able to read it in six months, and I can't instantaneously understand unfamiliar code at a glance. I find it much more readable if it's nicely structured.

          I feel that you're just trading one type of laziness for another. Making your code readable isn't just for you; it's also for other programmers. And that's part of what I am trying to do. Seems like you're only doing it for yourself.

          And if that's true of your coding style, I have to wonder about your comments.

          MHO. YMMV.

          Fortunately for the rest of us, there is Artistic Style. http://astyle.sourceforge.net/

          1. Jeffrey Nonken

            Re: 1970s coder here

            Oh yeah. I program in assembly regularly. I don't indent, it doesn't feel natural to the language, but I use plenty of other visual breaks and clues, and I meticulously align the comments such that they're easily readable.

            But 'C' and other structured languages, I indent. And use Artistic Style to clean up if I have to shift things around enough that they get crazy.

        4. nijam Silver badge

          Neither indentation nor comments are necessarily as useful as is widely believed - for example, most languages have pretty-printing available, after all.

          In fact, the best you can hope for is that the comment/indentation is not inconsistent with what the code actually 'means' to the computer.

      2. Primus Secundus Tertius Silver badge

        use a pretty printer

        Indenting is only useful if it shows what the computer thinks, rather than what the programmer thinks.

        Example: a bug in a Coral66 program, where the preceding 'comment' lacked a terminating semicolon. So the program statement got absorbed into the comment and was therefore absent from the binary.

    2. Steven Roper

      I do indent my code, but I hate the convention that has the opening curly brace on the same line as the conditional that spawns it, such as:

      if (condition == test) {
      ....doCondition();
      ....etc...
      }

      I always indent my code with the opening and closing braces lined up and on their own lines. It makes code blocks easier to spot as well as spacing everything out for easier legibility, like this:

      if (condition == test)
      {
      ....doCondition();
      ....etc...
      }

      This way bracket highlighting at the cursor makes both braces instantly spottable at the left, rather than having to hunt across lines of code to find the opening curly brace!

      Of course this also wouldn't survive compilation, but anyone seeing my source would peg me as its author since I swear I'm the only programmer I know who insists on arranging my braces this way!

      1. Mike Bell

        Quite right, too. Mostly appears in printed material, where a modest amount of paper can be saved.

      2. Crazy Operations Guy

        " I swear I'm the only programmer I know who insists on arranging my braces this way!"

        Not the only one, it's the style in K&R and in the Unix source (and its derivatives).

      3. Anonymous Coward

        If you swear that you are the only programmer you know who insists on arranging your braces that way, you either haven't programmed very much, or you don't know very many programmers.

        Or you just like to swear.

        1. Steven Roper

          "If you swear that you are the only programmer you know who insists on arranging your braces that way, you either haven't programmed very much, or you don't know very many programmers."

          Your second guess is the correct one. I've been programming since 1983, when I started with first CBM BASIC and then 6510 assembler on the Commodore 64, and went on from there. But it's not a particularly sociable lifestyle, and I'm not a particularly sociable man, so I only know a dozen or so programmers.

          But whenever I see code on the internet, whether it's stackoverflow, git or SF, it nearly always has opening curly braces following the conditional rather than on the next line. So I'm sure I can be forgiven for thinking I'm alone in this convention!

      4. alain williams Silver badge

        Code layout

        About 30 years ago at a LUUG (London Unix User Group) meeting in a pub DT asked how an if/else should be formatted. There were 14 of us and 13 different answers; we were all prepared to defend our own style as being the best - all their arguments were wrong since it was obvious that my own style was the only good one.

        Many preferences seem to depend on which languages you cut your programming teeth on, how they were laid out.

        As regards your examples:

        * the opening '{' should be on the line with the 'if'; the '}' ends the 'if', while the '{' is less important and just makes the 'if' body multi-statement.

        * there should not be a space after the 'if'. Why? In my case, because SNOBOL did not allow it.

      5. Jeffrey Nonken

        I strongly mislike K&R style braces placement. I would find your style quite acceptable.

      6. arctic_haze

        Braces in the same column

        You would be mistaken for me and the other way round. I believe this is the old school.

        However, I would support tracking down and isolating from society all the developers who started with BASIC.

      7. Michael Wojcik Silver badge

        I hate the convention that...

        Oh goody, let's have a style religious war in a Reg forum. It's been a while.

        Personally, I hate it when people use an integral number of spaces for indentation. My indents are always a multiple of π.

    3. Destroy All Monsters Silver badge

      > Overall, my code is like nobody else's.

      Code is for reading by other programmers.

      Your work is utterly useless and if you have managers, they should remove themselves from the gene pool.

      Probably needs a special Developer Darwin Award.

    4. keithpeter Silver badge
      Pint

      Arthur Whitney

      "Overall, my code is like nobody else's."

      http://www.jsoftware.com/jwiki/Essays/Incunabulum

      Try that for a C style. Whitney seems to be doing OK with it

      http://queue.acm.org/detail.cfm?id=1531242

    5. swm

      Try understanding LISP code without indentation - we used to write LISP in the 1970s without an indenting editor, so you needed to have some sanity in your style.
