
Happy new year, and here's a pint for you guys for the reference to the original book form of "Total Recall"...!
EDIT: ...also, I suppose these good folks just proved beyond a shadow of a doubt that programming IS indeed a form of art.
Anonymous programmers, from malware writers to copyright infringers and those baiting governments with censorship-foiling software, may all be unveiled using stylistic programming traits which survive into the compiled binaries – regardless of common obfuscation methods. The work, titled De-anonymizing …
given the subject, this brings to mind a short story by PKD (can't recall the title) where a robot investigates a murder scene and gradually narrows down the possible culprits by trawling a database of characteristics for the population vs evidence it finds at the scene. I believe that it turns out that the whole evidence set was being gamed from the start.
The question is, how thorough can your investigation be?
With a dev environment and code kept on a tiny removable storage device, only the incompetent are going to be caught. Perhaps that is enough, though.
This is not hand-writing. I can't see why those statistical markers cannot themselves be reverse-engineered and used to obfuscate the author's code.
I can't see why those statistical markers cannot themselves be reverse-engineered and used to obfuscate the author's code.
Oh, they can. On a sufficiently-fast system, you could even do this in an IDE or toolchain in nearly real time: as the programmer makes changes, the system could show what fingerprints the model identifies in the code, and even suggest changes that would alter the fingerprint (a toy sketch follows the list below).
There are a few obstacles:
- It's some effort to implement something like this. Most people are too lazy, even if they're capable, or simply don't care - it's not a prominent aspect of their threat model.
- It's resource-intensive. Modern IDEs already soak up an idiotic amount of CPU time, I/O bandwidth, etc. Will programmers (of whatever motivation) feel like applying resources to the problem?
- Your adversaries may be using different models. That could mean anything from sufficiently-different training data, to different feature sets, to entirely different classifiers.
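For a rough idea of what such a tool might surface, here's a minimal sketch in C that counts a few surface-level style markers (tabs vs spaces, brace placement) in a source file. To be clear, the actual research reportedly uses much richer features, including ones derived from the code's syntactic structure; everything below is invented purely for illustration.

/*
 * Toy sketch: count a few surface-level "style" markers in a C file.
 * The real research uses far richer features; these heuristics are
 * made up purely for illustration.
 */
#include <stdio.h>
#include <ctype.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s file.c\n", argv[0]);
        return 1;
    }

    FILE *f = fopen(argv[1], "r");
    if (!f) {
        perror("fopen");
        return 1;
    }

    long tabs = 0, spaces = 0, lines = 0;
    long brace_same_line = 0, brace_own_line = 0;
    int c, line_has_code = 0;

    while ((c = fgetc(f)) != EOF) {
        if (c == '\t') tabs++;
        if (c == ' ')  spaces++;
        if (c == '{') {
            /* crude guess: '{' trailing other code vs. on a line of its own */
            if (line_has_code) brace_same_line++;
            else               brace_own_line++;
        }
        if (c == '\n') { lines++; line_has_code = 0; }
        else if (!isspace(c) && c != '{') line_has_code = 1;
    }
    fclose(f);

    printf("lines: %ld\n", lines);
    printf("tab/space ratio: %.3f\n", spaces ? (double)tabs / spaces : 0.0);
    printf("'{' same line: %ld, own line: %ld\n", brace_same_line, brace_own_line);
    return 0;
}

A real tool would feed features like these (and many better ones) into a trained classifier, flag whichever are most identifying, and suggest edits that push them towards the crowd average.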
>They don't scale, no. But if there are 50 coders in a company and a hacker's style matches one of them that person can expect a more thorough investigation.
Yeah but you're dealing with the hacker mentality - they'd be emulating a colleague's ever-so-slightly flawed attempt to copy another colleague's style.
Not only do these methods not necessarily scale, they need an ever-increasing ground truth of identified code for training. This is not trivial to obtain. Besides, as more and more coders are added, you have to worry about the number of degrees of freedom in coding anything, i.e. are there enough different coding styles to distinguish the millions of coders on this planet? You also have to deal with code developed by teams (which is the normal situation), which will either show a mixture of styles, or predominantly show the style of the loudest mouth in the team with a small admixture of the other members. Similarly, what happens when a new coder refactors old stuff? I know I have seriously refactored a program written by some students to adapt it to new use cases, and it is still not really like my own style.
You could of course show that a certain style is consistent with a known sample of some hacker's work, but even then people might slowly change their coding style. Having had a look at some of my earlier efforts, I know I have changed style a great deal (thank goodness ;-)), if only by incorporating OO techniques.
Not only do these methods not necessarily scale, they need an ever-increasing ground truth of identified code for training. This is not trivial to obtain.
No. Unsupervised learning by kernel extension with noisy input is a well-researched and broadly successful area.
And additional input is extremely easy to obtain.
I agree that the current numbers are insufficient, but surely it's worth trying to develop such identification mechanisms? It would be a very powerful network defense tool to have a signature intercept capable of picking up code by known malware authors...
The best and most accessible discussion of the problem of data classification is in a couple of papers by Tom Fawcett. These deal with something called ROC curves. ROC originally stood for "receiver operating characteristic", referring to the ability of a receiver to classify targets in noise. An analogous phenomenon occurs in pattern matching in digital data, where the term "relative operating characteristic" is used. The following link is a good starting point.
http://www.hpl.hp.com/techreports/2003/HPL-2003-4.pdf
The problem boils down to one of true detection and false alarm rates. You can have an arbitrarily high true detection rate if you can live with an arbitrarily high rate of false alarms. You can reduce the number of false alarms to an arbitrarily low level, but only at the cost of missing an arbitrarily large percentage of true targets.
The phrase "No such thing as a free lunch" is occasionally used in the literature to describe this, and ZenCoder's comment is right on the mark.
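To make the trade-off concrete, here's a hedged toy sketch in C: invented scores and labels, with the decision threshold swept from 0 to 1 and the resulting true-detection and false-alarm rates printed at each step - which is essentially how an ROC curve is traced out.

/*
 * Toy illustration of the detection vs false-alarm trade-off.
 * The scores and labels are invented; a real ROC curve would come from a
 * classifier's scores on held-out data.
 */
#include <stdio.h>

int main(void)
{
    /* classifier "scores" (higher = more suspicious) and true labels */
    double score[] = {0.95, 0.90, 0.80, 0.70, 0.65, 0.55, 0.40, 0.30, 0.20, 0.10};
    int    label[] = {1,    1,    0,    1,    0,    1,    0,    0,    0,    0};
    int n = 10, positives = 0, negatives = 0;

    for (int i = 0; i < n; i++) {
        if (label[i]) positives++;
        else          negatives++;
    }

    /* sweep the decision threshold: everything scoring above it gets flagged */
    for (double thr = 0.0; thr < 1.05; thr += 0.1) {
        int tp = 0, fp = 0;
        for (int i = 0; i < n; i++) {
            if (score[i] > thr) {
                if (label[i]) tp++;
                else          fp++;
            }
        }
        printf("threshold %.1f: detection rate %.2f, false-alarm rate %.2f\n",
               thr, (double)tp / positives, (double)fp / negatives);
    }
    return 0;
}

Push the threshold down and you catch every true target but drown in false alarms; push it up and the false alarms vanish along with most of the detections.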
The problem boils down to one of true detection and false alarm rates
I'm amazed that so many Reg readers have time to post comments like this, given the hours y'all must devote to explaining the finer principles of egg-sucking to your grandmothers.
No one credible does machine-learning research without being well-versed in basic concepts like precision and recall rates. That's, like, week 2 of your Introduction to Machine Learning class.
Admittedly a more-sophisticated understanding of statistics is not universal among ML researchers and implementers; folks like Vincent Granville bang on about that incessantly, and they have a point. But that doesn't mean their work is automatically useless, as is trivially demonstrated by the fact that it's very often put to use. Google Translate, say, may be rubbish compared to human translators, but that hasn't stopped people from using it.
Even a 66% success rate is useful in some applications.
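And on precision vs recall vs raw "success rate", a quick worked example with invented numbers, since the base rate matters enormously in this kind of detection problem:

/*
 * Made-up confusion matrix: a detector run over 10,000 samples,
 * of which only 100 are actually malicious.
 */
#include <stdio.h>

int main(void)
{
    double tp = 66;     /* malicious and correctly flagged          */
    double fn = 34;     /* malicious but missed                     */
    double fp = 495;    /* benign but wrongly flagged (5% of 9,900) */
    double tn = 9405;   /* benign and correctly ignored             */

    double recall    = tp / (tp + fn);                   /* 0.66  */
    double precision = tp / (tp + fp);                   /* ~0.12 */
    double accuracy  = (tp + tn) / (tp + tn + fp + fn);  /* ~0.95 */

    printf("recall    %.2f\n", recall);
    printf("precision %.2f\n", precision);
    printf("accuracy  %.2f\n", accuracy);
    return 0;
}

A detector that is "95% accurate" and catches two thirds of the bad guys still produces alerts that are nearly nine-tenths false alarms - which may or may not be acceptable, depending on who has to chase them up.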
If you see behaviour that you think is a bug, you really don't want to put your whole trust in the comments. After all, the comments were likely written by the person who wrote the code (and the bug, if it exists).
But then again, both the code and the comments may well have been updated by "freestylers".
I too detest uncommented code. And I too come from a FORTRAN background, ultimately. But I not only comment the crap out of my code, I also indent and structure the crap out of it. Maybe I'm just not as smart as you; I want to be able to read it in six months, and I can't instantaneously understand unfamiliar code at a glance. I find it much more readable if it's nicely structured.
I feel that you're just trading one type of laziness for another. Making your code readable isn't just for you; it's also for other programmers. And that's part of what I am trying to do. Seems like you're only doing it for yourself.
And if that's true of your coding style, I have to wonder about your comments.
MHO. YMMV.
Fortunately for the rest of us, there is Artistic Style. http://astyle.sourceforge.net/
Oh yeah. I program in assembly regularly. I don't indent - it doesn't feel natural to the language - but I use plenty of other visual breaks and cues, and I meticulously align the comments so that they're easily readable.
But in 'C' and other structured languages I indent, and I use Artistic Style to clean up if I have to shift things around enough that they get crazy.
Neither indentation nor comments are necessarily as useful as is widely believed - most languages have pretty-printers available, after all.
In fact, the best you can hope for is that the comment/indentation is not inconsistent with what the code actually 'means' to the computer.
Indenting is only useful if it shows what the computer thinks, rather than what the programmer thinks.
Example: a bug in a Coral66 program where the preceding 'comment' lacked a terminating semicolon, so the program statement got absorbed into the comment and was therefore absent from the binary.
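For anyone who hasn't been bitten by that class of bug, here's a hypothetical C analogue - an unterminated comment quietly swallows the statement below it, and misleading indentation compounds the confusion:

#include <stdio.h>

/* Hypothetical example: the comment on the line marked (a) was never
   closed, so it runs on until the close-comment marker at the end of
   line (b), and the assignment never makes it into the binary. */

static int armed = 0;

static void arm(void)
{
    /* (a) arm the safety interlock - closing marker forgotten here
    armed = 1;         (b) silently absorbed into the comment above */
}

int main(void)
{
    arm();

    if (armed)
        printf("interlock armed\n");
        printf("system live\n");   /* indented as if guarded, but always runs */

    return 0;
}

The compiler is perfectly happy with all of this; only the indentation and the comments suggest otherwise, which is exactly the point about trusting what the computer thinks rather than what the programmer thinks.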
I do indent my code, but I hate the convention that has the opening curly brace on the same line as the conditional that spawns it, such as:
if (condition == test) {
    doCondition();
    etc...
}
I always indent my code with the opening and closing braces lined up and on their own lines. It makes code blocks easier to spot as well as spacing everything out for easier legibility, like this:
if (condition == test)
{
    doCondition();
    etc...
}
This way bracket highlighting at the cursor makes both braces instantly spottable at the left, rather than having to hunt across lines of code to find the opening curly brace!
Of course this also wouldn't survive compilation, but anyone seeing my source would peg me as its author since I swear I'm the only programmer I know who insists on arranging my braces this way!
"If you swear that you are the only programmer you know who insists on arranging your braces that way, you either haven't programmed very much, or you don't know very many programmers."
Your second guess is the correct one. I've been programming since 1983, when I started with CBM BASIC and then 6510 assembler on the Commodore 64, and went on from there. But it's not a particularly sociable lifestyle, and I'm not a particularly sociable man, so I only know a dozen or so programmers.
But whenever I see code on the internet, whether it's stackoverflow, git or SF, it nearly always has opening curly braces following the conditional rather than on the next line. So I'm sure I can be forgiven for thinking I'm alone in this convention!
About 30 years ago, at a LUUG (London Unix User Group) meeting in a pub, DT asked how an if/else should be formatted. There were 14 of us and 13 different answers; we were all prepared to defend our own style as being the best - all their arguments were wrong, since it was obvious that my own style was the only good one.
Many preferences seem to depend on which languages you cut your programming teeth on and how they were typically laid out.
As regards your examples:
* the opening '{' should be on the line with the 'if'; the '}' ends the 'if', while the '{' is less important and just makes the if body a multi-statement block.
* there should not be a space after the 'if' - why? In my case, because SNOBOL did not allow it.
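Putting those two preferences together, and guessing at details the post doesn't spell out, the layout being argued for looks roughly like this (the names are hypothetical):

if(condition == test) {      /* no space after 'if', '{' on the same line */
    doCondition();
} else {
    doOtherThing();
}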