
Happy new year, and here's a pint for you guys for the reference to the original book form of "Total Recall"...!
EDIT: ...also, I suppose these good folks just proved beyond a shadow of a doubt that programming IS indeed a form of art.
Anonymous programmers, from malware writers to copyright infringers and those baiting governments with censorship-foiling software, may all be unveiled using stylistic programming traits which survive into the compiled binaries – regardless of common obfuscation methods. The work, titled De-anonymizing …
given the subject, this brings to mind a short story by PKD (can't recall the title) where a robot investigates a murder scene and gradually narrows down the possible culprits by trawling a database of characteristics for the population vs evidence it finds at the scene. I believe that it turns out that the whole evidence set was being gamed from the start.
The question is, how thorough can your investigation be?
With a dev environment and code kept on a tiny removable storage device, only the incompetent are going to be caught. Perhaps that is enough, though.
This is not handwriting. I can't see why those statistical markers cannot themselves be reverse-engineered and used to obfuscate the author's code.
> I can't see why those statistical markers cannot themselves be reverse-engineered and used to obfuscate the author's code.
Oh, they can. On a sufficiently fast system, you could even do this in an IDE or toolchain in nearly real time: as the programmer makes changes, the system could show what fingerprints the model identifies in the code, and even suggest changes that would alter the fingerprint (a rough sketch of the idea follows the list below).
There are a few obstacles:
- It's some effort to implement something like this. Most people are too lazy, even if they're capable, or simply don't care - it's not a prominent aspect of their threat model.
- It's resource-intensive. Modern IDEs already soak up an idiotic amount of CPU time, I/O bandwidth, etc. Will programmers (of whatever motivation) feel like applying resources to the problem?
- Your adversaries may be using different models. That could mean anything from sufficiently-different training data, to different feature sets, to entirely different classifiers.
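As a very rough sketch of what such a "fingerprint linter" might look like (everything here is hypothetical: a toy lexical feature extractor and any scikit-learn-style classifier exposing predict_proba; the paper's real features are far richer):

# Sketch: near-real-time fingerprint feedback in an editor save-hook.
# 'clf' is assumed to be a pre-trained stylometry classifier, e.g. a
# scikit-learn RandomForestClassifier; the feature set is a stand-in.
import re
from collections import Counter

FEATURES = ["if", "while", "for", "goto", "{", "}", "->", "++"]

def extract_features(source):
    """Crude lexical vector: relative frequency of a few tokens."""
    tokens = re.findall(r"\w+|->|\+\+|[^\s\w]", source)
    counts = Counter(tokens)
    total = max(len(tokens), 1)
    return [counts[f] / total for f in FEATURES]

def fingerprint_report(source, clf, author_names):
    """On each save, report whom the model currently thinks wrote this."""
    probs = clf.predict_proba([extract_features(source)])[0]
    ranked = sorted(zip(author_names, probs), key=lambda p: -p[1])
    return ranked[:3]  # top three matches, for the IDE to display

The editor would call fingerprint_report on every save and warn when the programmer's own name floats to the top of the list.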
> They don't scale, no. But if there are 50 coders in a company and a hacker's style matches one of them, that person can expect a more thorough investigation.
Yeah, but you're dealing with the hacker mentality - they'd be emulating a colleague's ever-so-slightly flawed attempt to copy another colleague's style.
Not only do these methods not necessarily scale, they need an ever-increasing ground truth of identified code for training. This is not trivial to obtain. Besides, as more and more coders are added, you have to worry about the number of degrees of freedom in coding anything, i.e. are there enough different coding styles to distinguish the millions of coders on this planet? Besides, you have to deal with code developed by teams (which is the normal situation), which will either show a mixture of styles, or predominantly show the style of the loudest mouth in the team, with a small admixture of the other members. Similarly, what happens when a new coder refactors old stuff? I know I have seriously refactored a program written by some students to adapt it to new use cases. It is still not really like my own style.
You could of course show that a certain style is consistent with a known sample of some hacker's work, but even then people might slowly change their coding style. Having had a look at some of my earlier efforts, I know I have changed style a great deal (thank goodness ;-)), if only by incorporating OO techniques.
> Not only do these methods not necessarily scale, they need an ever-increasing ground truth of identified code for training. This is not trivial to obtain.
No. Unsupervised learning by kernel extension with noisy input is a well-researched and broadly successful area.
And additional input is extremely easy to obtain.
I agree that the current numbers are insufficient, but surely it's worth trying to develop such identification mechanisms? It would be a very powerful network defense tool to have a signature intercept capable of picking up code by known malware authors...
The best and most accessible discussion of the problem of data classification is in a couple of papers by Tom Fawcett. These deal with something called ROC curves. ROC originally stood for "receiver operating characteristic", referring to the ability of a receiver to classify targets in noise. An analogous phenomenon occurs in pattern matching in digital data, where the term "relative operating characteristic" is used. The following link is a good starting point.
http://www.hpl.hp.com/techreports/2003/HPL-2003-4.pdf
The problem boils down to one of true detection and false alarm rates. You can have an arbitrarily high true detection rate if you can live with an arbitrarily high rate of false alarms. You can reduce the number of false alarms to an arbitrarily low level, but only at the cost of missing an arbitrarily large percentage of true targets.
The phrase "No such thing as a free lunch" is occasionally used in the literature to describe this and ZenCoder's comment is right to the mark.
> The problem boils down to one of true detection and false alarm rates
I'm amazed that so many Reg readers have time to post comments like this, given the hours y'all must devote to explaining the finer principles of egg-sucking to your grandmothers.
No one credible does machine-learning research without being well-versed in basic concepts like precision and recall rates. That's, like, week 2 of your Introduction to Machine Learning class.
Admittedly a more-sophisticated understanding of statistics is not universal among ML researchers and implementers; folks like Vincent Granville bang on about that incessantly, and they have a point. But that doesn't mean their work is automatically useless, as is trivially demonstrated by the fact that it's very often put to use. Google Translate, say, may be rubbish compared to human translators, but that hasn't stopped people from using it.
Even a 66% success rate is useful in some applications.
If you see behaviour that you think is a bug, you really don't want to put your whole trust in the comments. After all, the comments were likely written by the person who wrote the code (and the bug, if it exists).
But then again, both the code and the comments may well have been updated by "freestylers".
I too detest uncommented code. And I too come from a FORTRAN background, ultimately. But I not only comment the crap out of my code, I also indent and structure the crap out of it. Maybe I'm just not as smart as you; I want to be able to read it in six months, and I can't instantaneously understand unfamiliar code at a glance. I find it much more readable if it's nicely structured.
I feel that you're just trading one type of laziness for another. Making your code readable isn't just for you; it's also for other programmers. And that's part of what I am trying to do. Seems like you're only doing it for yourself.
And if that's true of your coding style, I have to wonder about your comments.
MHO. YMMV.
Fortunately for the rest of us, there is Artistic Style. http://astyle.sourceforge.net/
Oh yeah. I program in assembly regularly. I don't indent, it doesn't feel natural to the language, but I use plenty of other visual breaks and clues, and I meticulously align the comments such that they're easily readable.
But 'C' and other structured languages, I indent. And use Artistic Style to clean up if I have to shift things around enough that they get crazy.
Neither indentation nor comments are necessarily as useful as is widely believed; most languages have pretty-printing available, after all.
In fact, the best you can hope for is that the comment/indentation is not inconsistent with what the code actually 'means' to the computer.
Indenting is only useful if it shows what the computer thinks, rather than what the programmer thinks.
Example: a bug in a Coral66 program, where the preceding 'comment' lacked a terminating semicolon. So the program statement got absorbed into the comment and was therefore absent from the binary.
I do indent my code, but I hate the convention that has the opening curly brace on the same line as the conditional that spawns it, such as:
if (condition == test) {
    doCondition();
    etc...
}
I always indent my code with the opening and closing braces lined up and on their own lines. It makes code blocks easier to spot as well as spacing everything out for easier legibility, like this:
if (condition == test)
{
    doCondition();
    etc...
}
This way bracket highlighting at the cursor makes both braces instantly spottable at the left, rather than having to hunt across lines of code to find the opening curly brace!
Of course this also wouldn't survive compilation, but anyone seeing my source would peg me as its author since I swear I'm the only programmer I know who insists on arranging my braces this way!
"If you swear that you are the only programmer you know who insists on arranging your braces that way, you either haven't programmed very much, or you don't know very many programmers."
Your second guess is the correct one. I've been programming since 1983, when I started with first CBM BASIC and then 6510 assembler on the Commodore 64, and went on from there. But it's not a particularly sociable lifestyle, and I'm not a particularly sociable man, so I only know a dozen or so programmers.
But whenever I see code on the internet, whether it's stackoverflow, git or SF, it nearly always has opening curly braces following the conditional rather than on the next line. So I'm sure I can be forgiven for thinking I'm alone in this convention!
About 30 years ago, at a LUUG (London Unix User Group) meeting in a pub, DT asked how an if/else should be formatted. There were 14 of us and 13 different answers; we were all prepared to defend our own style as the best - all their arguments were wrong, since it was obvious that my own style was the only good one.
Many preferences seem to depend on which languages you cut your programming teeth on, and how they were conventionally laid out.
As regards your examples:
* the opening '{' should be on the line with the 'if': the '}' ends the 'if', while the '{' is less important and just makes the if body multi-statement.
* there should not be a space after the 'if' - why? In my case, because SNOBOL did not allow it.
I could draw (paint?) analogies between programming and being a painter (for example). Art historians can identify painters by characteristics of a painting, including how a painter's style evolves over a period of time. There are problems with attribution, though, with students of a painter adopting characteristics similar to their tutor's, and the input that students have in helping the master with the incidentals of a work of art (e.g., Gainsborough getting his assistants to do the landscape background while he concentrated on the portrait). Then there are new techniques: new types of paint and canvas (Hockney moving from conventional canvas to photographic collage and then to tablet being a good example) which necessitate a change in style - analogous to a new or updated programming language installed on a different PC or with a different targeted platform.
As mentioned earlier, Stack Overflow copy-and-paste is an example of how things change in the programmer's world: a piece of coding that is homogeneously constructed, sporadically interspersed with anachronistic styles where sites such as Stack Overflow have been dipped into for inspiration. Then future works by the same programmer, where those code snippets are bedded into the coder's customary style.
"we can de-anonymize them from optimized executable binaries with 64 per cent accuracy."
That's slightly better than I can do if I flip a coin - let's look at this from a different angle - there are 10 hamburgers in front of you, 3 or 4 of them have botulism ... are you hungry?
If I have a suspect pool of 20 or 30 programmers then identifying the author with 68% accuracy would be very useful - in a sinister way. I'm sure you could use the technique to rank the authors in order from most likely to least and then investigate further from the top, as sketched below. A lot more efficient than simply investigating everyone.
As the presenter points out in the video, governments have used similar techniques to identify and prosecute programmers that contributed to "illegal" websites.
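To make "rank, then investigate" concrete, here is a rough sketch; clf, the feature vector and the suspect list are all hypothetical stand-ins, not the researchers' actual tooling:

# Sketch: use the classifier's probabilities to order an investigation
# rather than to deliver a verdict.
import numpy as np

def rank_suspects(clf, binary_features, suspects):
    """Return (name, probability) pairs, most likely author first."""
    probs = clf.predict_proba([binary_features])[0]
    order = np.argsort(probs)[::-1]
    return [(suspects[i], float(probs[i])) for i in order]

# Investigators work down the list; even an imperfect model
# front-loads the likely authors instead of interviewing at random.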
> A lot more efficient than simply investigating everyone.
Yes - and with Machine Learning it is pretty damn hard to work out how the machine actually reached its conclusions (it's a research subject), which makes it all the easier to fudge the results to narc out "the right people" and get away with it too. Especially if we are not exactly talking legal proceedings but are more in the territory of "no-fly lists" and "signature strikes"*.
Remember: if a computer says something is true, it is!
*) Or maybe not, plod is as dumb as a sack of broken hammers when IT is involved.
"You've got a 100-sided coin that lands the same way 64% of the time?"
OK - let's put it another way - of the 100 people investigated, and charged with writing the infringing application, 34 of them will be completely innocent and the chances are not good that 32 of the others had anything to do with the application either.
> OK - let's put it another way - of the 100 people investigated, and charged with writing the infringing application, 34 of them will be completely innocent and the chances are not good that 32 of the others had anything to do with the application either.
I cannot for the life of me figure out what scenario you are describing, but it doesn't appear to be at all related to anything described in the paper.
First, they're talking about single authorship, so of the hypothetical "100 people investigated" (by, apparently, the world's least-competent police force), only zero or one would be guilty, and at least 99 would be innocent.
Second, let's assume the 0.64 accuracy rate does extend to some pool of 100 candidates that the model has been trained on, and the single guilty party is among them. The classifier is presented with input and indicates candidate A is the closest match. Disregarding all other factors, for some reason, the investigators interview candidate A. There's a 0.64 chance they have the guilty party, and a 0.36 chance they don't. So what? It's a place to start. Picking a starting interviewee at random has only a 0.01 chance of being correct, so they've improved their odds significantly.
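To put a number on that improvement (the figures are just those from the paragraph above; this is only the arithmetic, not anything from the paper):

p_model = 0.64      # chance the classifier's top match is the author
p_random = 1 / 100  # chance a random starting interviewee is
print(f"lift over a random start: {p_model / p_random:.0f}x")  # 64x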
Third, the hypothetical suggestion that someone might make stupid decisions based on weak evidence doesn't negate the importance of that evidence. A Perfect Bayesian Reasoner already knew it was weak, and treated it as such. Any other process for accounting for that evidence is inferior, but that's not the fault of the evidence. Nor does that suggestion vacate the importance of the mechanism used to extract that evidence, or of the research that led to the mechanism.
We see in posts like yours a typical Reg commentator fallacy: if there's any objection that can be raised to research, then that research is useless. It's tiresome, sophomoric anti-intellectualism.
> That's slightly better than I can do if I flip a coin
It's significantly better, and that's only with two alternatives. If they're identifying among a pool of 3 candidates with 64% accuracy, then they're almost twice as accurate as your coin. And so on.
And for this paper, their pool of candidates was 20 programmers.
But thanks for playing.
I'd much rather have well-commented code than indents, and my editor of choice can soon indent code for me if I need it to. Personally I use both, but then from a very early age I had to support someone else's code - someone who never put a single comment in their code, never mind followed change control procedures (which were very minimal), and who did very limited testing, but management thought the world of them. It was also the age of short variable names, which didn't help. Overall I was very lucky in the environment I worked in then, as it taught me many lessons which I used throughout my career to improve procedures, fault-solve and debug. My comments are for me as much as anyone else, as I don't expect to instantly remember n years later why I did something in a particular way - which could often be due to a bug in the compiler or OS at the time.
Happy New Year, One and All. And does not the tale we comment on here not advise us that all systems are vulnerable, and both practically and virtually indefensible and therefore always susceptible to disruptive exploitation which in extremis can be command takeover and makeover controlling?
And there is nothing really effective to be done to halt the progress?
Methinks, we all know that it does. And that makes for interesting future space place programming. :-)
So what Mr Aylin is saying is that when I write my nefarious program of dastardliness, I should run it through a source filter first to emulate someone else's coding idiosyncrasies (like 1980s_coder's lack of indentation), or, less maliciously, run it through a source minifier?
Hmmm...
Er... indentation and 'minifying' won't affect the compiled code one iota. To make your code look like somebody else's code you need to *think* like they do. (Insert reference to bad Clint Eastwood movie 'Firefox' here.)
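To illustrate the point, a toy demonstration (in Python, purely because it makes the demo compact; the same holds for a C compiler's front end): two differently spaced, differently commented versions of the same function parse to an identical syntax tree, so none of the layout ever reaches the binary.

# Comments and spacing are discarded before code generation: both
# sources below produce the same abstract syntax tree.
import ast

tidy = """
def total(xs):
    # well commented, nicely spaced
    result = 0
    for x in xs:
        result = result + x
    return result
"""

terse = """
def total(xs):
 result=0
 for x in xs:result=result+x
 return result
"""

print(ast.dump(ast.parse(tidy)) == ast.dump(ast.parse(terse)))  # True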
When it comes to pasting stuff from Stack Overflow (otherwise known as "I'm an incompetent freelancer, please do my job for me"), I doubt many malicious coders would go there. Cracking the problem is all the fun for them, and they tend to work alone and very idiosyncratically. OTOH they're the people most likely to find a way to anonymise themselves against this type of analysis. Probably won't be long before we see 'GACC' (Gnu Anonymising C Compiler) appear....
Hi, anonymous boring coward,
Research being publicised and published is what terrifies systems in operations for command and control, and why captive mainstream media outlets are so terribly entertaining rather than surprisingly educational.
However, an ignorant world is an increasingly dangerous place though, and especially so for the likes of that and those responsible for grand deceits and failing virtual reality programs?! ...... New American Century Projects ..... because such a state of ignorant ruling affairs is not natural or acceptable to wiser beings minded to change things remotely and relatively anonymously
And who and/or what be the postmodern, latter day Hitlerian Saints and Immaculate Sinners in those Versions with Vision and Provisions for New World Order Programming ......... Mass Premeditated and Premoderated and Mediated Mind Command and Control? Any concrete ideas or wild crazy guesses?
IT and they haven’t gone away, you know, ….. such as would be with AI, Immaculately Resources Assets of Universal Virtual Force, although certainly quite different from what one may have presumed to be leading from before.
just adjust yur style two throwoff the analisys' in those cases when ur writing malware
simplest thing in teh world. ... Anonymous Coward
Although, of course, in not such an alarmingly different manner as that, AC, if one is destined to be really effective and remain continually highly disruptive, buried deep and delving within deserving systems and/or failed exclusive executive order administrations.
The crack magic trick is, is it not, to be practically invisible and virtually omnipotent/anonymous and almighty, and that has one appearing to be most meek and unrecognisable in plain text sight. Then can there be heavenly fireworks with immaculate displays of alternative explosive worth.
Such does make one though, in the eyes, hearts and minds of those in the know and in the need to know, both extremely valuable and marvellously dangerous. It is not a pleasant place or comfortable space for anyone or everyone.
>just adjust yur style two throwoff the analisys' in those cases when ur writing malware
simplest thing in teh world.<
Which reminds me: think of a program as being an iceberg: the majority of it lies beneath the visible surface as far as those who interact with it (the average user of that app) are concerned. But what is on the surface can sometimes give some good clues as to what lies beneath. If the person I have quoted above (sorry to pick on you m8, but you are AC anyway so unidentifiable, and I have a feeling you've adjusted your style to demonstrate your point - you're really William Shakespeare, aren't you?) were a malware writer, then they would need to pay attention to detail: if they were hacking a banking app, I don't think people would be inclined to believe their request to "Clik hear 2 verfy who u r". Sometimes with spam emails it is possible to tell, not just from the occasional typo but from the sentence construction, not only that this is a scam but also the nationality of the scammer.
There was a phase where malware was put through something like UPX to obfuscate its contents, but anyone trying to work out the legitimacy of such executables on their PCs could use a hex editor to look at the headers (is Microsoft using UPX now? I don't think so ((presses delete key))). I think anti-malware software reaches a similar conclusion.
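As a rough automation of that hex-editor check, a sketch that scans the start of a file for the string markers UPX typically leaves behind (the "first 4096 bytes" window is my assumption, and a packer can of course strip these strings):

# UPX-packed binaries usually carry visible markers such as the
# "UPX!" magic and the section names "UPX0"/"UPX1" near the header.
def looks_upx_packed(path):
    with open(path, "rb") as f:
        header = f.read(4096)
    return any(m in header for m in (b"UPX!", b"UPX0", b"UPX1"))

# looks_upx_packed("suspicious.exe") -> True suggests a packed binary
# worth a closer look; False proves nothing.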
Here's some bashing, because this really deserves it. Something like this could only be dreamt of and started by those who don't understand programming.
1) A 64% chance of deanonymising a small, hand-picked sample set of 100 programmers, presumably with wildly distinctive ways of programming, is utterly useless. How many programmers are there in the world? I very much expect the accuracy to drop off a cliff past a certain point.
2) Programmers' coding styles evolve: they evolve as they get better, they evolve when hardware changes, they evolve depending on how much alcohol they've had.
3) Right now their accuracy is what it is, but I presume this changes drastically depending on what compiler is used. As compilers get even better at optimising, their accuracy will drop.
4) Sure, there may still be "traits", like one programmer preferring one data structure or control structure over another, but let's not forget how many programmers or libraries one can draw on. It'd be completely pointless to try to attribute a binary that's 80% compiled from open-source libraries, and I expect the accuracy will drop even further.
5) None of this helps the authorities to catch or identify those responsible. The sophisticated ones will learn to mimic, just as they now deliberately leave "Chinese/Russian/English" code-comment leftovers or appear to originate from a "North Korean IP". The sly ones _NEVER_ make it obvious it's them.
Common sense and logic will tell you all this, without any of the resources that have been poured into researching it.
Half or more of the stuff about "cutting-edge" computer security threats is snake oil, served up either to gain more funding through fear, or for political purposes, to pass liberty-eroding legislation.
I was interested to see this attract so much attention. I'd thought the pretty-printer tools meant you could code how you liked and then format it how your organisation, or team leader, or girlfriend's dad, would find acceptable. Me, I like the vertical alignment of {}, but I'm old enough now to realise that's just me; I can't instantly see the opening brace that goes with a particular closing one unless it's directly above. But modern editors solve that: highlight one brace and it highlights the other, however aligned. And if I have to work on something for long I can always pretty-print it "my" way to make it easier, and - theoretically - re-pretty-print it with a different set of preferences afterwards.
I'm left with only one major gripe, and that's Python, where indentation is part of the language. I thought that was a bad idea in makefiles, and I see no excuse for it anywhere. Mr Python wanted to impose his own indentation preference, and didn't like all that fiddly punctuation noise - well, IMHO a crappy set of requirements for a language. I'm disappointed to see it hasn't faded into the obscurity it deserved.
While I'm here, I have a minor aversion to anything "optional". Semicolons in scripting languages, that sort of thing. To me there should be just one correct way to write the syntax, not a lot of woolly alternatives that produce the same compiled code. Names excepted, of course. I used to like Java til it got over-bloated (around Java 2 or so), I liked Pascal and Modula once-upon-a-time, and now I like Erlang, which has the nit-pickiest compiler I ever met, but once you know the syntax it's trivial, and there's never any doubt about whether you need a punctuation character or can get away without it.
I like to keep the entire language spec in my head. Good luck doing that with C++
:-) Quite so, SISk. AIMagiCQ roads are Absolutely Fabulous Fabless Advanced IntelAIgent Route and AIRoutes to Perfect Enough Virtual Reality Root in All Manner of Master Spider Webs ....... Phormer Networks with Exclusive Orderly Executive Administration Rights and Ab Fab Fabless Permissions.
For All Manner of Virtualisations in Future Presentations ........ Expanding Time Lines .... MagiCQ Trails in Immaculate Tales?:-)