Exactly
If it's code on the web without any strong typing, then it's most likely example code from Microsoft itself.
There's no such thing as an anonymous programmer: your coding style can unmask you, according to research led by Drexel University Comp. Sci. PhD student Aylin Caliskan-Islam. In work that has serious implications for anyone believing their open source project contributions are anonymous, the researchers find that as many as …
The mainstream idea is that better programmers write shorter and cleaner code, which contradicts the lines-of-code statistics.
It depends what is meant by 'cleaner' in this instance. Introducing an unneeded variable might be considered 'unclean', but it can improve readability, and any halfway decent compiler will optimise it out. In my experience, short and concise code is harder to read, and by trying to be too clever people are more prone to making mistakes.
Unless you're doing embedded coding you can pretty much rely on the compiler to generate better code anyway (especially with languages like C# and Java) so clarity of source is more important than keeping things short.
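On the unneeded-variable point, here's the kind of thing I mean (a made-up loan-payment example; the names are mine, not from any real codebase). Any optimiser folds the intermediates away, and the formula becomes self-describing:

    #include <cmath>

    // Terse version: correct, but the reader has to unpick the expression.
    double payment_terse(double p, double r, int n) {
        return p * r / (1.0 - std::pow(1.0 + r, -n));
    }

    // With the "unneeded" variables: same generated code, clearer source.
    double payment_clear(double principal, double monthly_rate, int months) {
        double discount = std::pow(1.0 + monthly_rate, -months);
        double annuity_factor = 1.0 - discount;
        return principal * monthly_rate / annuity_factor;
    }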
... there's your problem. Too much room for personality.
K&R C and assembler tell the hardware what to do, and don't leave much room for personality to stand out ... at least not when done right. And the result is a hell of a lot faster (and arguably safer from a security standpoint) than code written in C++.
(Before you try to argue with me, ask Cupertino what their kernel is written in.)
Yes, BSD's kernel is written in C.
If Apple had acquired Be rather than NeXT in 1997 (i.e., if they had made their decision on purely technical grounds rather than on who NeXT's CEO was), then their kernel would now be written in C++.
(Haiku, and it's now Open Source, if anyone is curious)
As it is, Cupertino's current device driver and I/O layer is written in C++, and so are many of the low-level libraries unique to OS X. The remainder are C, or Objective-C for less performance-critical ones.
This illustrates only that competent developers use a variety of tools. OS X was not a clean-sheet design; it was actually something of a rush-job as Apple had fallen far behind Sun and Microsoft in OS capabilities and desperately needed to catch up: you have to remember that even in 2000, Mac OS was only co-operatively multi-tasked; so one bad application could often kill your entire system. OS X as released was an amalgamation of many different sources: each was chosen because it was a proven, viable subsystem, not because it was written in the One Holy Language.
But, going back to kernels: The reason why the BSD kernel is written in C is because AT&T's UNIX kernel was written in C, and that was because C was the language that K&R developed specifically to allow their UNIX OS to be portable across AT&T's various system architectures.
FWIW I believe that the Windows kernel is also written in C, but its data structures are objects, so in that sense it is object-orientated.
"You can write OO code in any language if you are perverted enough."
Ah, you've used GLib's GObject, then...
At this point, I'll have to confess to writing quite a bit of OO code in 68000 assembly, although I didn't recognise it as such at the time. (I even had vtables)
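For the curious, hand-rolled OO boils down to something like this (an illustrative sketch of my own, written C-style so the mechanism is visible; nobody's actual 68000 code): a table of function pointers, and objects carrying a pointer to it.

    #include <cstdio>

    struct ShapeVTable {
        double (*area)(const void* self);  // one "virtual method" slot
    };

    struct Circle {
        const ShapeVTable* vtable;  // first field: the hand-rolled vptr
        double radius;
    };

    static double circle_area(const void* self) {
        const Circle* c = static_cast<const Circle*>(self);
        return 3.14159265 * c->radius * c->radius;
    }

    static const ShapeVTable circle_vtable = { circle_area };

    int main() {
        Circle c = { &circle_vtable, 2.0 };
        // "Virtual" dispatch by hand: an indirect call through the table.
        std::printf("area = %f\n", c.vtable->area(&c));
        return 0;
    }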
@Dan55 above on C being "faster". It's not. C++ code runs exactly as fast as the equivalent C code - C++ actually offers a good compiler more optimisation opportunities. Non-virtual method calls are simply C function calls, in-function variables are allocated on the stack just as in C, and exceptions/RTTI can be disabled if your module doesn't require them. (Just specifying throw(); at the end of your method declaration removes the overhead of exceptions in that method, even if the rest of your code uses them.) C99 borrowed a lot of its nice features from C++ (it's a shame that C++ took so long to get a proper "null" in the shape of nullptr).
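To illustrate the exception-specification point, a toy class of my own invention (C++03 syntax; C++11 and later spell the same promise "noexcept"):

    #include <cstddef>

    class Buffer {
    public:
        Buffer() throw() : size_(0) {}
        // Empty exception specification: a promise that the method never
        // throws, so the compiler can skip unwind bookkeeping around it.
        std::size_t size() const throw() { return size_; }
        // Non-virtual, so this compiles to a plain function call, exactly
        // like the equivalent C function taking a struct pointer.
        bool empty() const throw() { return size_ == 0; }
    private:
        std::size_t size_;
    };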
Just because C++ lets you quickly write inefficient code (like copy-by-value parameter passing for superlarge types), it doesn't mean that C++ is itself less efficient; just that some people don't know as much about programming as they think they do. (The small consolation of such dumb behaviour is (a) it's less likely to cause bugs than naive use of pointers, and (b) you can optimise the problem away later.)
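A contrived example of the copy-by-value trap (the names are made up):

    #include <cstddef>
    #include <string>
    #include <vector>

    typedef std::vector<std::string> Wordlist;  // stand-in "superlarge" type

    // Naive: pass-by-value copies the vector and every string in it, per call.
    std::size_t total_length_slow(Wordlist words) {
        std::size_t n = 0;
        for (std::size_t i = 0; i < words.size(); ++i) n += words[i].size();
        return n;
    }

    // Idiomatic: pass-by-const-reference. Identical call syntax, no copy.
    std::size_t total_length_fast(const Wordlist& words) {
        std::size_t n = 0;
        for (std::size_t i = 0; i < words.size(); ++i) n += words[i].size();
        return n;
    }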
I'm happy to accept the argument that C++ leads developers to use less speed-efficient data structures like the STL containers for tasks where a hand-rolled equivalent would be superior, and that C++ code can thus be slower as a result. But that's trading dev-time for run-time. Unlike a hand-rolled data structure, the STL version will work reliably straight away; and in general, I heed Knuth's warning about premature optimisation...
"Uh ... no. Show me real-time code that is written in C++."
Uh ... yes:
http://en.wikipedia.org/wiki/Symbian
Hey, here's two more that you can see the code for:
http://scmrtos.sourceforge.net/ScmRTOS
http://miosix.org/index.html
Whether a kernel is C or C++ depends more on when it was started than on any other factor. C++ is a superset of C; anything that needs C for "efficiency" is just as possible in C++.
Symbian? There's a fail. And it was/is all K&R C.
scmRTOS? That's all K&R C.
miosix? "supports" C++ ... Straight C otherwise.
C++ is indeed a superset of C. That doesn't mean that everything compiled with a "C++" compiler is actually written in C++.
Ha! Does your voice get muffled when you sit down, jake?
We mere mortals don't have your custom build of K&R that can compile namespaces, the 'this' keyword, variable instantiation within sub-scopes, default-value initialisation, function calls using the dot and pointer operators, and templates.
You haven't even looked at the code, have you? There is no C compiler in existence that can compile the projects I cited. That is because they are written in C++.
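To put it concretely, here's a dozen lines using the features listed above (a contrived example of mine, not from the cited projects) that no C compiler will accept:

    #include <iostream>

    namespace geometry {                     // namespaces: not C

    template <typename T>                    // templates: not C
    class Point {
    public:
        Point() : x_(T()), y_(T()) {}        // default-value initialisation
        void move(T dx, T dy) {
            this->x_ += dx;                  // 'this': not C
            this->y_ += dy;
        }
        T x() const { return x_; }
        T y() const { return y_; }
    private:
        T x_, y_;
    };

    }  // namespace geometry

    int main() {
        geometry::Point<double> p;
        p.move(1.0, 2.5);                    // member call via the dot operator
        geometry::Point<double>* q = &p;
        q->move(0.5, 0.5);                   // ...and via the pointer operator
        std::cout << p.x() << ", " << p.y() << "\n";
        return 0;
    }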
For the record, there is also a difference in the output of "C code compiled with a C++ compiler" and "C code compiled with a C compiler". If something is written in C, we will use a C compiler to compile it, because that preserves the other assumptions about C code (particularly symbol naming, but there are other, more subtle differences).
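The symbol-naming point in practice (the function is made up; the mangled form shown is what the GNU toolchain produces under the Itanium C++ ABI):

    // A C compiler emits this symbol as plain "checksum". A C++ compiler
    // mangles it (e.g. "_Z8checksumPKhm" with g++), so a C caller could no
    // longer link against it.
    unsigned checksum(const unsigned char* data, unsigned long len);

    // To get the C symbol name back from a C++ compiler, you have to ask
    // for C linkage explicitly:
    extern "C" unsigned checksum_c(const unsigned char* data, unsigned long len);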
"The remainder are C, or Objective-C for less performance-critical ones."
So the "performance-critical ones" are written in C? Seems to negate the entire rest of your post. Think about it.
"As it is, Cupertino's current device driver and I/O layer is written in C++, and so are many of the low-level libraries unique to OS X."
And the bugs creep in where, exactly? It ain't in the kernel ...
"But, going back to kernels: The reason why the BSD kernel is written in C is because AT&T's UNIX kernel was written in C, and that was because C was the language that K&R developed specifically to allow their UNIX OS to be portable across AT&T's various system architectures."
What you are forgetting (or ignoring) is that nobody has invented anything better than K&R for cross-platform kernel development. It's not inertia, it's reality.
Err.. NO
You can recognize a personality in any language. It takes me a split second to look at a piece of Linux kernel code and say Al Viro, Theodore Ts'o, or "Not Alan Cox again... time to look for an obscure logic error somewhere". That is C for you, as an example.
Similarly, I can recognize in a split second the style of various people I have worked with in Python, Perl, Java, etc. Even projects with vicious style requirements (kvm/qemu) still show the distinctive style of key contributors, making the author instantly recognizable.
What is more interesting is how this handles the evolution of a person's coding technique over time and across projects. For example, my code from before I worked on kvm/qemu for a while and from after reads as if written by two different people.
Anonymous... Just for the fun of "recognize me programmatically" :) By el-reg posting style...
"Err.. NO
You can recognize a personality in any language."
Likewise in mainframe assembler. I can recognise fellow coders by the instructions they use (versus the alternatives) and the way they structure their logic. The most telling thing, though, is the 'shape' of the code and the commenting style - verbose or no comments, instructions and comments neat or higgledy-piggledy.
It does help, though, that we use initials in comment tags, so 40 years of modifications to a program can be laid bare...
That competitors who complete more tasks in coding competitions have, on average, longer programs than those who complete fewer tasks is not surprising. Mark Twain is credited with ending a letter with "I apologize for the length of this letter. If I had had more time, it would have been shorter". The same is true of programming: it takes more time to write shorter code. It is often faster to cut-and-paste and make local modifications than to write a parameterised procedure covering all cases, and sometimes it is faster to special-case different inputs than to build a general solution, which often requires insights that take too long to obtain when you are pressed for time. And you certainly don't want to spend time simplifying code that already works.

Good competition programmers also often have a standard skeleton program that they modify for each task, because that is faster than starting from scratch. So there will often be procedures that the programmers don't bother to remove even when unused. They do no harm, so why spend time removing them?
Coding competitions are very different from normal programming: The problems are small and self-contained, so you don't have to worry about modularisation or readability of the code (in a few hours, nobody will ever look at the code again), and the process is more explorative than normal coding. So you can't draw conclusions about general coding style from such competitions.
Most code obfuscation is done at the lexical level: whitespace and comments are eliminated, variables and procedures are renamed, macros are expanded, and so on. As mentioned in the article, such tools cannot hide coding style, which goes far beyond lexical details. So a good obfuscation tool must work at the semantic level of the program: it must replace code with semantically equivalent code, using more than just local syntactic or lexical transformations. This is very difficult to do, especially if the language semantics are loosely specified (*cough* C *cough*). Writing such a tool is (at least) as complicated as writing a compiler, which is why it is rarely done. But there is research that points the way: http://dl.acm.org/citation.cfm?id=2103761
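A trivial illustration of what "semantically equivalent" means here (my own toy example, far below what a real semantic obfuscator does):

    // Original: an idiomatic summation loop.
    int sum(const int* a, int n) {
        int total = 0;
        for (int i = 0; i < n; ++i)
            total += a[i];
        return total;
    }

    // Semantically equivalent rewrite: same result for every input, but the
    // control structure, iteration order and accumulator idiom all differ -
    // exactly the habit-level signal a stylometric model keys on.
    int sum_rewritten(const int* a, int n) {
        int total = 0;
        const int* p = a + n;
        while (p != a) {
            --p;
            total = *p + total;
        }
        return total;
    }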
The article suggests that obfuscation tools can't beat their analysis, but as one of the things their analysis uses is lexical information, then presumably obfuscation at least makes it harder to get a match, although still not impossible - presumably it would need larger samples to work on? If it works perfectly well without lexical information, then why waste time using lexical information in the first place?
Is it really that hard, assuming you don't mind discarding a few obscure optimisation opportunities? LLVM already has a C back-end.
The paper you pointed at makes a program "harder to understand or analyze". But that's presumably not required for merely disguising the author.
We tried commercial semantic obfuscators. The result passed all our regression tests but really pooched our benchmarks -- we sold scientific libraries that were *very* time critical -- so we stuck with lexical obfuscators for source sales.
Now I am intrigued enough to actually read the original paper.
"Most code obfuscation is done at the lexical level"
Most of the obfuscation in the code I'm looking at just now seems to have been done by someone who has no business being near a compiler. Any metric that decides this lengthy, monolithic spaghetti code is more productive than a properly architected version with a tenth of the code is a metric that discounts the software life cycle in favour of initial coding time alone.
12 projects full of 2000+ line classes, everything is concrete, nothing is abstract, there are no interfaces, no patterns... I'm not sure an obfuscator could make things any more difficult to work with. It's just awful.
I wonder how easily this tool would identify someone who knows he might be profiled and consciously tries to stay away from some of his own known habits. Obviously, trying to do this many different times would be a short route to the nuthouse, but perhaps it would work for one or two specific known-dangerous things to contribute to, as a departure from one's "normal" coding style. You know, start using else-ifs instead of switches, suddenly pick up a preference for Hungarian notation, pass everything through GNU indent at default settings, that sort of thing...
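The sort of habit-swap I mean (a contrived example of my own; the two functions agree for scores in 0..100, but a style profiler would see two different fingerprints):

    // Habitual version: switch on the band.
    const char* grade(int score) {
        switch (score / 10) {
            case 10:
            case 9:  return "A";
            case 8:  return "B";
            case 7:  return "C";
            default: return "F";
        }
    }

    // Deliberate departure: else-if chain, Hungarian-flavoured naming.
    const char* grade_disguised(int nScore) {
        int nBand = nScore / 10;
        if (nBand >= 9)      return "A";
        else if (nBand == 8) return "B";
        else if (nBand == 7) return "C";
        else                 return "F";
    }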
It doesn't take long, when working on a team project where you see code from other people, to learn to recognize which code came from whom. It's kinda neat that they can teach a computer to do it, but I've been in plenty of situations that demonstrated the concept before. You know it's going to be a long day of bug hunting when you recognize a particularly bad programmer's style.
"You know it's going to be a long day of bug hunting when you recognize a particularly bad programmer's style"
Absolutely agree. Been at my current employer for 17 years. There are a few coders (shell, Java, app config, COBOL (yes, it's still here), JCL, C, C++, Python, Perl and (ick) Ruby) that I know by sight. Some have left and I still know "who wrote that". There was a project recently that had been "dropped on the ground" by management shuffles; at some point it became a critical issue and, well, from the comment style in an Apache virtual host config I knew who had set it up... And don't get me started on the Tomcat application config - *that* told me which team had put it in...
(I might be a linux nut, but yes, I still look at the mainframe once in a blue moon too)
"its techniques could be used to identify plagiarism among computer science students"
I found that laying one printout next to the other was an adequate technique!
Though, it is true that the spaces and tabs were a giveaway, when the indentation was, shall we say, merely decorative.
Although I'd imagine you'd get a lot of false positives: many students will have coding styles very similar to their teachers', and since they are all trying to solve the same, usually trivial, problem, a lot of the submissions are bound to look very similar to one another.
It can go in the other direction too. My coding style certainly changed over the semester as I learned what worked for me and what didn't. Begin/end and brace placement changed heavily over the semester, as did my indentation width.
It happens as one learns a new language as well. I could definitely look at an early program in a new-to-me language and tell what other language was influencing my style at the time.
"I found that laying one printout next to the other was an adequate technique!
Though, it is true that the spaces and tabs were a giveaway, when the indentation was, shall we say, merely decorative."
Those things are easily altered. I can't help but think back to my own time at Uni - the University of Manchester - where all code went into John Latham's ARCADE system, which detected plagiarism. That system is over 20 years old now.
He did explain how it worked a couple of times, and although he never used those terms, it seemed to perform a lexical analysis first and then consider the resulting token stream. Comments, whitespace, variable names etc. were thrown out straight away as trivially easy to alter. Instead it simply looked at a sequence like "identifier, multiply, constant, terminator...", which is much more difficult to alter in a non-trivial manner since it is intrinsically linked to how the program works.
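I'd guess the core of it looked something like this (my own sketch, certainly not John Latham's actual code; it even lumps keywords in with identifiers): reduce the source to a stream of token classes, so renaming variables or reflowing whitespace changes nothing.

    #include <cctype>
    #include <iostream>
    #include <string>
    #include <vector>

    std::vector<std::string> tokenClasses(const std::string& src) {
        std::vector<std::string> out;
        std::size_t i = 0;
        while (i < src.size()) {
            if (std::isspace((unsigned char)src[i])) { ++i; continue; }  // whitespace: dropped
            if (std::isalpha((unsigned char)src[i]) || src[i] == '_') {  // identifiers (and keywords)
                while (i < src.size() && (std::isalnum((unsigned char)src[i]) || src[i] == '_')) ++i;
                out.push_back("IDENT");
            } else if (std::isdigit((unsigned char)src[i])) {            // numeric literals
                while (i < src.size() && std::isdigit((unsigned char)src[i])) ++i;
                out.push_back("CONST");
            } else {                                                     // operators, punctuation
                out.push_back(std::string(1, src[i]));
                ++i;
            }
        }
        return out;
    }

    int main() {
        // Both lines reduce to: IDENT = IDENT * CONST ;
        std::string a = "total = price * 3;";
        std::string b = "x=y*42   ;";
        std::vector<std::string> ta = tokenClasses(a), tb = tokenClasses(b);
        std::cout << (ta == tb ? "same token stream" : "different") << "\n";
        return 0;
    }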
int main(enter the void)
...