IBM compiles dataset to teach software how software is made: 14m code samples, half of which actually work

IBM has assembled a massive silo of source code for teaching machine-learning programs about programming. Dubbed Project CodeNet, the set contains, we're told, 14 million code samples totaling 500 million lines in more than 55 programming languages, from Java, C, and Go to COBOL, Pascal, and FORTRAN. Truth be told, more than …

  1. vtcodger Silver badge

    Work Product?

    ...half of which actaully work

    Is the Reg's spell checking perchance done by one of CodeNet's work products?

  2. Pete 2 Silver badge

    correct, secure, fast - choose one.

    > About half of the samples work as expected (hopefully the authors did not expect it to fail?)

    Functionality is nice, but to do it securely is better. If this IBM data can be used to re-write code so that it is hardened against hacks, then it might have some use.

    And best of all, is if the code can be made to work efficiently and without bloat.

    1. This post has been deleted by its author

  3. don't you hate it when you lose your account

    Full disclosure

    Is the Horizon Post Office code in there? And will it be used for ethics training?

  4. Whitter

    AIs are very context sensitive

    So they intend to build an AI that might help me code one of two specific tasks?

    1. RobLang

      Re: AIs are very context sensitive

      Not necessarily. It depends on the encoding of the training set. If you encode such that the neural network learns only the surface structure of the text, then it will do just as you say. However, if you encode the training data such that it learns the grammatical forms of function and process, then it should be able to spot good and bad programming without the context of the tasks that IBM set.
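      The distinction above can be sketched in a few lines. This is a hypothetical illustration (not how CodeNet actually encodes anything): encoding A keeps only the flat token text of a sample, while encoding B keeps the node types of its syntax tree, i.e. its grammatical form independent of any particular contest task.

      ```python
      # Hypothetical sketch of two encodings of the same code sample.
      import ast
      import io
      import tokenize

      src = "def add(a, b):\n    return a + b\n"

      # Encoding A: flat token stream -- a model trained on this mostly
      # learns the surface structure of the text.
      tokens = [t.string
                for t in tokenize.generate_tokens(io.StringIO(src).readline)
                if t.string.strip()]

      # Encoding B: syntax-tree node types -- a model trained on this
      # learns grammatical form rather than task-specific surface detail.
      node_types = [type(n).__name__ for n in ast.walk(ast.parse(src))]

      print(tokens)      # surface text: ['def', 'add', '(', 'a', ...]
      print(node_types)  # structure: ['Module', 'FunctionDef', ...]
      ```

      Two models trained on these encodings would generalise very differently: A is tied to identifiers and literals, B to programming constructs.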

  5. steelpillow Silver badge

    Flying a Kite

    All IBM need to do now is to specify CodeNet in terms of inputs and expected outputs, and feed it back to itself.

  6. Howard Sway Silver badge

    it was collected from entries submitted to two programming contests

    I find the number of entries to these contests rather staggering, but this is the ultimate form of design by committee. If choosing only on the basis of correct output from input, the system is going to label no end of bad, misguided and downright risky techniques and solutions as "good".

    And I don't want that anywhere near my software, either the stuff I write or run.

    I'm still sceptical about many of the efforts in this area, as to me good programming is much more about clean design than simply coding to produce expected output, and good work is much better identified by someone who appreciates this. At the very least I'd hope they trained it exclusively on examples which skilled people agree are good and demonstrate best practices.

  7. trevorde Silver badge

    What IBM did next

    Use this AI to automatically write their offshored code, thereby making more offshored staff redundant. Code quality is still poor but customers do not notice. IBM announces record profits. Ginni gets a new helicopter.

    1. yetanotheraoc Silver badge

      Re: What IBM did next

      You nailed it. They are training the AI to copy/paste code, so the code quality will be exactly the same as the current outsourced solutions. When the code doesn't work as expected (you can read that two ways), they will train the AI to ask questions on stackoverflow.

  8. picturethis
    It seems like this effort is akin to converting a PLC's (programmable logic controller) ladder logic into another programming language, or vice versa. I don't quite see the utility of this. If they're just trying to convert X number of inputs to Y number of outputs, then it is just a state machine. State machines can be very elegant (but those are usually quite obfuscated) or very inelegant (and usually easier to understand). Computers are very good at predictable behaviour (even without AI). Granted, most correctly written software, excluding AI, can be distilled down to gigantic (predictable) state machines. Lucky for us humans.
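    For the sake of illustration, here is a minimal sketch of what "just a state machine" means: a parity checker expressed as nothing but a transition table, mapping (state, input) pairs to the next state. This toy example is mine, not from the article.

    ```python
    # A program distilled to a pure state machine: even/odd parity of a
    # bit stream, expressed entirely as a transition table.
    transitions = {
        ("even", 0): "even", ("even", 1): "odd",
        ("odd", 0): "odd",   ("odd", 1): "even",
    }

    def run(bits, state="even"):
        """Feed inputs through the table; the final state is the output."""
        for b in bits:
            state = transitions[(state, b)]
        return state

    print(run([1, 0, 1, 1]))  # three 1s seen -> "odd"
    ```

    Anything whose behaviour can be written out this way is predictable by construction, which is the commenter's point about most correctly written software.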

    To extend this thought, this is what FPGA tools already do. Take Verilog / VHDL and turn it into a set of bits that defines a huuuuge state machine that runs in the FPGA logic gates. Again, this has been done.

    Taking examples from human programming for examples of (good) security just seems..... wrong. We (humans) aren't very good at that.

    And lastly, why would computer (AI) generated code, in a language designed by humans to run on a computer, be desirable? Once it's generated, humans are going to review and comment on its correctness, after the AI has already generated it based on learned examples (from potentially billions of input examples, both good and bad)?

    We go from:

    problem -> human -> (programming) language source -> preprocessed -> compiled machine code of choice

    And with AI:

    problem -> AI -> (programming) language source -> preprocessed -> compiled machine code of choice

    Why not just:

    problem -> AI -> compiled machine code?

    We exist to serve our AI overlords.


  9.

    Countdown to ...

    Skynet? We should have a Skynet Clock similar to the Doomsday Clock. I'm actually only half joking.

  10. Anonymous Coward
    Looking at the first two pie charts in the paper, the Python and Wrong Answer pie slices are almost identical.

    Coincidence? I think not.

  11. spireite Silver badge

    This is not new to IBM......

    I've been witnessing their code quality for years; they've just announced it, after 10 years of trusting it ......
