it was collected from entries submitted to two programming contests
I find the number of entries to these contests rather staggering, but this is the ultimate form of design by committee. If entries are judged only on whether they produce the correct output for a given input, the system is going to label no end of bad, misguided and downright risky techniques and solutions as "good".
And I don't want that anywhere near my software, whether it's the stuff I write or the stuff I run.
I'm still sceptical about many of the efforts in this area. To me, good programming is much more about clean design than simply coding to produce expected output, and good work is much better identified by someone who appreciates this. At the very least I'd hope they trained it exclusively on examples which skilled people agree are good and demonstrate best practices.