I don't like saying I told you so, but ...
I frequently rail against the performance claims made in AI papers: if the ground truth contains a percentage of errors, any AI system trained on it is likely to end up with a similar actual error rate. I have seen people claim an increase in performance from 97.6% to 98.1% (error bars not included) on data sets where there are two ground truths, drawn up by two medics, which are at odds with each other. In our own earlier work, we managed to reach a sort of Pareto optimum of 92.5 ± 0.6% against both ground truths, but were in places penalised for finding blood vessels the doctors had missed. It turns out that, somehow, ground truth 1 has been elevated to The Ground Truth and the other demoted to "a human observer". And now AIs are better than the poor "human observer" simply because they have been taught to copy all the mistakes the other human has made.
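For what it is worth, a back-of-the-envelope check on that kind of comparison: the sketch below computes 95% Wilson score intervals for the two reported accuracies, assuming a purely hypothetical test set of 2,000 images (the papers rarely say how big theirs is).

```python
import math

def wilson_interval(accuracy, n, z=1.96):
    """95% Wilson score interval for a proportion (here: test-set accuracy)."""
    centre = (accuracy + z * z / (2 * n)) / (1 + z * z / n)
    half = (z / (1 + z * z / n)) * math.sqrt(
        accuracy * (1 - accuracy) / n + z * z / (4 * n * n)
    )
    return centre - half, centre + half

n = 2000  # hypothetical test-set size; not stated in the papers I have in mind
for acc in (0.976, 0.981):
    lo, hi = wilson_interval(acc, n)
    print(f"{acc:.1%}: 95% CI [{lo:.1%}, {hi:.1%}]")
```

At that size the two intervals overlap almost completely (roughly 96.8–98.2% versus 97.4–98.6%), so the half-point "improvement" is well within the noise.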
If ImageNet contains up to 6% label error, I will continue to take all claims of 99% or better performance with a considerable pinch of salt. Furthermore, if error bars are not included, how can anyone claim to be better than an earlier method when the differences are sub-1%?
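To put a number on the label-noise point, here is a minimal simulation, assuming binary labels and annotation errors that are independent of the model's own mistakes (both of which are simplifications): it estimates the agreement a model can achieve against a ground truth with a 6% error rate.

```python
import random

random.seed(0)

def measured_accuracy(true_acc, label_noise, n=200_000):
    """Agreement with a noisy ground truth, assuming binary labels and
    annotation errors independent of the model's own mistakes."""
    agree = 0
    for _ in range(n):
        truth = random.random() < 0.5
        model = truth if random.random() < true_acc else (not truth)
        label = truth if random.random() >= label_noise else (not truth)
        agree += (model == label)
    return agree / n

# A model that is actually perfect, scored against 6%-noisy labels: ~0.94
print(measured_accuracy(true_acc=1.0, label_noise=0.06))
# A genuinely excellent (99% correct) model fares no better: ~0.93
print(measured_accuracy(true_acc=0.99, label_noise=0.06))
```

Under those assumptions, even a perfect model tops out at about 94% measured "accuracy"; anything reported above that on a 6%-noisy benchmark suggests the model has learnt to reproduce the annotation errors rather than overcome them.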
I am not saying deep learning and CNNs are useless; it is just that sloppy science does them a disservice.