Metrics
Glad to see you mention metrics. Measuring performance in any complex AI task is something of a challenge, and one that tends to come wrapped in a powerful SEP (Somebody Else's Problem) field: the few in the research community who notice there's an issue come across as far too maverick to get funding.
Many years ago I did postdoc research in the complex AI task of computer speech recognition. I was one of the few who looked at how we and every other research team were measuring (and publishing) performance, at our hopelessly meaningless use of concepts like "accuracy", and who tried to suggest more meaningful metrics. The basic SEP everyone ignored: a system that could perform an easy task (like distinguishing the digits 'zero' through 'nine'[1]) with a very mediocre 95% accuracy was rated better than one that achieved, say, 75% on a more challenging task like transcribing natural-language dictation[1], let alone a stunningly impressive 25% at following threads in a cocktail-party conversation[1].
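To make that concrete, here's a back-of-the-envelope sketch in Python. The accuracy figures are the ones above; the chance-level baselines are my own illustrative assumptions, not numbers from any real study. The point it demonstrates: even normalizing each score against its chance baseline (Cohen's-kappa style) still ranks the easy task on top.

```python
# Compare raw accuracy with chance-corrected accuracy across tasks of
# very different difficulty. Chance baselines below are assumed for
# illustration only.

tasks = [
    # (task, reported accuracy, assumed chance-level baseline)
    ("spoken digits (10 classes)",      0.95, 0.10),
    ("natural-language dictation",      0.75, 0.001),  # per-word, large vocabulary
    ("cocktail-party thread following", 0.25, 0.0),    # effectively no chance baseline
]

for name, acc, chance in tasks:
    # kappa = (observed - chance) / (1 - chance): the fraction of the
    # gap between random guessing and perfection the system closes.
    kappa = (acc - chance) / (1 - chance)
    print(f"{name:34s}  accuracy={acc:.0%}  chance-corrected={kappa:.1%}")

# Both columns still put the digit recognizer first, even though the
# 25% result is by far the more impressive feat. No single number
# encodes task difficulty -- which is the whole problem.
```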
Lesson: take all reports of how such systems perform with more than a pinch of salt. Figures like "percent accurate" need more context than you'll ever be given before they mean anything.
[1] These tasks are not really representative of what I'm talking about, but to go into detail would be serious levels of TMI. I guess that's a variant of the same problem the journos face when reporting on facial recognition.