70% correctly answered!
"When we let LLaMa-3-8B answer the quizzes it generated, it usually answered three or four of the five questions correctly when allowed access to Google results – which isn't half bad but is, well, cheating. "
I suppose LLaMa-3-8B only being able to correctly answer 70%* of the questions it set isn't half bad; it's just 30% bad. Still, only 70% answered correctly is certainly bad, and it may even be terrible.
LLaMa-3-8B creates one correct choice and three incorrect choices as answers to each question it generates. If LLaMa-3-8B picks what it generated as the correct choice and is only marked wrong because that generated choice was itself wrong, that's just bad. But if LLaMa-3-8B is wrong because it picked one of the three incorrect choices, that's terrible, because a model that can't give the same answer to the same question is useless.
LLaMa-3-8B creates a question, its answer, and three statements that are not the answer. Then, when LLaMa-3-8B is asked that same question, it fails to choose the same answer. That's terrible, not because 30% are wrong, but because it gives different answers to the same question. How can you trust a model that gives different answers to the same question when the data it was built on has not changed?
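To make the complaint concrete, here's a minimal sketch of the self-consistency check I'm describing. The ask_model() helper is hypothetical (swap in whatever inference call you actually use, e.g. llama.cpp or ollama); none of this is from the article itself.

    import json
    import random

    def ask_model(prompt: str) -> str:
        """Hypothetical helper: send `prompt` to LLaMa-3-8B and return its text reply."""
        raise NotImplementedError

    def generate_quiz_item(topic: str) -> dict:
        """Ask the model for one question with 1 correct and 3 incorrect choices, as JSON."""
        prompt = (
            f"Write one multiple-choice question about {topic}. "
            'Reply as JSON: {"question": ..., "correct": ..., "incorrect": [..., ..., ...]}'
        )
        return json.loads(ask_model(prompt))

    def self_consistency_check(item: dict) -> bool:
        """Re-ask the model its own question with shuffled choices and see whether
        it picks the answer it originally generated as correct."""
        choices = [item["correct"]] + item["incorrect"]
        random.shuffle(choices)
        letters = "ABCD"
        listing = "\n".join(f"{letters[i]}. {c}" for i, c in enumerate(choices))
        prompt = (
            f"{item['question']}\n{listing}\n"
            "Answer with the single letter of the best choice."
        )
        reply = ask_model(prompt).strip().upper()[:1]
        picked = choices[letters.index(reply)] if reply in letters else None
        return picked == item["correct"]

Run that over the five generated questions and the article's result is a self-consistency rate of roughly 60-80%, which is exactly the problem: the model disagrees with itself, not with some external source.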
* 3 or 4 correct answers out of 5; I took the middle ground, 70%.