Yep, the numbers are horribly bad
It generates a million random answers that are then cut down to a single one, and that one is wrong 57% of the time even in an artificial programming contest setting. Repeated ten times over, it might cough up something correct, but who decides which of those ten attempts is the correct one? Humans.
This proves Gemini does NOT have the reasoning skills people are trying to attribute to it. It still sucks extremely hard at generating answers; there's no reasoning or understanding there, just randomness. The improvement is in the step that filters out the one answer most likely to fool a human into believing it is correct. And even then a human has to do the actual work in the end: run the model several times and pick out the one answer that is, by sheer luck and the law of large numbers, actually correct.
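
Just to be concrete about what that "filtering" step actually does, here's a rough Python sketch of the general sample-then-filter shape of these systems. All the names and checks below are made up for illustration; this is not Gemini's actual code, just the pattern I'm describing:

```python
import random
from collections import Counter

def generate_candidate(problem, rng):
    """Stand-in for one model sample; in reality each sample is a full program."""
    # Hypothetical placeholder: just returns a random "solution" label.
    return f"solution_{rng.randint(0, 9999)}"

def passes_public_tests(candidate, public_tests):
    """Stand-in for running the candidate against the few example tests."""
    # Hypothetical check; a real pipeline would execute the code.
    return hash(candidate) % 7 == 0

def sample_and_filter(problem, public_tests, n_samples=1_000_000, k=10, seed=0):
    rng = random.Random(seed)
    # Step 1: generate a huge pile of independent samples.
    candidates = (generate_candidate(problem, rng) for _ in range(n_samples))
    # Step 2: throw away everything that fails the example tests.
    survivors = [c for c in candidates if passes_public_tests(c, public_tests)]
    # Step 3: group identical/similar answers and keep the k most common.
    # Nothing here verifies correctness on the hidden tests; it only picks
    # the answers that look most plausible.
    return [c for c, _ in Counter(survivors).most_common(k)]
```

Note that step 3 never checks whether an answer is right; it just keeps whatever agrees with itself the most. Something outside the model, the contest's hidden tests or a human, still has to judge which of the k survivors, if any, is actually correct.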