Model-specific results?
From the paper: "[a]lthough we aspired to assess widely recognized commercial models such as GPT-3.5/4, Gemini, Claude, etc., conducting experiments involving 12,000 requests per model was deemed financially prohibitive [...]. Consequently, we opted to utilize models hosted by VMware NLP Lab's LLM API."
So the results may not reflect the performance of the models most folks actually use. Indeed, the authors found that "As evidenced in the subsequent sections, certain overarching patterns become apparent; however, they do not universally apply to each model across all prompting strategies. We will explicitly illustrate that there is no straightforward universal prompt snippet that can be added to optimize any given model's performance."
This strongly suggests that there is no underlying rationale behind prompt optimisation, which leads me to conclude that there is no actual (even artificial) intelligence at work here.