Re: The limitations
The introduction has all the meat, really. To summarize: the authors saw that OpenAI had done a fairly realistic study (SWE-Lancer), taking *actual* real-world jobs from Upwork/Freelancer and trying to get LLMs to solve them, and that the LLMs didn't do very well: "the top-performing model in OpenAI's study solved only about 26% of the independent coding tasks and 45% of the management tasks".
So they decided they'd do a much worse study which would give the LLMs bigger numbers. I am not making this up.
"In this paper, we propose a new evaluation approach that draws inspiration from SWE-Lancer but emphasizes automation and repeatability. Instead of using actual freelance projects that require human evaluation, we leverage a publicly available dataset of freelance job postings to generate synthetic coding tasks with ground-truth solutions. In particular, we use the Freelancer.com dataset by Oresanya et al. (2022), which contains ~9,193 job postings in the data analysis and software domain. We filter and process these job descriptions to create well-defined problem statements (e.g., data processing tasks, scripting challenges, algorithm implementations) that an LLM can attempt to solve. Crucially, for each task we provide a set of test cases (input-output pairs or assertions) so that a solution's correctness can be validated automatically, without human intervention"
So they took real freelance jobs and massaged them into AI benchmarks, then said "look! the AIs did quite well!"
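To make the gap concrete, here's roughly what "a set of test cases (input-output pairs or assertions)" ends up meaning in practice. This is a hypothetical sketch (the task, function name, and grader are invented, not taken from the paper's harness), but it shows how a script gets to declare a "freelance job" solved with no human in the loop:

```python
# A hypothetical sketch of the kind of "automatically validated" task the quoted
# pipeline produces. The task, function name, and grader below are invented for
# illustration; they are not taken from the paper.

# A posting like "need a script to clean up duplicate customer records" gets
# boiled down to a tidy, self-contained spec:
TASK_PROMPT = (
    "Write a Python function `dedupe(records)` that takes a list of dicts and "
    "returns them with duplicates (same 'id') removed, keeping the first occurrence."
)

# Ground-truth test cases: plain input-output pairs, so a script rather than a
# human client decides whether the work is "done".
TEST_CASES = [
    ([{"id": 1}, {"id": 1}, {"id": 2}], [{"id": 1}, {"id": 2}]),
    ([], []),
    ([{"id": 3}], [{"id": 3}]),
]


def grade(solution_code: str) -> bool:
    """Exec the model's code and check every input-output pair."""
    namespace: dict = {}
    try:
        exec(solution_code, namespace)  # no sandboxing shown; illustration only
        fn = namespace["dedupe"]
        return all(fn(inp) == expected for inp, expected in TEST_CASES)
    except Exception:
        return False


# Usage: grade(llm_generated_code) returning True counts as a "solved" freelance job.
```

Passing a handful of asserts on a ten-line function is not the same thing as satisfying a paying client, which is rather the point.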
Good lord.