Agent God
Let's just suppose, for argument's sake, that the "OSWorld Benchmark" is useful and representative. (I'm not yet convinced, given the paucity of information presented here and what I could find online with a minute of searching, but just suppose.) The article states humans have a score of 72.36 percent, and we must assume that is for an employee of an organization who has been working there for at least a few months and knows the ropes (otherwise it is meaningless). An obvious way to be useful would be to have the "agent" check the human's work and suggest/discuss the most likely changes to move the score up to 72.36 + delta. Humans and AI have different strengths, and the AI should be massively cheaper. Yet that approach is not even mentioned. Oh, I see: assistant bad, agent good.