
This is why I feel tool calling is a gimmick. We'll eventually figure out that LLMs are only really *part* of a brain, not the whole brain, and will have to be integrated into other systems that can make decisions on their behalf. I imagine that as datasets of actual actions/behaviors start to form and grow, we'll start seeing models purpose-built for comprehending tasks, rather than over-engineered text prediction algorithms.
I'm working on an assistant right now as a hobby project, and while I'm actively trying to shove as much use of a (local) LLM into it as seems reasonable, I'm honestly kind of struggling to find good use cases for it. Highly language-centric tasks, sure, but most of the LLM use is just classifying text or summarizing large amounts of structured information. In other words, feeding input into a mechanical system and summarizing its output. No tool calling whatsoever; the LLM never makes a decision beyond classification.

It's honestly kind of depressing. I was never under the illusion that LLMs were anything other than stochastic parrots, but it's consistently frustrating trying to prompt them to do anything reliably, and sometimes their bullshitting is more consistent than the actual functionality I want from them. It feels like what it actually is: misusing technology. These "instruction-tuned" models can't follow instructions; without fine-tuning you're essentially applying linguistic duct tape to your project. It's universal and wraps around anything, but it isn't going to hold together very well. I'm still going forward with my project anyway, though, since it'll still work most of the time and it's just for fun.
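For what it's worth, the LLM-as-glue pattern I'm describing boils down to something like this. It's a rough sketch, not my actual code: `query_local_llm` stands in for whatever local inference call you use (llama.cpp server, Ollama, whatever), and the intents and handlers are invented for illustration.

```python
def query_local_llm(prompt: str) -> str:
    """Placeholder: wire this to your local inference backend of choice."""
    raise NotImplementedError("replace with a call to your local model")


INTENTS = ["weather", "timer", "unknown"]


def classify_intent(user_text: str) -> str:
    # LLM job #1: pure text classification, forced back into a fixed label set.
    prompt = (
        "Classify the request as exactly one of: "
        + ", ".join(INTENTS)
        + f".\nRequest: {user_text}\nLabel:"
    )
    label = query_local_llm(prompt).strip().lower()
    return label if label in INTENTS else "unknown"


def handle_intent(intent: str, user_text: str) -> dict:
    # The "mechanical system": deterministic code makes every actual decision.
    if intent == "weather":
        return {"intent": intent, "forecast": "12C, light rain"}
    if intent == "timer":
        return {"intent": intent, "status": "timer set", "duration_s": 300}
    return {"intent": intent, "status": "no handler"}


def summarize(result: dict) -> str:
    # LLM job #2: turn the structured output back into a sentence for the user.
    prompt = f"Summarize this result in one short sentence for the user: {result}"
    return query_local_llm(prompt)


def assistant_turn(user_text: str) -> str:
    # The LLM bookends the pipeline; it never decides what actually happens.
    intent = classify_intent(user_text)
    result = handle_intent(intent, user_text)
    return summarize(result)
```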