Victim blaming
"Part of this is down to the restrictions in robots.txt and the ToS not lining up. 34.9 percent of the top training websites make it clear in the ToS that crawling isn't allowed, but fail to mirror that in robots.txt"
A classic case of victim blaming. For one, robots.txt cannot mirror the ToS anyway: the former is a technical document and the latter a legal one, designed to express very different things that are only marginally related to each other. Also, the robots.txt format offers no way to address a crawler you don't yet know by name (the * wildcard exists, but it matches every crawler, including the search engines you may well want to keep), it is not legally binding, and even if it were, it could be easily circumvented by simply renaming the user agent in question.
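To make that last point concrete, here is a minimal sketch of why blocking by user agent is toothless. Compliance with robots.txt lives entirely in the client; the URL and the user-agent strings below are illustrative, not real crawler tokens:

    # A "polite" crawler identifies itself and can be blocked by name;
    # an impolite one just sends a different User-Agent header, and the
    # same robots.txt rule never matches it again.
    import urllib.request

    def fetch(url: str, user_agent: str) -> bytes:
        req = urllib.request.Request(url, headers={"User-Agent": user_agent})
        with urllib.request.urlopen(req) as resp:
            return resp.read()

    # Blocked by a "User-agent: ExampleAIBot" rule in robots.txt...
    html = fetch("https://example.com/article", "ExampleAIBot/1.0")
    # ...but the very same crawler, one string later, is not.
    html = fetch("https://example.com/article", "Mozilla/5.0 (compatible)")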
And finally, robots.txt and copyright law have completely opposing defaults. The robots.txt spec essentially says that if the file is missing, or no explicit disallow directive matches, you are free to crawl whatever URL you want. Copyright law, on the other hand, says that unless there is explicit permission, you are not allowed to copy or use any copyrighted content. So of course robots.txt, which is neither law nor a contract, cannot possibly be used to override copyright law (or the ToS), yet AI companies act as if it could, and as if the lack of explicit disallow directives gave them permission to use any content for training.
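The default-allow side of this is easy to demonstrate with Python's standard-library robots.txt parser. GPTBot is OpenAI's published crawler token; BrandNewAIBot is made up to stand in for any crawler the site's author has never heard of:

    # Anything not explicitly disallowed is allowed, including for
    # crawlers that didn't exist when the file was written.
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.parse([
        "User-agent: GPTBot",   # one known AI crawler is blocked...
        "Disallow: /",
    ])

    # ...but an unknown crawler matches no rule and passes by default.
    print(rp.can_fetch("GPTBot", "https://example.com/article"))         # False
    print(rp.can_fetch("BrandNewAIBot", "https://example.com/article"))  # True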
Also, large internet corporations such as Google and Facebook are actively working to make robots.txt useless for blocking AI crawlers even at the theoretical/technical level, by simply using the very same user agent for AI crawling that they use for collecting content for other purposes. Facebook, for example, uses the very same user agent ("facebookexternalhit") for obvious AI crawling that it has been, and still is, using to build the link previews in shares. In the end this means you cannot tell them not to train their AI models on your data while still getting meaningful link previews when your pages are shared.
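Expressed in robots.txt terms, and even assuming the crawler honoured the file at all, the only lever a site operator has here is all-or-nothing:

    # There is no directive that distinguishes "build a link preview"
    # from "harvest training data": one token controls both.
    User-agent: facebookexternalhit
    Disallow: /    # keeps the content out, but link previews break too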
Similarly, Google has Google-Extended, which supposedly exists so you can prevent them from using your data for AI training. But that only applies to training, for example of their Bard model; it doesn't affect AI Overviews, which are shown based on whatever data their regular Googlebot has retrieved.
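So the most a publisher can express is a sketch like the following, which still leaves their content flowing into AI Overviews via ordinary Search crawling:

    # Opts the site out of training Google's AI models...
    User-agent: Google-Extended
    Disallow: /

    # ...but Googlebot keeps crawling for Search, and what it fetches
    # can still surface in AI Overviews. Blocking Googlebot itself
    # would remove the site from Search results entirely.
    User-agent: Googlebot
    Allow: /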