"Surely this settles the case of if AI training crawlers are legal? If there is a robots.txt file and they ignored it, that's unauthorised use."
Not quite - said robots.txt must also explicitly ban AI training crawlers. At that point, I agree that the site has gone far enough to indicate that such crawlers are not welcome, though some kind of in-page NOAI tag, akin to NOINDEX, would probably be helpful too. I would also add something to the footer of every page, explicitly stating that the content is not for use by AI crawlers.
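By way of illustration, a robots.txt along these lines names the best-known AI training crawlers explicitly (GPTBot, CCBot and Google-Extended are documented robots.txt tokens; the list is mine, and it will inevitably need keeping up to date):

```
# Block AI training crawlers; ordinary search bots are untouched
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

As for the in-page tag, there is no actual NOAI standard yet - the nearest thing is a de facto robots meta value a few sites have adopted, which crawlers are entirely free to ignore:

```
<meta name="robots" content="noai, noimageai">
```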
With these measures in place, there is no need for this silly arms race - you have enough tools available to issue cease-and-desist letters and seek damages if your content turns up in training data, and so on. The important point, however, is that all of these measures must be in keeping with the current permissive nature of the net: search engine bots will crawl your site unless you tell them not to, and it's only reasonable to adopt the same approach with AI bots.
If you really must feed them twaddle - and let's face it, much of the human-generated web is, and always has been, twaddle - why not feed them an infinitely long page, generated on the fly? Sooner or later, the thing is going to run out of memory and crash. Assuming the bot is capable of rendering JavaScript, you could even drop in a little infinite loop and let it feed itself infinite twaddle. I'm sure there are plenty of ways to play silly buggers with these things - blocking IP addresses, maybe.
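For what it's worth, here's a minimal sketch of such a tarpit using nothing but Python's standard library - the words, port and handler name are all my own invention - serving a page that never ends, with the JavaScript loop thrown in for good measure:

```python
# A hypothetical tarpit: an endless page served from the standard library.
from http.server import BaseHTTPRequestHandler, HTTPServer
import itertools

WORDS = ["lorem", "ipsum", "twaddle", "piffle", "blather", "drivel"]

class TarpitHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()  # no Content-Length: the body simply never ends
        # A present for any bot that renders JavaScript: an infinite loop.
        self.wfile.write(b"<html><body><script>for(;;);</script>\n")
        try:
            for word in itertools.cycle(WORDS):  # cycles forever
                self.wfile.write((word + " ").encode())
        except (BrokenPipeError, ConnectionResetError):
            pass  # the bot gave up, crashed, or ran out of memory

if __name__ == "__main__":
    HTTPServer(("", 8000), TarpitHandler).serve_forever()
```

Do bear in mind that endless twaddle costs you bandwidth as well, so in practice you'd want to throttle or cap it.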
But I won't be doing any of those things. I don't care if they want to crawl my sites - indeed, they are welcome to do so. The web has always been a place to publish stuff, on the understanding that some people you don't like or approve of are going to read it. Yes, you can get arsey and squeal that they're infringing your copyright in some way (nobody has yet managed to show me exactly how an AI bot does that, with the exception of ingesting pirated material), but the fact is that if you want to control access to your content, you need to implement measures to do so; otherwise, material published on the open web is pretty much fair game.
It's easy enough to do: put it behind a paywall, or even a password-protected directory - no password, no read. But that buggers up your SEO, doesn't it, and those lovely search engine bots - which will also crawl your site and make copies, though nobody seems to object to that - won't be able to get in.
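The password-protected directory, at least on Apache, is a handful of lines of .htaccess (the path and realm name here are illustrative):

```
# No password, no read. Create the credentials file first:
#   htpasswd -c /var/www/.htpasswd someuser
AuthType Basic
AuthName "Members only"
AuthUserFile /var/www/.htpasswd
Require valid-user
```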
Unfortunately, this AI malarkey is here to stay - sometimes it's even useful. Might as well stop wasting effort in an arms race you can't win and get on with your own thing.