Not having a canonical copy of the dataset is just Bad Science
From just the abstract of the paper linked to by the article under discussion
> "Data poisoning has become more and more interesting," Fabian said, pointing to recent research
>> Our first attack, split-view poisoning, exploits the mutable nature of internet content to ensure a dataset annotator's initial view of the dataset differs from the view downloaded by subsequent clients. By exploiting specific invalid trust assumptions, we show how we could have poisoned 0.01% of the LAION-400M or COYO-700M datasets for just $60 USD.
So they don't actually have a dataset carefully stored away? Instead, they are just reading a page from the web, saving annotations about what that page says *today*, and are then horrified to learn that tomorrow it may say something totally different! And somehow this fundamental feature of the web is now a Bad Thing and indicative of naughty people deliberately poisoning their precious dataset!
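To be fair, the obvious low-overhead fix here doesn't require archiving every page: record a cryptographic hash of the content at annotation time, and have downstream downloaders reject anything that no longer matches. A minimal sketch in Python (the dataset entry, URL, and field names are hypothetical, not the actual LAION/COYO schema):

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Digest of the raw downloaded bytes, for comparison against a pinned hash."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical URL-plus-annotation record, extended with the digest
# the annotator computed when they first fetched the content.
entry = {
    "url": "https://example.com/cat.jpg",  # placeholder, not a real dataset URL
    "caption": "a cat sitting on a mat",
    "sha256": sha256_hex(b"original image bytes"),
}

def verify(entry: dict, downloaded: bytes) -> bool:
    """Reject the sample if today's content differs from the annotated view."""
    return sha256_hex(downloaded) == entry["sha256"]

print(verify(entry, b"original image bytes"))  # unchanged content -> True
print(verify(entry, b"poisoned image bytes"))  # swapped content -> False
```

This closes the split-view gap without hosting a mirror: a later downloader can't silently receive different bytes than the annotator saw, only a verifiable mismatch it can drop.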
Heck, even the worst of the "scribble it down, don't edit, just chuck it onto the blog, never update it again" pages can be changed day to day by the content of the comments at the bottom.
Do you think the last sentence of the abstract means they told the dataset collectors to save a copy of the page before bothering to annotate it:
>> In light of both attacks, we notify the maintainers of each affected dataset and recommended several low-overhead defenses.
Ah, no, they wanted "low overhead," so just doing the science properly (as in, with repeatability) is probably going to be ignored. And no chance at all that they'd ever think to help fund the Internet Archive (or even host a mirror!) and annotate only unchanging content from there!
Oh, it would be so good, when I read the whole paper (soon, not tonight), to find that the above is all just misplaced cynicism.