How many training sets are in common use that haven't already been poisoned? There seems to be a new article every week or so on how easy it is to screw with ML training data, and an awful lot of people rely on a small number of shared public datasets because actually gathering your own data is both difficult and often illegal. What are the chances that no TLAs or criminals (but I repeat myself) have put two and two together? Even if it's difficult to poison a dataset in advance without knowing the precise result you're going to want in the future, you'd have to assume they've done it, if only as practice for future operations. Any dataset lacking a full chain of custody for every element has to be assumed to be riddled with backdoors. Even if those backdoors aren't implemented in a way that's useful for a malicious actor, how can you trust any results when any random query could hit some hidden trigger?
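For what it's worth, here is a rough sketch of the weakest possible version of "chain of custody for every element": record a SHA-256 digest per element at a point you trust, then treat anything that is missing from that record or no longer matches it as suspect. The element names, byte strings, and function names below are all made up for illustration. Note the obvious limit: a hash only tells you the data hasn't changed since the manifest was written, not that it was clean when it was written.

```python
import hashlib


def digest(record: bytes) -> str:
    """Return the SHA-256 hex digest of one dataset element."""
    return hashlib.sha256(record).hexdigest()


def build_manifest(records: dict) -> dict:
    """Map each element name to its digest, taken at a point you trust."""
    return {name: digest(data) for name, data in records.items()}


def untrusted_elements(records: dict, manifest: dict) -> list:
    """List elements that are absent from the manifest or whose bytes changed."""
    suspect = [
        name
        for name, data in records.items()
        if manifest.get(name) != digest(data)
    ]
    return sorted(suspect)


# Hypothetical data: what you originally vetted...
trusted = {"img_001": b"cat pixels", "img_002": b"dog pixels"}
manifest = build_manifest(trusted)

# ...versus what you later downloaded from a shared public mirror.
received = {
    "img_001": b"cat pixels",            # unchanged
    "img_002": b"dog pixels + trigger",  # modified after vetting
    "img_003": b"no provenance",         # never in the manifest at all
}

print(untrusted_elements(received, manifest))  # ['img_002', 'img_003']
```

This does nothing about elements that were already poisoned before anyone recorded a digest, which is the harder half of the problem.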