Re: plainly stores all of its training data
Right. The models are overfitting, but only in a very small fraction of cases. That's still a large absolute number of cases, though, and I agree it raises significant copyright and ethical issues.
Eliminating duplicates and similar measures seem like band-aids to me. Adding some random noise to each image in the training data would probably help more. You could even keep duplicates but give each copy different noise, or distort images with a more sophisticated mechanism: use a different model (e.g. the previous generation) to combine source images into a corpus of derived images, then train the final model on that corpus.
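The noisy-duplicates idea could look something like this minimal sketch: expand each training image into several variants, each perturbed with independent Gaussian noise. The function name, `sigma` value, and number of copies are all illustrative assumptions, not anything from an actual training pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_variants(image: np.ndarray, copies: int = 3, sigma: float = 0.05) -> list:
    """Return several noisy copies of one training image.

    Assumes pixel values in [0, 1]; `sigma` is the noise scale relative
    to that range. Values here are illustrative, not tuned.
    """
    variants = []
    for _ in range(copies):
        noise = rng.normal(0.0, sigma, size=image.shape)
        variants.append(np.clip(image + noise, 0.0, 1.0))
    return variants

# Example: one 8x8 grayscale "image" expanded into three distinct variants.
img = rng.random((8, 8))
variants = noisy_variants(img)
```

Because every copy gets independent noise, exact-duplicate pixels never recur across the expanded set, which should make verbatim memorization of any single source image harder.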
But the more pressing need is the legal and ethical work: reaching consensus between creators and IP owners on the one hand, and model builders on the other, about which images can be used for training in the first place, under what restrictions, and with what compensation.