"It would be an immense amount of work to try and curate a data set that was free of security vulnerabilities."
I'll save you the trouble. No such data set exists.
We've seen vulnerabilities in code at every level, all the way up to the Linux kernel.
Instead of searching for the Holy Grail, how about curating the data set to indicate the licenses of the code the model was trained on?