* Posts by MayYouLiveInInterestingTimes

1 publicly visible post • joined 6 Nov 2022

How GitHub Copilot could steer Microsoft into a copyright storm

Nature of copyright, big future consequences, readers rooting for their tribe, not thinking long term

I know readers here don't like MS, but this case really isn't in the spirit of the Microsoft you all so dislike.

So to review a few scattered points and issues:

- In general it's considered fair use to train machine learning models on publicly available data.

Models do not typically memorize; they might memorize a thing here and there, but the goal of training is generalization.

Yes, you will sometimes get memorization of pieces of data that are incredibly common in the dataset.

If people have made thousands of pieces of fan art of some popular character, an image model will learn to draw that character well.

If most people write some few-line utility the same way, your code model will write it the same way too (a crude check for that kind of verbatim overlap is sketched at the end of this point).

Google won a landmark case some years ago (the Google Books case) which concluded that even mass scanning and indexing of copyrighted books was fair use!
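
To make the memorization point a bit more concrete, here is a minimal sketch of the kind of check people run: look for long verbatim overlaps between a model's output and a reference corpus. The function and the toy strings here are illustrative assumptions, not anything Copilot or GitHub actually ships.

```python
# Crude memorization check: does a long word-level chunk of the generated
# text appear verbatim in the reference corpus? Long matches suggest the
# model reproduced very common training data rather than generalizing.

def longest_verbatim_overlap(generated: str, corpus: str, min_words: int = 5) -> int:
    """Length (in words) of the longest chunk of `generated` found verbatim in
    `corpus`; 0 if no chunk of at least `min_words` words matches."""
    words = generated.split()
    for size in range(len(words), min_words - 1, -1):  # try the longest chunks first
        for start in range(len(words) - size + 1):
            chunk = " ".join(words[start:start + size])
            if chunk in corpus:
                return size
    return 0

# Toy example: a very common one-liner comes back verbatim.
corpus = "def add(a, b): return a + b  # utility seen thousands of times"
print(longest_verbatim_overlap("def add(a, b): return a + b", corpus))  # -> 7
```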

- For image generation models, the current legal consensus appears to be that the output is not copyrightable, as the author is not a human!

If this consensus carries over to code, the legal status of AI-generated code may end up being one that companies like MS won't like.

In fact, GitHub hosts plenty of Microsoft's own leaked code, re-uploaded by users who don't really care about copyright, so many code models likely have some knowledge of proprietary internals, not just code under GPL, BSD, or other open licenses.

Most likely, a model seeing even code like that is as much fair use as you seeing it without copying it verbatim; some people do deliberately avoid looking at certain code because of potential legal issues, but of course it's a big gray area.

Maybe a fun future would be one where everyone just gives up on copyright as a flawed concept and chooses to write code that is simply uncopyrightable, a sort of pseudo public domain. If MS starts using these tools for their own code, which they already are, a lot of their internal code might not be copyrightable, or at least sit in a very gray area!

- Some of you are upset because Copilot is closed source and paid while being trained on open source code. Who cares? There are open source models trained on open source data, and you can use those just fine.

The problem with using these, especially the better ones, is the hardware needed to run them. You need a lot of VRAM, and Nvidia and the other manufacturers keep the high-VRAM cards for enterprise use, at severely inflated prices (there's a rough sizing sketch at the end of this point).

If you want the freedom to run any of these things (and there will be far more interesting things than code models in the future), the average person needs much more powerful hardware of their own, not just cloud access.

The problem might be made even worse by the current US-China tensions: the US is blocking a large part of the world (China) from making use of these AI chips (and GPUs), though this may not matter unless you live there.

However, this is almost a war on general computation, and one we should fight to keep the freedom to run whatever we want on our own machines. There are many players, big and small, who are very bothered that the public will at some point be able to run the more advanced future AI models, and who would do almost anything to stop it.
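
On the VRAM point above, here is a rough back-of-the-envelope sizing sketch, assuming typical bytes-per-weight figures; the parameter counts are illustrative, and real inference needs extra memory for activations and the KV cache on top of the weights.

```python
# Rough VRAM needed just to hold a model's weights at different precisions.
# Ballpark numbers only, not measurements of any specific model.

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

def weight_vram_gb(params_billion: float, precision: str = "fp16") -> float:
    """GB of VRAM for the weights alone at the given precision."""
    return params_billion * 1e9 * BYTES_PER_PARAM[precision] / 1024**3

for size in (7, 13, 70):
    print(f"{size}B params: ~{weight_vram_gb(size, 'fp16'):.0f} GB at fp16, "
          f"~{weight_vram_gb(size, 'int4'):.0f} GB at int4")
# A 70B model at fp16 (~130 GB) is far beyond any consumer card,
# which is exactly the problem with the better open models.
```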

- From my personal viewpoint, I don't see an issue with this: you learn by reading other people's code, and that's fine. That's how your brain works: you compress the information into concepts and use them as you see fit.

Big machine learning models do a similar thing; they don't just memorize data unless that data repeats often or is very common (or you train in a way that causes such repetition).

You may claim that the way ML models learn needs far too much data (for LLMs, bigger models actually learn faster per example, probably because they can better leverage what they have already learned and have more room to represent the higher dimensionality of the data), but even then the general idea still holds.

Let's say this case goes south and MS loses (I expect them to win, much as Google won that landmark Google Books case). That would make copyright far stronger than it ever was, and whatever applies to the machine would eventually have to apply to the human too; after all, the learning processes will keep getting closer and closer.

Yes, you may dislike MS for how hard they push their strong copyright views, but in this case they are the ones arguing against copyright! As someone with a strong dislike of copyright and IP laws, I know where I stand!

Them losing the case would also set even more dangerous precedents. In the future we will end up training models that approach human ability in dimensions that now seem weak, and the way forward is obvious (even if maybe not to many readers here): we will approach, or even reach, human generality.

Ignoring alarmist/catastrophist thinking about what happens when we reach that point, consider the idealized case where you have what is essentially a person, or something with a thinking process close to a person's, that is not allowed to learn from human works and probably not allowed to copyright its own output either (the latter I don't mind, as I don't believe humans should be able to "copyright" anything either).

That becomes a very human-chauvinist perspective. I know a lot of quiet neo-Luddites would applaud it and hope that cases like this would stop progress by reducing financial incentives, but in the very long term we would be giving up the ability to automate science and research itself, and thus doom humanity to much slower technological progress than would be achievable if we built truly autonomous agents capable of doing such work.

This may seem like a small thing now, but precedents could get set that change history in bad ways. I do hope they end up going in the right direction; here MS is actually fighting against its own usual drive for stronger copyright, and that will probably help human technological progress by leaps and bounds.