Begun, the open source AI wars have

The Open Source Initiative (OSI) and its allies are getting closer to a definition of open source AI. If all goes well, Stefano Maffulli, the OSI's executive director, expects to announce the OSI open source AI definition at All Things Open in late October. But some open source leaders already want nothing to do with it.

  1. Drakon

    > That's all well and good, but Maffulli doesn't feel a purely idealistic approach to the open source AI definition will work because no one will be able to meet the definition.

    So it’s pointless.

    1. doublelayer Silver badge

      It's not pointless. Lots of models have training data available. I have several right here. True, none of them are LLMs. The ones I have are narrower ones that can recognize interesting information from images or model specific actions, but if I want, I can download the training data and the configuration used to train the model, fire up my GPU, and a few short days or weeks later I'll have built my own copy of their model. Releasing the data is easily done as long as you have permission to use it, which may be one of the reasons why some people who want to call themselves open don't want to.

      1. Anonymous Coward
        Anonymous Coward

        Part of the problem is clearly that current models are so crap at extracting key information from data (and then generalizing from it) that they need to gorge on the entirety of all data produced by the whole of humanity over the omnitude of its history before they can spit out even a modicum of barely acceptable output that is not a complete and utter bag of prefabricated bovine excrement. Consequently, if open sourcing a model requires open sourcing its training dataset, then eventually it's everything that we've ever done as a species, since prehistory, that will need to be open sourced.

        Accordingly, it seems to me that FOSS can only apply to AI that is not overwhelmingly littered with junky rubbish, or ridiculous baloney hogwash, or both, but rather to AI that is sensible, and that we just don't yet have today.

        1. doublelayer Silver badge

          That is a major problem for large models, and it makes it harder to open source them. For that matter, it makes it harder to legally make them at all. But this does not change the definitions of open source in the slightest, and there are very close parallels in other open source code. For example, there are lots of databases that have been painstakingly written by companies and that are useful in a variety of programs. One such type is linguistic data. Open sourcing that is really hard because it took a long time to make, so you usually have to license it for very narrow uses under very restrictive license terms. I wouldn't get to claim that my software is entirely open source, "just not those databases you need to run it". At best, I can say that my software is open source but won't work, as well or at all, without these closed source libraries. That is what I would do, and people would deal with that if they wanted to run using that more expensive data. I wouldn't lie and claim to be open source anyway.

          For smaller models, it is easily achievable. I recently dealt with a model which I thought was badly trained and generating bad results. So I obtained the training data, which came to about 20 GB and was permissively licensed, modified some parameters, and retrained in a different way. I can't do that without the training data. That model was actually open source. For the headline models such as the LLMs or those that generate images or video, 20 GB is tiny as training data goes. For my model, which extracted data from images, it was actually quite large. This makes no difference to the definition. If a model is open and has 100 TB of training data, then, provided I bother to get that many disks and that much network bandwidth, I have to be able to obtain that training data and train on it. It makes distributing the data a lot harder, which is another reason they'll probably choose not to do it, but it doesn't change what is required to have an open model.
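          As an illustration, that workflow looks roughly like the sketch below. The file names, config keys, and the tiny model are hypothetical stand-ins, not the actual project described above.

```python
# Hedged sketch: retrain a small image model from its released training data
# plus a tweaked config. All paths and config keys here are made up.
import json
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

cfg = json.load(open("training_config.json"))  # config shipped with the data
cfg["learning_rate"] = 1e-4                    # the "modified some parameters" step

data = datasets.ImageFolder(
    "released_training_data/",
    transform=transforms.Compose([transforms.Resize((64, 64)),
                                  transforms.ToTensor()]))
loader = DataLoader(data, batch_size=cfg["batch_size"], shuffle=True)

# A deliberately tiny classifier; a real project would ship its architecture.
model = nn.Sequential(nn.Flatten(),
                      nn.Linear(64 * 64 * 3, 128), nn.ReLU(),
                      nn.Linear(128, len(data.classes)))
opt = torch.optim.SGD(model.parameters(), lr=cfg["learning_rate"])
loss_fn = nn.CrossEntropyLoss()

for epoch in range(cfg["epochs"]):
    for x, y in loader:
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

torch.save(model.state_dict(), "retrained_model.pt")  # your own copy of the model
```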

          1. Anonymous Coward
            Anonymous Coward

            Good point!

        2. Anonymous Coward
          Anonymous Coward

          > FOSS can only apply ... to AI that is sensible, and that we just don't yet have today.

          Au contraire, as comments above attest, we have plenty of sensible AI today, even if you limit AI to neural nets and their derivatives. Especially models that work on images, with a quite loose interpretation of "image" (e.g. a 2D gravimetric map or just EM waves outside the visible range).

          We may not have "sensible" LLMs, specifically, but that is another matter.

          1. Anonymous Coward
            Anonymous Coward

            Fair enough!

        3. Anonymous Coward
          Anonymous Coward

          > Consequently, if open sourcing a model requires open sourcing its training dataset, then eventually it's everything that we've ever done as a species, since prehistory, that will need to be open sourced.

          Most written works from prehistory to the 20th century are in the public domain. Wikipedia and many other valuable sources are Creative Commons. It's not that bad.

  2. Anonymous Coward
    Anonymous Coward

    > Role of training data: Training data is beneficial but not required for modifying AI systems.

    > This decision reflects the complexities of sharing data, including legal and privacy concerns.

    I think this is crucial for arriving at a workable definition of open source AI.

    First, AI models are generally adapted, retrained, taken apart, etc., to be used to build something else. Almost nobody has the resources or money to retrain them, or any newer model, from scratch on the original data. Downloading and processing the 400B-1.5T token sets used to train the models would be an issue in itself for most users. The training itself is even more of one, as it costs tens of millions of dollars (a reported $63M for GPT-4).

    So, how restricted are AI users by not having access to all the training data? It is not that retraining on the same data will result in an identical model anyway. It never does.

    And those are just the current problems involving scraping the internet. But we want our definitions to be useful for some time to come.

    Instead of collecting trillions of tokens from the internet, which will increasingly contain AI-generated text, future systems might (or will) collect training data while roaming the world. Current systems cannot easily be "updated" with partial retraining, as that leads to catastrophic forgetting, i.e., a collapse of the system. But that will be amended "soon". And a self-driving car that can learn while driving does not have to store every data point it encounters. I think learning on the fly will be the future.
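    To see why naive partial retraining collapses, here is a toy demonstration of catastrophic forgetting on synthetic data (everything below is made up for illustration; no real system is being modelled):

```python
# Toy catastrophic forgetting: sequential training on task B wrecks task A.
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_task(offset):
    # Two Gaussian blobs per task; the offset moves the decision boundary.
    x0 = torch.randn(200, 2) + torch.tensor([offset, 0.0])
    x1 = torch.randn(200, 2) + torch.tensor([offset + 4.0, 0.0])
    x = torch.cat([x0, x1])
    y = torch.cat([torch.zeros(200), torch.ones(200)]).long()
    return x, y

def fit(model, x, y, epochs=200):
    opt = torch.optim.Adam(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

def accuracy(model, x, y):
    return (model(x).argmax(1) == y).float().mean().item()

model = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 2))
xa, ya = make_task(0.0)
xb, yb = make_task(-8.0)  # a conflicting region of the input space

fit(model, xa, ya)
print("task A accuracy after training on A:", accuracy(model, xa, ya))
fit(model, xb, yb)        # no replay of task A data
print("task A accuracy after training on B:", accuracy(model, xa, ya))  # typically collapses
```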

    Transfer learning from one system to another makes it possible to use existing systems to "instruct" new systems without the burden of reusing the original training data. But then the whole point of requiring the training data to be Open Access/Open Source becomes moot.
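    As a sketch of that transfer idea, here is minimal knowledge distillation: a small "student" copies a trained "teacher" from its outputs alone, with no access to the teacher's training data. (The networks and inputs are placeholders for illustration.)

```python
# Minimal knowledge distillation: the student learns from the teacher's
# soft outputs on whatever inputs are to hand, not from the original data.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

teacher = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 3))  # stands in for a pretrained model
student = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 3))

opt = torch.optim.Adam(student.parameters(), lr=1e-3)
T = 2.0  # softmax temperature, a common distillation knob

for step in range(1000):
    x = torch.randn(32, 8)  # any available inputs will do
    with torch.no_grad():
        soft_targets = F.softmax(teacher(x) / T, dim=1)
    loss = F.kl_div(F.log_softmax(student(x) / T, dim=1),
                    soft_targets, reduction="batchmean") * T * T
    opt.zero_grad()
    loss.backward()
    opt.step()
```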

    Such a future might or might not become reality, but any definition should accommodate this and comparable types of AI. I think a layered definition where "data included" is an option, but only one of the options, will be much more useful and future proof than a "Puritanism above all" strategy. Puritanism will get us a definition that is literally useless in practice.

    1. Flocke Kroes Silver badge

      Back when Monty Python first did the Four Yorkshiremen sketch, Moore's law was impressive: modems went from 300/300 to 1200/75 bits per second and could swap direction! Hard disks cost a lot and could store more than a megabyte. CPU clock speeds went from a few MHz to tens of MHz. Cycles per instruction went from 4 down to 1. The bias from my advanced age makes me wonder if prices will continue to drop until training an AI gets within the reach of a hobbyist. On the other hand, Moore's law is getting much harder to maintain.

      1. Paul Herber Silver badge

        Yes, but that was Dennis Moore and there are no lupins here.

        1. Anonymous Coward
          Anonymous Coward

          Extraor

          .

          .

          dinary

    2. Fido

      The expense of training a model is likely to go down, just as a Raspberry Pi is faster and cheaper than the original Cray supercomputer.

      One important security aspect of regular open source software is verifiable binary executables. For example, this makes it possible to check that unexpected changes weren't made to a program while creating the packages in a Linux distribution.

      A typical software engineer does not participate much in verification; however, the ability to make modifications to the source goes hand in hand with the need for verifiable binaries in open source.

      Given the even greater possibility of hiding things in generative AI, a class of verifiable models may be needed before people can safely use open source LLMs. While there are admittedly randomised elements in stochastic gradient descent and other aspects of training, requiring that the original training data be available is an important step towards verifiability.

      Note also that the more expensive it is to train a model, the more important being able to verify the result becomes.
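      The distribution half of that verifiability is cheap today: publish a digest of the released weights so users can at least check that nothing was swapped in transit (file names below are illustrative). Rebuilding from the training data is the harder, reproducible-builds half.

```python
# Verify released model weights against a published SHA-256 digest.
import hashlib

def sha256_of(path, chunk=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

published = open("model.safetensors.sha256").read().split()[0]
actual = sha256_of("model.safetensors")
print("weights verified" if actual == published else "MISMATCH - do not load")
```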

    3. doublelayer Silver badge

      It depends how large a model is. Lots of small models can be retrained from the training data by one person. However, I think that is unimportant. Whether people choose to retrain or not, without the training data they don't have all of the stuff that goes into a model. Trying to call something open when something that crucial is withheld is similar to this argument about open source software:

      Faux-open guy: It's open source.

      Me: I couldn't compile and run it.

      FO guy: But you didn't, did you? You downloaded the binary release and went with that.

      Me: But if I had downloaded the source, it wouldn't have built the entire application, just a couple libraries that connect to the rest of it.

      FO guy: But if you had the rest of the source, it's large and it would take hours to compile, so you don't need that. It's open source.

      No, it's not. The model without training data is not open either. It's just a free as in beer model. I can bolt stuff on to a closed-source free model just as much as I can to one they've called open.

      1. Anonymous Coward
        Anonymous Coward

        > It depends how large a model is.

        First, note that an Open Source AI model can fulfill the four freedoms of Free Software entirely without the training data. Having the training data too is in the spirit of FLOSS, but not necessary to exercise the four freedoms.

        That said, the problems are not with the reasonably sized models. There are quite a number of Open Access data sets for training.

        The real problem is the big models that need half the internet.

        Moore's law, or whatever comes after it, ensures that the next LLM, or whatever DNN model is used, will have as many parameters as can be processed by the largest collection of the latest hardware that money can buy, using all the training data that can be scraped together from every conceivable repository or web crawler.

        Which means that training the models everyone wants to use will be out of reach forever to all but a few of the largest entities.

        And an artifact that cannot be rebuilt from scratch can never be "truly" and fully open source in the purist sense.

        The original Open Source/Free Software definition excluded data. The stated reason was that data was different. Now we can see some of the reasons why the original authors did want to exclude data.

        So I do understand that we might need a layered definition of open source AI.

        1. doublelayer Silver badge

          Whether a model is too large to feasibly open source doesn't change how this works. There is no requirement that a model be open source. I have some large software which is free, but it's not open source. It's still useful. Large models which are distributed under generous terms can be very similar. The problem is that one of the freedoms provided by open source is missing: the freedom to know what is in the program you're running. The FSF refers to this as "freedom to study how the program works", so I disagree that all four freedoms are there. So far, only one seems present. Without the training data, the prompts the model is based on, the ways it was trained, and so on, you have a black box, very similar to the binaries that I'm allowed to use without knowing what's in them.

          In practice, open sourcing a model could be so hard that it's not worth doing. Retraining it to confirm that the binary you have matches the source data, or at least making a binary that does match for you to use later, may be infeasible. Neither of those realities changes what it means to be open source. This should be evident when we compare it to a small model. If a writer of a small model with a gigabyte or two of training data refuses to let me see, let alone modify and retrain, that training data, it is not open source and, as far as I can tell, they might have all sorts of extra unadvertised stuff in there.

          Without this, the freedom to modify is restricted. Yes, I can use other methods to change how the model works. You can argue that some of those count as modifying. Several of them are more akin to building a system around the program: if I call a binary from my program, my program isn't a modification of that software, just a user of it. However, I can't modify everything. My ability has been constrained, and not necessarily by the resources available to me. It has been constrained by the unavailability of the source to this program.

          1. Anonymous Coward
            Anonymous Coward

            > The problem is that one of the freedoms provided by open source is missing: the freedom to know what is in the program you're running. The FSF refers to this as "freedom to study how the program works", so I disagree that all four freedoms are there.

            I do not equate knowing how something works with studying how it works. You generally do not have all the information that went into coding a program, yet the program is still considered fully Open Source.

            The same problem is already apparent in cryptography, where there are open questions about choices made by the coders, or the selection of parameters. Questions which the coders, e.g., those working for the NSA, refuse to answer satisfactorily.

            1. doublelayer Silver badge

              You may not know why everything that is there is there, but you at least know what is there. That is the point of having the source.

              Let's consider a parallel to software. Microsoft Windows is very similar to LLMs in a lot of ways:

              Difficult to actually use the source: If you had the source, it would be hard to use. Building it takes days on a big set of parallel processors.

              It's difficult or impossible to open source: Even if I were named total controller of Microsoft tomorrow, it would be very difficult and expensive for me to open the Windows source code, because there are lots of components they don't own and lots of license snarls to untangle, to say nothing of trying to distribute it to everyone who might want a copy or to provide documentation.

              Modifiable: As CrowdStrike has recently demonstrated, I can write some code and embed it deep into the kernel. That's very similar to how I can make some modifications to an already trained model to better tailor it for my purposes.

              Modifications distributable: Kernel-connected programs can be distributed in source or binary forms and installed by other people without needing to get Microsoft's permission first, though if I don't give them permission, the users have to work harder to install it.

              So does this mean that Windows should be considered open source? The only thing I'm missing here is the ability to see its source code and make modifications that specifically involve changing that source code, and I couldn't feasibly get that code or easily use it if I had it.

              I think the answer should be obvious. All these factors can be very relevant to whether someone considers it worthwhile to open source something. They are not relevant to whether they have done so.

    4. Donchik

      Positive Feedback Anomalies?

      Doesn't having AI learn from other, potentially flawed AI simply generate rubbish?

      GIGO?

    5. Falmari Silver badge
      Devil

      @AC "It is not that retraining on the same data will result in an identical model anyway. It never does."

      I had to do a double take when I read that. To me as a programmer there is something wrong when the results of a program cannot be replicated using the same data.

      Now I am not saying you are wrong, as I have never tried to train a Neural Network*, let alone an LLM, a second time on the same data set.

      * When I was writing Neural Networks in the 90s, they would take days to train depending on the size of the satellite image data. So I was just happy when they did train.

      1. Anonymous Coward
        Anonymous Coward

        Not that AC, but I would guess the reason you can't replicate it is that you would need to train it with the data in exactly the same order as before. The differences are going to be minimal, but there will still be a difference.
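        For what it's worth, here is a sketch of what pinning those sources of nondeterminism looks like in PyTorch; even with all of this, results can still differ across hardware and library versions:

```python
# Pin the seed, the shuffle order, and the kernels for a reproducible run.
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(42)
torch.use_deterministic_algorithms(True)  # errors if an op has no deterministic version
# (on CUDA this may also require the CUBLAS_WORKSPACE_CONFIG env var)

g = torch.Generator()
g.manual_seed(42)  # fixes the DataLoader's shuffle order exactly

ds = TensorDataset(torch.randn(1000, 10), torch.randint(0, 2, (1000,)))
loader = DataLoader(ds, batch_size=32, shuffle=True, generator=g)

for x, y in loader:
    pass  # the training step would go here; data order is now repeatable
```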

    6. Zardoz2293

      It's irrelevant whether you can "afford" to regenerate the model by training. Having access to the raw assets to perform the training is absolutely critical: it provides the ability to do so. Considering the high rate of fraud and fake claims, I'd think everyone would require as much Level 1 Open Source as possible. How many people actually compile their Linux source code? But it is available if you need it.

  3. Rich 2 Silver badge

    OSI

    Personally, I couldn’t give a monkey's what the OSI has to say on anything. Just because they say bla bla is open source (or is not), why should anyone care? They are just a self-appointed group with an inflated opinion of their own importance.

  4. Anonymous Coward
    Anonymous Coward

    MOOT.....interesting word....look it up!!

    Quote: "Training data is beneficial but not required for modifying AI systems."

    Who wrote this?

    Do they not realise that training data "IS REQUIRED for actually USING an AI system"?

    Do they not realise that AI systems like ChatGPT are only useful because of petabytes of "training data"?

    ....and do they not realise that most of these AI systems are trained using petabytes OF OTHER PEOPLE'S PROPRIETARY DATA?

    OPEN SOURCE CODE

    ==================

    Sure....send me the (open source) NVIDIA CODE for some monster AI system.....but it is useless without some indication of how I might train it!

    OPEN SOURCE DATA

    =================

    I will bet that you can't send me links to enough UNENCUMBERED data to even get started!!!!

    So, if I'm right, then this debate about definitions is COMPLETELY MOOT.....in actual practice!

    1. iron

      Re: MOOT.....interesting word....look it up!!

      > Do they not realise that training data "IS REQUIRED for actually USING an AI system"?

      So how much training data did you need to download in order to use ChatGPT?

      I see no signs of intelligence in an LLM but even I know you don't need the training data to use an LLM.

    2. Anonymous Coward
      Anonymous Coward

      Re: MOOT.....interesting word....look it up!!

      > Do they not realise that training data "IS REQUIRED for actually USING an AI system"?

      I can easily collect thousands of articles where people took AI systems and partially retrained them for specific tasks, bolted on other DNNs, removed layers, and customized them as if they were built from Lego bricks.

      All without access to the original training data.

      For a short introduction see:

      https://nexocode.com/blog/posts/customizing-large-language-models-a-comprehensive-guide/
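      A minimal sketch of the pattern those articles describe, freezing a pretrained torchvision backbone and bolting on a new head (the 5-class head and the choice of resnet18 are illustrative, not taken from the linked guide):

```python
# Fine-tune a pretrained network for a new task without its training data:
# freeze the pretrained body, train only a freshly attached head.
import torch
import torch.nn as nn
from torchvision import models

base = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for p in base.parameters():
    p.requires_grad = False  # freeze everything that was pretrained

base.fc = nn.Linear(base.fc.in_features, 5)  # new 5-class head, trained from scratch

# Only the new head's parameters go to the optimizer.
trainable = [p for p in base.parameters() if p.requires_grad]
opt = torch.optim.Adam(trainable, lr=1e-3)
```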

  5. may_i Silver badge

    The OSI are making themselves look stupid in public

    To start with, the people at OSI should be more than aware that no artificial intelligence exists and stop calling LLMs AI.

    Open Source means that you can have a copy of everything used to create the final system and this, of course, includes the data. Therein lies the problem. As a previous commentard pointed out, the training data for an LLM is of such a size that you will not be downloading it. Feeding and training the model is a job requiring lots and lots of compute and storage capacity. Nobody is going to be doing this at home except for very geeky millionaires.

    I don't know what the OSI thinks it will achieve by trying to craft some slippery new definition of "Open Source" for LLMs. It looks to me like they have been owned and are trying to weaken the concept of Open Source from the inside.

    1. Anonymous Coward
      Anonymous Coward

      Re: The OSI are making themselves look stupid in public

      > To start with, the people at OSI should be more than aware that no artificial intelligence exists and stop calling LLMs AI.

      The OSI claim to have invented the phrase "Open Source" (no, they didn't) and then to provide *THE* definition of it (no, just *a* definition of it), and many, many people (like TFA's author) believe and repeat that.

      So, now the OSI will define what "AI" means - and if they say it means an LLM then that is what it *will* mean; it won't be long until we "learn" that the OSI invented the term in the first place.

  6. Will Godfrey Silver badge
    Unhappy

    It's a duck!

    If it looks like a duck;

    Walks like a duck;

    Quacks like a duck;

    See title. Whatever else it could be, it's certainly not FOSS.

    1. Anonymous Coward
      Anonymous Coward

      it's certainly not FOSS.

      Come off it. What company in their right mind would reveal their deepest business secrets? ChatGPT was open source and we absolutely ripped it apart and rebuilt it in our image.

      Give me your work from today for free. Thanks.

      Even if you re-train, there is no guarantee your probability rates will be better. It might give you the answer you want, but that is not always the one needed.

      Liquid Neural Networks offer a solution. There is nothing to open source and very little training is needed: only the output readout layer needs it, as it 'watches' the LNN (a high-dimensional representation of the temporal patterns in the input data).
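      A rough sketch of that "train only the readout" idea, using a tiny echo state network in NumPy as a stand-in for liquid/reservoir-style nets (illustrative only, not an actual Liquid Neural Network implementation):

```python
# Echo state network: fixed random recurrent reservoir, trained linear readout.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_res, T = 1, 100, 500

W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
W = rng.uniform(-0.5, 0.5, (n_res, n_res))
W *= 0.9 / abs(np.linalg.eigvals(W)).max()  # keep spectral radius below 1

u = np.sin(np.linspace(0, 20, T)).reshape(-1, 1)  # toy input signal
target = np.roll(u, -1)                           # task: predict the next value

states = np.zeros((T, n_res))
x = np.zeros(n_res)
for t in range(T):
    x = np.tanh(W_in @ u[t] + W @ x)  # the reservoir itself is never trained
    states[t] = x

# The readout is the only trained component: one ridge regression.
ridge = 1e-6
W_out = np.linalg.solve(states.T @ states + ridge * np.eye(n_res),
                        states.T @ target)
pred = states @ W_out
```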

      It will be free for all. Because who wants billions of moneys!

      1. Michael Strorm Silver badge

        Re: it's certainly not FOSS.

        > "What company in their right mind would reveal their deepest business secrets"

        So, what's your point? Whether or not one agrees with that principle, it doesn't contradict OP's assertion that the contrived definition proposed quacks like a very non-open-source duck.

        > "Give me your work from today for free. Thanks."

        That's ironic, considering that all of today's well-known LLMs have been almost entirely trained upon massive amounts of other people's work without remuneration or permission.

        1. Anonymous Coward
          Anonymous Coward

          Re: it's certainly not FOSS.

          >> "Give me your work from today for free. Thanks."

          > That's ironic, considering that all of today's well-known LLMs have been almost entirely trained upon massive amounts of other people's work without remuneration or permission.

          I think that's the best bit about it. For the old-timers: ROFL

          >> "What company in their right mind would reveal their deepest business secrets"

          > So, what's your point? Whether or not one agrees with that principle, it doesn't contradict OP's assertion that the contrived definition proposed quacks like a very non-open-source duck.

          Indeed. But the point still stands. OpenAI stopped that as soon as they realised. Like, Chain-of-Thought was the easiest reverse-engineering tool I've ever used. Now it is a shadow.

      2. Anonymous Coward
        Anonymous Coward

        Re: it's certainly not FOSS.

        onethis

  7. Groo The Wanderer

    You know, if we have much hope for long-term survival as a race, it might be an idea to stop thinking of everything as being a "war"...

  8. Bebu
    Windows

    "contains so many weasel words that you can start a zoo..."

    A zoo full of only weasels?

    Ah. I guess Tarakiyee had the US Legislatures in mind.

    Or actually any legislature I imagine.

    I suppose there is no point in asking ChatGPT-n for a definition. :)

    I recall that OSI once meant Open Systems Interconnection, which in its heyday also had some spectacular bun fights.

  9. Anonymous Coward
    Anonymous Coward

    It is in theory possible to extrapolate the whole of creation—every Galaxy, every sun, every planet, their orbits, their composition, and their economic and social history from, say, one small piece of fairy cake.

    The Guide

    1. Ken Hagan Gold badge

      Only if you know more about the cake than is permitted by the Uncertainty Principle.

      1. jake Silver badge

        THE CAKE IS A LIE!

        1. Inventor of the Marmite Laser Silver badge
  10. Anonymous Coward
    Anonymous Coward

    "purely idealistic approach"

    "Maffulli doesn't feel a purely idealistic approach to the open source AI definition will work because no one will be able to meet the definition."

    Understand the logic, I do not.

    1. "Open Source" := <precise definition> [e.g. checkability]

    2. No (current) "AI" meets <precise definition>

    3. Therefore <precise definition> must be modified (extended, softened, etc) so that (some of) "AI" fits in.

    Is this how it goes?

    If so, Why?

  11. Ian Johnston Silver badge

    Easy test. If holy wars are being fought over a dozen different licences, it's open source.

  12. imanidiot Silver badge

    Full permission or bust

    IMHO, any "AI" (this doesn't exist) or LLM or image-generating algorithm basically needs to be trained to actually do anything. The code that does the model training or accesses the training data is mostly trivial, and useless without that training/trained dataset. Unless the dataset can be proven to ONLY contain content which has been released under a license permitting use in such a dataset, IMHO, it can't be open source to begin with.

  13. Grunchy Silver badge

    Hack

    Nobody will ever know if an AI has been secretly manipulated to preferentially do something by some malicious actor sabotaging the training data.

    Because the data is too big to retain and to audit!

    1. Anonymous Coward
      Anonymous Coward

      Re: Hack

      I know, 'cause it's me

  14. Groo The Wanderer

    As I expected, the OSI is pretty much bang-on as to what "levels" of openness there can be with AI, but I fail to understand why the same concept of openness "levelling" doesn't seem to be used for software. Either a project is open by the definition of the concept, or it is not. It can't be "sometimes open" or "mostly open."

    This isn't The Princess Bride.

  15. Fonant

    Once the general public realise that LLM "AI" is just bullshit-generation, the whole thing will disappear like NFTs and blockchains. The OSI definition is an interesting logical exercise, but will be of no value to posterity.

    1. jake Silver badge

      "The OSI definition is an interesting logical exercise, but will be of no value."

      FTFY

  16. Sam Johnston

    Access to training data is enough to protect the four essential freedoms

    Nobody is still seriously asking for a "purely idealistic approach", but the OSI keeps trotting out this strawman. We have settled on a compromise that allows both openly licensed and publicly accessible data, but rejects proprietary data available for a fee (e.g., NYT articles, Adobe stock photos) and inaccessible data (e.g., the Facebook social graph), as these prevent you from studying or modifying the model while exposing you to legal, security, and other risks.

    In any case, by outsourcing the question, the OSI admits it is not competent to answer it: https://samjohnston.org/2024/10/15/the-osi-lacks-competence-to-define-open-source-ai/
