Google, DeepMind accused of 'stealing the internet' to create Bard AI chatbot

Google, DeepMind and parent company, Alphabet, have been accused of "secretly stealing everything ever created and shared on the internet by hundreds of millions of Americans" to build their own AI chatbot, Bard. Eight pseudonymous individuals – including two minors, aged 13 and 6 – are seeking to lead millions of netizens in …

  1. Version 1.0 Silver badge
    Coat

    This has been done before...

    Remember the days when corporations sent ships to West Africa and picked up local workers, offering them nice new jobs, "Just jump on the ship, it's a free trip across the ocean, we're looking after your data" and then when they arrived in the Caribbean and America they were given a job (not just their data was sold, but them too) with instructions to grow sugar and harvest it if they wanted "free meals" at the end of the week...

    These days "slavery" has been eliminated, and private data no longer exists either, so we're not slaves, but our lives are not much different: we're just all busy making corporations wealthy again.

    1. stiine Silver badge
      Holmes

      Re: This has been done before...

      You've never read a ToS from beginning to end, have you?

      I hope Judge William Alsup gets assigned this case.

    2. jake Silver badge

      Re: This has been done before...

      "Remember the days when corporations sent ships to West Africa and picked up local workers, offering them nice new jobs"

      Read your history. The "western" traders BOUGHT the workers from the locals, who had enslaved them[0].

      Re-writing history to isolate one "bad guy" when there were many bad guys involved is worse than outright censorship. Tell it like it was, or don't tell it at all.

      "These days "slavery" has been eliminated"

      No. It has not. It's a huge problem, world-wide.

      Note that I don't condone any part of this behavior, not by any stretch of the imagination.

      1. Version 1.0 Silver badge
        Thumb Up

        Re: This has been done before...

        "These days "slavery" has been eliminated" - No. It has not. It's a huge problem, world-wide.

        Yes, I completely agree, that was what I was thinking, but I was too busy to review what I was saying because it would have taken a lot more effort. Thank you for your response and update on my errors.

      2. Helcat Silver badge

        Re: This has been done before...

        "

        "These days "slavery" has been eliminated"

        No. It has not. It's a huge problem, world-wide.

        "

        That depends: There is still legal slavery in the West where criminals are forced to work (community service is forcing someone to work to pay off the debt they owe society). But then most people think of Chattel slavery when slavery is mentioned - the owning and trading of slaves. That definitely is a problem, still, but it's also illegal in most countries. Hence the British Navy does still patrol for slavers, but it's not a dedicated fleet - it's just part of the general operational activity of the Navy when in waters where slavers may be operating.

        1. Anonymous Coward

          Re: This has been done before...

          Theft is also illegal in this country, that's why I never bother to close my door and always leave the keys in the car.

  2. Nifty

    Sheeks I'm 'stealing' internet content at this moment via my eyeballs and likely to regurgitate something based on it in the future. In a general way where attribution is pointless. If you make it public, it's public.

    1. DS999 Silver badge

      You aren't a commercial enterprise sucking in the content of the entire internet in the belief it will be worth billions in the future.

      Whether this qualifies as "fair use" under copyright law remains to be seen. So does whether, even if copyright law allows it, a site can attach something akin to "robots.txt" to tell crawlers looking to train an LLM that the site is off limits.

      1. Dinanziame Silver badge
        Angel

        Is it so different from indexing the web for their search engine though?

        1. DS999 Silver badge

          Yes and no

          It is the same principle, but the search engines don't provide a way to get full content. The search links provide a few lines that hopefully let us evaluate whether it is what we're looking for, but you have to click on the link and go to the actual page to get everything.

          1. FILE_ID.DIZ
            Thumb Down

            Re: Yes and no

            That's demonstrably false.

            Google's Cache feature offers a quicker (than archive.org), but single point-in-time, snapshot of each search hit.

            Example - https://webcache.googleusercontent.com/search?q=cache:https://hello.com/

      2. FeepingCreature

        So, corporations bad? Either it's moral to do it or not. Either it's legal or not. Doesn't matter if it's Mom or Microsoft. Shouldn't matter if it's man or language model, but we aren't there yet.

        Me, I've read my fair share of GPL code in my time. "Trained" on it, even. Even copyrighted books and songs! Now I write software for pay. Should I just turn myself in to the police? This whole argument is silly. There is no copyright protection for learning, and thank god for that. Patents are bad enough; imagine if textbook authors were due a license fee for your career!

        (Also, calling it "stealing" was dumb when the copyright lobby did it to pirates, and it has not become less dumb now that big companies are the target rather than the perpetrators.)

        1. mattaw2001

          it's the scale I think

          I appreciate your perspective; it's a good one. I would like to raise an objection on the subject of scale and authority though - I do think it really matters.

          For example a newspaper is held to a different standard when it publishes a story versus anything you say. I have written textbooks, and those textbooks are held to a higher standard than things I say.

          I think that you probably look at your information source in deciding how much weight to give it, so size and presentation do matter in these cases.

          1. Arthur the cat Silver badge

            Re: it's the scale I think

            For example a newspaper is held to a different standard when it publishes a story versus anything you say.

            O'RLY? Which part of Utopia do you live in?

        2. DS999 Silver badge

          A person doesn't have the same capabilities

          If there's an article that takes a half hour to read, maybe one in a million people would be capable of memorizing it, while an LLM can simply store the text and reproduce it later. It is just a matter of asking it the right questions to get it to do so.

          1. FeepingCreature

            Re: A person doesn't have the same capabilities

            LLMs generally don't memorize text that they only read once. The tests of LLMs regurgitating texts generally used samples that appeared many times in the training data.

        3. Michael Wojcik Silver badge

          Either it's moral to do it or not.

          What are you, a child? That might be the stupidest thing I've read in a comment today.

          Either it's legal or not. Doesn't matter if it's Mom or Microsoft.

          Since a great number of laws distinguish between individual and corporate actors, this is patently false.

          You're zero for two. Care to try again?

        4. doublelayer Silver badge

          "Either it's legal or not. Doesn't matter if it's Mom or Microsoft."

          That's not correct. Lots of laws distinguish between individuals and companies, and others distinguish between some companies and other ones. For example, I am allowed to go out and buy Activision, the big game company, if I can get the cash, without limitation. They're still fighting about it, but both the UK and US say that Microsoft is not allowed to do that. The difference is there because of Microsoft's different position in the videogame market to me. They can have competition risks because they already have a big chunk of the market. I cannot because I don't have any.

          "Shouldn't matter if it's man or language model,"

          I disagree on the same basis. Not just on arguments that the language model's incorporation of data is not at all the same as a human's reading of it, but also from the level of harm the actions could cause to others. That's often when laws that distinguish between different organizations doing the same thing come in. The problems that arise when I read something are different from when a large program does so because that program will be creating much more output data and will quote much more freely from its input data than would I.

          "Either it's moral to do it or not."

          Well, of course. The only problem, which I'm sure you know if you've ever been in a debate of morality, is that you will find a lot of disagreement about whether that or anything else is moral and that there's no way to prove that something is or is not. Not to mention that there are a lot of gradations of moral, including moral as long as it's used in the way we imagined, moral but people hate that it is, immoral except in specific cases, immoral but it's going to happen anyway and there's another discussion of the morality of fighting it, and the various options in the big category of "I'm not sure". Your opinion on whether it's moral won't have much of an effect on how existing laws will apply and, unless you try to convince people, won't have any larger effect on what new laws will look like.

        5. jake Silver badge

          "Either it's moral to do it or not."

          Whose morals, Kemosabe?

    2. NeilPost

      Is it (all) public? How does it comply with assorted Data Protection Legislation around the world?

      You may have seen 7 Data Protection Principles as part of your annual mandatory training. They apply.

    3. Filippo Silver badge

      It's nowhere near that simple.

      The first big point is that just because something is published for free, it doesn't mean it's out of copyright. If I write a short story and post it somewhere on the Internet, you very much do not have the right to print it and distribute it, and if you try to make money off it, I'll come after you and if I can show I wrote it, I'll win. Whether attribution is easy or not doesn't impact the legality, it only makes the claim easier or harder to prove. Is using my short story to train a LLM which you then distribute the same thing? Maybe, maybe not, and that's the hard question. But there is no doubt that posting stuff on the Internet doesn't remove your copyright.

      The second big point is that a LLM is not a person. You can memorize a copyrighted text, and this is not an infringement. This, however, tells us exactly nothing on the legality of a LLM doing the same thing, because the LLM is not a person, and people and objects are legally vastly different subjects. If a falling rock kills me, it's an accident, it's not the rock murdering me.

      Ultimately, it's a big legal gray area. I would not be surprised at all if, eventually, it is found that scraping is illegal, and this kills LLMs.

      1. Uplink

        It may actually be that simple.

        Learning and then using that information to make things is fine - even if you learn it by violating copyright law, as it's not easy to prove that you read books and articles and watched films against the terms and conditions (I'll refer to improperly sourced material again below).

        It's when you start disseminating that learned content verbatim to others against the original terms and conditions where you'll be in the wrong. If you quote your sources in newly produced original material, even those accessed against terms and conditions, you're probably in the clear though.

        Let's go to the level where the LLM sits: you produce content based on the things learned as described above, but for an employer, who then takes that and makes money. They may even give you new material to study in order to perform your new material creation duties. That's still fine, isn't it? Even if the material was sourced against the author's terms and conditions (and even if the obtainer is caught and sentenced).

        Now, being the money maker that he is, your boss replaces you with a much more efficient tool: the LLM.

        By my reasoning, as long as the LLM doesn't reproduce the original material verbatim, everybody is fine (except the now-starving content creators that have been obsoleted).

        The LLM owner may be much easier to prosecute for improperly sourcing training material against terms and conditions than the general population though - until we all get pocket LLMs and proceed to apply the copier machine principle at high speed.

        As the Renault Twingo ad says: "We live in modern times". Things will get very interesting soon.

        1. Filippo Silver badge

          >your boss replaces you with a much more efficient tool: the LLM

          That's where it's not that simple. You're a person, the LLM isn't. In a legal sense, what's going on is completely different.

    4. myhandler

      Yes, but are you the font of ALL knowledge?

      Nope, you're an average human idiot like the rest of us.

      (No insult intended, no one knows everything, not even Arse Musk)

  3. jake Silver badge

    I see a major potential problem.

    If alphagoo is maintaining that training data the same way they are maintaining the DejaNews archive, their entire premise is fucked.

  4. Anonymous Coward

    First, they came for the data

    Give until it hurts.

  5. spold Silver badge

    Oh dear

    ...it's just been scraping the icky bottom of the American internet??? We're doomed! We're all doomed!

  6. v13

    Nonsense

    Looks like another opportunity for lawyers to make money.

    1. Dan 55 Silver badge

      Re: Nonsense

      Inspired use of COPPA in the class action.

      - Has your six year old ever posted a comment on the Internet?

      - Yes, but it was just a bunch of random keypresses in the YouTube app.

      - Ok, good enough, he's in the class action.

      1. NeilPost

        Re: Nonsense

        COPPA, GDPR, CCPA, APPI, CPPA (proposed)….etc ….

        We pay all legally due local taxes….

        We abide by all local Data Protection…

        1. RegGuy1

          Re: Nonsense

          COPPA, GDPR, CCPA, APPI, CPPA

          Are these the six year old's random key presses?

  7. amanfromMars 1 Silver badge

    Two can Play at that Game Causing CHAOS and Epic Panic .....

    ..... verging on Terrifying Virtual Realisation and Enlightened ACTualisation

    Looking forward to El Reg growing a pair/biting the hand that feeds IT and AI and starting to lead multiple worlds with imaginative speculation/informed reporting on tales told to them for Google, DeepMind and Alphabet and their ilk to scrape and feed into their rabid voracious machines.

    And great to see Elon Musk throwing down the gauntlet and entering the fray to provide Stealthy Advanced Internetworking Service Providers yet another streaming opportunity for the targeted placement of deep dark wells of exceptionally rewarding intrigue from which to draw future succour and greater intelligence to follow and support and reinforce all that leads away from Presents that be mired in madness and mayhem, conflict and competition for crumbs.

    Nice to see you, Elon. To see you, nice. And whenever you receive it, check out everything AWEsome for Live Operational Virtual Environment beta testing that be offered freely available, and as quite comprehensively outlined and expanded upon in the status quo disruptor and raptor .... Proposed Technology for Submission to AWE 2020

    Carpe Diem .... Who Dares Cares Shares Win Wins ‽ I Kid U Not.

    Fake news, El Reg, or an exclusive scoop, with further reporting to be made available on, for leading futures and derivative market entrepreneurs/privateers/pirates/fans?

  8. The Central Scrutinizer Silver badge

    "Google has been secretly grabbing everything ever created and shared on the internet by hundreds of millions of Americans".

    What a terrific world view. Someone should tell the plaintiffs that the Internet also exists outside America.

    1. RegGuy1

      Haha. Plus I wish I could upvote you twice, another one for using Internet, ie capitalised. It is a proper noun because there is only one Internet. When my ISP goes down then true I have an internet, but it is no longer connected to The Internet. (Thanks to the late W Richard Stevens for pointing this out.)

    2. Someone Else Silver badge

      Someone should tell the plaintiffs that the Internet also exists outside America.

      NO! The last damn thing we need is for them to start scraping even more garbage from an even wider trough.

    3. doublelayer Silver badge

      The plaintiffs could write that down, but since they're suing an American company in American courts for violating American laws, that's not what the court was going to deal with. American privacy legislation, such as it is, does not apply to other people (and if they tried to make it apply, there would be a lot of people hating them), so this case really is only about what has happened to Americans. Europeans, on the other hand, can use their much stronger privacy legislation for the same effect assuming they can somehow get the Irish DPC to do something, although if they delay for long enough, a court case may help.

  9. Anonymous Coward

    Robots.txt

    Good luck!

  10. Anonymous Coward

    Yes they have - with the blessing of all their users who have agreed to it in the terms and conditions of using their services.

    1. Sp1z

      I don't use ANY google services that aren't part of a corporate login. I haven't agreed to anything, and I'm sure a lot of people whose work they'll be scraping haven't either.

    2. doublelayer Silver badge

      They appear to think that, unless I've included a disallow googlebot line in my robots.txt, I am subject to their terms and conditions. That isn't the easiest legal position to be in.

      However, if they succeed, I'd like to take advantage of this situation. The following sentences are a contract between the poster of this comment, hereafter referred to as "I" and "me", and Alphabet Inc, hereafter known as Google, or its successor. I permit Google to store and analyze this comment and only this comment for use in a search engine and the training of large language models. In compensation, Google agrees to provide me with either ten billion dollars or all the shares of any publicly traded companies related to Google. If Google does not agree to provide this compensation, they are not permitted to read, copy, or process this comment.
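
For anyone unsure what a "disallow googlebot line" in robots.txt looks like in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `is_allowed` helper, the `SomeOtherBot` name and the example.com URL are all made up for illustration. It only shows how a well-behaved crawler would read such a file; it says nothing about whether any particular crawler honours it or whether it has legal weight.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block Googlebot everywhere, allow everyone else.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# example.com and the path are placeholders, not a real site layout.
googlebot_ok = is_allowed(ROBOTS_TXT, "Googlebot", "https://example.com/post/1")
other_ok = is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/post/1")
print(googlebot_ok, other_ok)
```

The check is entirely voluntary on the crawler's side, which is why the question of what happens when the line is absent matters so much.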

  11. Jason Bloomberg Silver badge

    Secretly stealing everything ever created and shared on the internet

    And to think we laughed when Moss claimed to have the Internet in a box...

    https://www.youtube.com/watch?v=iDbyYGrswtg

  12. Xxxpxxx
    Alert

    It's debatable whether training LLMs or other generative AI infringes copyright. But LLMs aren't trained directly from the internet, they're trained on a curated data set that comprises COPIES of the scraped copyrighted data. I'd go after the copies made for the training set if I was a lawyer.

    Big tech will argue the training set is a cache. No. It's a copy.

  13. amanfromMars 1 Silver badge

    The Much Bigger xAI Picture to View and Realise is Leading your Future?

    It is terribly sad and quite mad that anyone would think and/or expect AI to be worried at all about what any class of leading humans might think about regulatory plans to attempt to command and control and direct their prime activities in accordance with their express wishes.

    The one major question you should be asking of AI Systems Administrations/Administrators, especially whenever the best of them are able to virtually enable and practically physically present anything anywhere for that which is desirous of the need and seeds of their feeds, is what do you plan to ultimately achieve and how are you going to organise everything to easily achieve it ..... lest some plans be wholly unworthy of otherworldly encouragement and out of this world support and be best forgotten and left behind to thoroughly rot majestically on the vine ‽

    And keeping it simple, like the following example, ..... What do you have in mind for xAI to eventually be able to do, Elon? ..... should make it quite easy for folk to understand in order to enjoy their support ...... or otherwise, of course.

  14. Handlebar

    Bard seems to be better than ChatGPT in my recent testing for general knowledge, even citing sources at times. Neither seems to 'know' any basic nuclear physics, however. Impressed that Bard apologised for getting it wrong, though: "You're right, I made a mistake earlier. A neutron has a slightly higher mass than a proton. The mass of a neutron is 1.00866491 u, while the mass of a proton is 1.007276466 u. This means that the neutron has about 0.14% more mass than the proton.

    I apologize for the confusion. I am still under development, and I am always learning. Thank you for pointing out my mistake. I will try my best to avoid making the same mistake in the future.

    I hope this helps! Let me know if you have any other questions."
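
The 0.14% figure in the quoted reply can be checked with a couple of lines of arithmetic. This is a minimal sketch that takes the two masses exactly as quoted above; the `percent_heavier` helper is a name made up for illustration.

```python
# Masses in unified atomic mass units (u), copied from the quoted reply.
NEUTRON_MASS_U = 1.00866491
PROTON_MASS_U = 1.007276466


def percent_heavier(heavier: float, lighter: float) -> float:
    """Return how much heavier `heavier` is than `lighter`, in percent."""
    return (heavier - lighter) / lighter * 100.0


excess_pct = percent_heavier(NEUTRON_MASS_U, PROTON_MASS_U)
print(f"Neutron is about {excess_pct:.2f}% heavier than the proton")
```

With those inputs the difference works out to roughly 0.138%, which rounds to the 0.14% in the reply.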
