Researchers poison stolen data to make AI systems return wrong results

Researchers affiliated with universities in China and Singapore have devised a technique to make stolen knowledge graph data useless if incorporated into a GraphRAG AI system without consent. Large language models (LLMs) base their predictions on training data and cannot respond effectively to queries about other data. The AI …

  1. SVD_NL Silver badge

    "Oh no, my LLM can't use this treasure trove of stolen data!"

    So, this method basically adds a bunch of junk data to real data and makes the LLM more likely to choose junk data when it queries without an encryption key?

    I don't see how this actually protects against IP theft, unless the only IP you're trying to protect is the knowledge graph itself, not the underlying data, as you should be able to extract that using other means. I'm sure there are cases where this has some real-world applicability, but I feel like most companies wouldn't be happy about the plaintext data being stolen, even if it is slightly obfuscated.
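
    If that reading is right, a toy sketch of the idea might look like the following. To be clear, this is entirely my own guess at a mechanism, not the researchers' actual scheme: the key, the HMAC tagging, and the `tag`/`retrieve` helpers are all invented for illustration. Genuine triples carry a keyed MAC, decoys carry random tags, so only a key-holder can filter out the junk before retrieval.

    ```python
    import hmac, hashlib, random

    KEY = b"owner-secret"  # hypothetical key held by the data owner

    def tag(triple, key=KEY):
        """Keyed MAC over a triple; only a key-holder can recompute it."""
        return hmac.new(key, "|".join(triple).encode(), hashlib.sha256).hexdigest()

    # Genuine knowledge-graph triples, stored alongside their MACs
    real = [("aspirin", "treats", "headache"),
            ("paris", "capital_of", "france")]
    graph = [(t, tag(t)) for t in real]

    # Poison: plausible-looking decoys carrying random, unverifiable tags
    decoys = [("aspirin", "treats", "insomnia"),
              ("lyon", "capital_of", "france")]
    graph += [(t, "%064x" % random.getrandbits(256)) for t in decoys]
    random.shuffle(graph)

    def retrieve(graph, key=None):
        """Without the key, real and decoy triples look identical."""
        if key is None:
            return [t for t, _ in graph]  # junk mixed in
        return [t for t, m in graph if hmac.compare_digest(m, tag(t, key))]
    ```

    The only point of the sketch is that a shared secret can make the poison separable for the rightful owner, while a thief sees one undifferentiated graph.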

    1. Jedit Silver badge
      Headmaster

      "I don't see how this actually protects against IP theft"

      Technically it doesn't. However, the thieves aren't the direct target. The junk data makes the output of the LLMs (even) less reliable. If the user base of an LLM - the buyers of the stolen goods - know that the output can't be relied on, then in theory they'll stop using it. That's where the thieves lose their money.

      1. bob, mon!

        Re: "I don't see how this actually protects against IP theft"

        "If the user base of an LLM - the buyers of the stolen goods - know that the output can't be relied on, then in theory they'll stop using it. "

        Really? And how will the buyers know that they shouldn't rely on it? Most people accept computer output as authoritative --- "Garbage In, Gospel Out" was true 40 years ago. Being able to get a response from a "natural language" prompt only adds a veneer of credibility.

    2. kmorwath

      Re: "Oh no, my LLM can't use this treasure trove of stolen data!"

      This idea does protect AI companies against the theft of their precioussssss data, but it doesn't protect the original IP owners from their data being stolen.

      Now, if you protect your own models built on your own data that's fine.... but of course that can be used (and will be used, if it works) to protect models built on someone else's data.

  2. that one in the corner Silver badge

    The threat model here assumes that...

    Any comparisons available between the likelihood of those assumptions being met and the cost/complexity of this method, given the admitted holes in it?

    But on the bright side, assuming (!)* that these knowledge graphs are comparable to the knowledge graphs from the 1980s, nice to see that people are catching up after 60 years.

    * and if that assumption is wrong, back to the classic issue of idiots redefining words simply so that they can no longer understand and learn from research that has already been done.

  3. Anonymous Coward
    Anonymous Coward

    Dates and elapsed times

    Last time I checked, the 1980s were 40 years ago rather than 60!

    1. that one in the corner Silver badge

      Re: Dates and elapsed times

      Yeah, typos be we today. Or bad editing, take your pick.

      I was originally going to talk about the earlier representations that were discussed in the 1960s but then changed to the 80s 'cos that period was coming up more often in relation to the specific term "knowledge graph", as opposed to structures that do much the same job but weren't referred to as such. The 80s settled on the term because "knowledge engineering" was all the rage back then.

      But then I forgot to adjust the other "60" down to "40": bad editing.

      But on the bright side, I *do* how to tap on "Reply"...

      1. TheBadja

        Re: Dates and elapsed times

        I just assumed your AI couldn’t do maths.

      2. Anonymous Coward
        Anonymous Coward

        Re: Dates and elapsed times

        That last sentence no verb, incidentally.

        1. djnapkin

          Re: Dates and elapsed times

          > That last sentence no verb, incidentally.

          I assumed that was some new cool phraseology i hadn't yet encountered

      3. JohnSheeran
        Trollface

        Re: Dates and elapsed times

        *know how

    2. IanRS

      Re: Dates and elapsed times

      No, the 90s were 20 years ago, the 70s were 30 years ago. How old am I now? No, I can't be!

  4. Steve Davies 3 Silver badge
    Pint

    Oh what a great idea

    We need more people to poison these so-called AI systems' data. The more unreliable data they have to work with, the better IMHO.

    Hey virus/malware writers... Wanna get on our good sides? Pollute these AI datasets wherever you can. If you do, you can have a virtual one of these on me.

    1. Sudosu Silver badge

      Re: Oh what a great idea

      This is just starting to take off in the music industry as they are aggressively hoovering up everyone's music to make sloppy copies.

      https://www.youtube.com/watch?v=xMYm2d9bmEA

      Takes a lot of GPU but it seems to work.

    2. LogicGate Silver badge

      Re: Oh what a great idea

      “with all governments everywhere tightening down on everything wherever they can, with their computers and their Public Eyes and ninety-nine other sorts of surveillance, there is a moral obligation on each free person to fight back wherever possible—keep underground railways open, keep shades drawn, give misinformation to computers. Computers are literal-minded and stupid; electronic records aren’t really records…so it is good to be alert to opportunities to foul up the system.”

      ― Robert A. Heinlein, Friday - Published in 1982

      Not everything he wrote aged well, but still, the man often saw where things were going.

  5. Blackjack Silver badge
    Trollface

    Just add invisible text with random text copied from X: that's not poison, that's nuclear waste!

  6. Claude Yeller

    It's encryption, Jim, but not as we know it

    Hiding information behind a key is simply encryption. Maybe bad encryption, but still encryption.

    And we know what happens to encrypted information and secret knowledge. It gets lost.

    At some time, a key gets lost or the bits corrupted and gone is the knowledge.

    Personally, I only want to get involved with open access data/Creative Commons.

  7. Claptrap314 Silver badge

    Criminal behavior --> criminal org

    I really don't understand how this theft of IP is being tolerated.

    1. sarusa Silver badge
      Devil

      Re: Criminal behavior --> criminal org

      Laws are of the rich, by the rich, for the rich. In the US it's big evil corporations. Once the big evil content producers link up with big evil AI corps, like Disney is doing with OpenAI, then this sort of content poisoning will be considered industrial terrorism.

      In Mainland China, AI is a pillar of the Chinese Communist Party's plan for world dominance and everything (by law) belongs to the CCP so there's no actual theft. I suspect if this poisoning takes off, it will be crushed, or the CCP will just demand you hand over all the keys as they do with other things. So govt approved AI will get all the keys - I suspect that's why there even ARE keys in this case.

    2. Claude Yeller

      Re: theft?

      "I really don't understand how this theft of IP is being tolerated."

      Copying ideas and words is not the same as stealing apples. Sharing knowledge is multiplying knowledge.

      Copying ideas is what every child and adult does for a living. IP is this insanity that wants to abolish freedom of speech. It criminalizes singing and speaking. It forbids you to help and teach your neighbor.

      Every word and idea you write was copied from someone else. Someone who also copied it from someone else again.

      IP is erecting toll booths for rent collection.

      1. kmorwath

        Re: theft?

        Copying without thinking will just hinder and destroy knowledge, not multiply it. People **build** on previous ideas, they don't only copy. Those who merely hoard are just freetards.

        Nobody has been sued for singing the latest hit in the shower - but if you copy it for money (even if just to avoid paying) you're stealing - write your own song, or pay for the rights.

        People MUST also be FREE to decide what to share and how. Otherwise you're removing THEIR freedom - remember, your freedom ENDS where someone else's freedom begins.

        1. Claude Yeller

          Re: theft?

          "Nobody has been sued for singing the latest hit under the shower - if you copy it for money (even becaquse you don't have to pay) you're stealing - write your own song, or pay the rights."

          Singing happy birthday at a birthday party was a public performance of a copyrighted work of art. It earned some company big money until the copyright was voided in 2016.

          There is a big industry going after YouTube posters that capture some songs. Documentaries have a difficult time as any piece of music or art that happens to get recorded in public places leads to financial problems.

          Copyrighted works are everywhere and avoiding them is impossible. Every melody or song you sing or whistle can lead you into trouble as Happy Birthday showed.

          People are silenced worldwide with bogus trademark suits, as Apple tried to do with a German cafe "Apfelkind" (search for it).

          1. kmorwath

            Re: theft?

            From your own link:

            "the company began charging fees for its commercial use, collecting around $2 million annually. This meant that any film, TV show, stage production, or even restaurant that used the song for **commercial** purposes had to pay for the privilege." (bold mine)

            Again, "commercial use" - people making money from its use. Now you can be very greedy and go after even restaurants...

            "The case revealed that, while Warner/Chappell did hold certain copyright registrations, there was no clear evidence that the company actually owned the rights to the lyrics"

            So they were fraudsters, not rightful copyright owners.

            Apple tried to copyright round corners, but they are a cult, not even a company.

  8. BasicReality Bronze badge

    I've tried ChatGPT & Gemini, researchers don't need to poison the data.

  9. ecofeco Silver badge
    FAIL

    We had to destroy the village...

    ... to save it.

    Dear god. From the ideal of clean data to deliberate sabotage.

    We are so screwed.

    1. JoeCool Silver badge

      Re: We had to destroy the village...

      Remember when "data wants to be free"? Looks like AI has changed that.

  10. FuzzyTheBear Silver badge
    FAIL

    Theft

    AI is about theft of data and copyright infringement. It's massive, but it seems the US is OK with it! Well, they are OK with abducting presidents, invading other countries, and stealing their natural resources. Data is nothing compared to the lawlessness of that failed experiment.

  11. SoulFireMage

    Just adds a new evolutionary pressure in the intelligence arms race. In other words, if you have poisoned data like this, you simply provide known ammo for others to work out methods of detecting it, automatically. So the collective intelligence available jumps. It's a short-term fix only, and I don't believe there's a long-term fix either.

  13. Anonymous Coward
    Anonymous Coward

    777 GAME AT ONE - REAL STUFF IS THERE

    17ers still in play.

    1 controls the Gnomes.

    MCHAMMER Can't Touch This

    1. Expect Great Things
      WTF?

      Re: 777 GAME AT ONE - REAL STUFF IS THERE

      Hmmm. That’s the general idea, sure, but it’s not exactly “subtle poisoning”, is it?

  14. kurios

    I don't know why, but this story immediately brought to mind the US government's efforts in the late '70s to stem the use of weed by contaminating it with paraquat.

    The question "What could possibly go wrong?" seems appropriate here.

    1. djnapkin

      Crikey, I'd forgotten that, but yes, the paraquat poisoning did happen. Can confirm.

  15. Richocet

    As someone who works with corporate data, I can tell you it is difficult to spot bad data in a large data set, and doing so costs significant money and time.

    I can see why this technique will be effective at deterring people from stealing the data to train AI. The best options for the people building the AI are: pay for quality data (result: good AI), avoid using that dataset (less good AI, but not seriously broken), or use the poisoned data and get poor-quality AI that no one is willing to pay for.

    I work in the space of marketing data. Poisoning of the marketing data that the data brokers trade in would be a big problem for them. There are records for over 1 billion people, and those companies all have an allergy to employing people to check quality, so they would not spot the data quality being degraded. Cleaning such a large data set would cost more than the data set was worth, even at a much lower level of poisoning than the examples in this article.

    1. ecofeco Silver badge

      Hoisted by their own petards.

      Hilarious.

  16. retiredFool

    Why bother

    It already screws up so badly as is. I was on Yahoo's finance site and looked at JOBY's ticker. The AI helpfully recommended some healthcare article under a different ticker to get more info on JOBY, a VTOL aviation company. All the stats and graphs on the page were correct for JOBY. At least Yahoo has not enshittified that part with AI (yet).

  17. Ernst Blofelt

    It's not hallucinating, unless you've fed it drugs; it's talking Sh!t€. I do wish people would use the correct terminology when referencing the death of humanity.

  18. goblinski Bronze badge

    Can't remember who wrote that short story in the eighties...

    It was about cash traffic in sealed briefcases from Istanbul to Copenhagen or some other Western capital. The briefcase travelled by train and would be passed to a different courier at each border crossing. It was an elaborate arrangement in which the senders knew little about who the courier would be.

    The Bulgarian courier thought himself the smartest - he found a way to skim cash off the top without disturbing the seals, knowing full well that the theft would be discovered only at the endpoint, once the briefcase was unsealed and opened. So he made a little habit of it.

    The story ended with news of a man dying in a briefcase explosion in a train in Bulgaria, close to the Serbian border. Turned out the organizers didn't need to know who the courier was. They just needed to boobytrap one shipment and let the end recipient know.

  19. Anonymous Coward
    Anonymous Coward

    Omitted details

    The article doesn't really ever explain how the key unlocks access to just the desired components of the knowledge graph — that is, how it disables the poisoned components.

    The details may be highly technical, but surely *something* further could be explained? (E.g. for encryption, one doesn't have to get into the mathematics of a particular trapdoor function to explain more generally how it's applied.)

  20. Brl4n

    tired yet?

    Tired of all the productive things AI is doing yet? This has become a circus within a circus.
