Reddit to Perplexity: Get your filthy hands off our forums

Reddit on Wednesday filed a lawsuit against Perplexity AI and three of its alleged data dealers for trafficking in unlawfully scraped information. The complaint, filed in the Southern District of New York, claims that Oxylabs UAB, AWM Proxy, and SerpApi unlawfully bypassed Reddit's and Google's defenses to harvest Reddit …

  1. vogon00

    "called out Perplexity for running web scraping bot"

    I'm not a fan of AI, and lately not of Perplexity... for no other reason than the YouTube adverts for them that keep getting shoved at me.

    The theme of these ads is that "<AI Name> is giving straight-up wrong answers" and that Perplexity "scans the entire internet in less than a second". The first is just competition-bashing, the second is plain old lying bullshit.

    1. mark l 2 Silver badge

      Well your problem there is not using an adblocker on YT. It's just become such a terrible experience that I honestly can't go on Youtube without one as you get 10 minute videos with 3 lots of ads these days. I do realise that's how a lot of YouTubers get paid though, so if I've liked a video someone's created I'll open it in my second browser without an adblocker installed and click on one of the ads to send them some money, or if I've watched a lot of their content then let their video playlist play with the ads.

      1. Jellied Eel Silver badge

        Well your problem there is not using an adblocker on YT. It's just become such a terrible experience that I honestly can't go on Youtube without one as you get 10 minute videos with 3 lots of ads these days.

        I think YT demonstrates a lot of the problems with AI. So YT 'recommends' me videos, supposedly based on the billions AlphaGoo has invested in data harvesting, analytics and AI. That means the exciting game YT has developed: clicking on recommended videos and going through the 'Don't recommend channel' option to make content I'm just not interested in go away. Then next refresh, I get to do it again! Such fun!

        There is a 'Not interested' option, but that doesn't give any options to explain why I'm not interested. If it did, then it would be a simple and user-friendly way for YT to tailor content recommendations. Instead it just serves as an example that despite AI, analytics and user profiling.. AlphaGoo really doesn't understand me at all.. Which then extends to the ads, because when I do enable those, the ads are rarely relevant to either the video, or me.. and no amount of repeating the same ads over and over again is going to change my perception. And then the 10min video might also include a 2min sponsor segment.

        Which is where YT is maybe an AI canary. It's backed by the power of AlphaGoo, yet doesn't seem to do anything useful for either users, content creators or advertisers. So assuming AIs can be trained on YT or Reddit seems.. Optimistic. What would an AI actually learn from scraping Reddit content, apart from a lot of noise? And then with Reddit's often rather.. aggressive moderation, would what it thinks it's learning be accurate, or biased?

    2. Anonymous Coward

      “Ben Lee, chief legal officer at Reddit, told The Register in an emailed statement that AI companies are desperate for quality content generated by real people and that need is fueling an industrial scale data laundering economy.”

      Although Reddit is orders of magnitude better than the cesspits of Facebook and Xitter… I’d question what ‘quality content’ they are getting…. though agree some of the posts on r/fryup - largely oriented around (British) breakfasts - hits the spot, esp. if there is some black pudding on the plate.

      1. vogon00

        "Reddit is orders of magnitude better than the cesspits"

        OK, but I still can't see *why* Perplexity want Reddit content. I gave up looking for answers and/or useful info on there pretty much as soon as I started :-) I'm sure there is some decent stuff on there, but you have to wade through so much 'That didn't work for you, but it did for me, so you must be an idiot!' and nit-picking bullshit, not to mention downright 'wrong' stuff.

        I can't see what Perplexity hope to gain by trying to 'learn' from that 'noisy' environment.

        Unless, of course, they've decided to turn out a product that suits its name - perplexed!

        1. that one in the corner Silver badge

          Reddit is not a place for useful answers.

          However, it is great for winding up people who think they have a sense of humour but really don't; just go to any of the sections intended to mock the afflicted (r/tragedeigh is a favourite) and you will soon spot someone who is due a quick "let me just google that for you". Cruel and petty, but given the nature of the beast you are poking...

          Whether that makes Reddit useful as training data for an AI is an open question. Is there a place for a cruel and petty AI? Is there a place for an AI that mocks or for one that is mockable? Not that there is any shortage of the latter, as can be read about here on El Reg.

          1. Anonymous Coward

            It’s a much more agreeable place so far (or so I have seen).

            No shit-posting or AI reels for starters.

        2. Anonymous Coward

          Like most intelligence - you generally consume pre-prepared learning materials, do some practicals/experiments, take some exams, either continue education or gain employment/ graduate entry position and learn by doing or experiencing.

          Language models, and trawling the slop of the Internet after consuming - but not understanding - the sum of mankind’s knowledge and running some lossy statistical models on it isn’t going to end well. Ever.

  2. Bran Muffin

    Am I understanding this correctly?

    "...we will always fight vigorously for users' rights to freely and fairly access public knowledge. Our approach remains principled and responsible as we provide factual answers with accurate AI, and we will not tolerate threats against openness and the public interest."

    The only way I can think to read this statement is that Perplexity is saying that they are going to scrape, scrape, scrape, whether anyone likes it or not.

    1. Anonymous Coward

      Re: Am I understanding this correctly?

      The other way to read it, is as it is stated. This is based on the idea that "information wants to be free" crossed with believing that it should be free to anyone who can work out how to access it.

      Yes, this is directly against most Copyright regimes, and many corporations believe that information (aka "content") should never be "free-for-all". You're only understanding this from the standpoint of "content belongs to the people who collected data from people, and then served it to other people."

      There's a fairly cogent argument that such content collections as Reddit and Twitter/Facebook/X/Social implement should not get exclusive rights to control (and collect rent on) all of said content in perpetuity. Very few redditors or original content creators on content aggregation social media get paid anything at all.

      1. Andy Tunnah

        Re: Am I understanding this correctly?

        That kinda ignores what the ACTUAL content creators want, and you'd be hard pressed to find anyone who was OK with some AI company scraping their creations, thoughts, comments, whatever and using it to pump up their AI bullshit stocks.

        1. doublelayer Silver badge

          Re: Am I understanding this correctly?

          They don't care what the content creators want. That's clear from the use of the old "information wants to be free" nonsense. While they talk about companies as the ones using copyright they disapprove of, you can count on their not having a markedly different opinion about individuals' copyrights. It just makes it easier to make the "copyright should be eliminated" argument if you only talk about less sympathetic copyright owners and make up facts like the allegation that Twitter and Reddit named themselves monopolies over distribution rights, which neither has.

      2. Filippo Silver badge

        Re: Am I understanding this correctly?

        >There's a fairly cogent argument that such content collections as Reddit and Twitter/Facebook/X/Social implement should not get exclusive rights to control (and collect rent on) all of said content in perpetuity.

        They shouldn't. Those rights should always remain with the users. Doing things to my data without my consent is not freedom. It's the opposite of freedom.

        The argument that Perplexity should be able to download the whole of Reddit because Reddit shouldn't have exclusive rights to Reddit content is upside-down logic. Reddit should take steps to prevent Perplexity from doing this, specifically because Reddit does not have all rights on my content, and they definitely do not have the right to redistribute it for purposes not explicitly declared in their T&Cs.

        This is as if I picked one of those successful web novels that people post on Reddit, reformatted it, printed it, and sold it in libraries. The guy who originally wrote it would come after me, win easily, get all my profits and then some, and damn right he should.

        Publishing something does not mean putting it in the public domain; I get that some people believe that, but that diminishes freedom, because it means that rights are taken from me, not given to me.

        My rights are mine until I explicitly decide they are not. Taking them against my will isn't freedom, it's merely economic might-makes-right... and corporations will always win that one.

    2. that one in the corner Silver badge

      Re: Am I understanding this correctly?

      >> we will not tolerate threats against openness and the public interest

      Because, of course, if Perplexity wasn't scraping, members of the public would be unable to go to Reddit and read anything from it that interested them.

      Reddit is well known for being closed to public access, preventing even the most basic of searches through its content, never allowing working URLs to individual posts, taking every precaution it can to ensure random passersby never sneak a peek into its content lest they be tempted to scroll down and - r/aww look at the puddy tat! Oooh, other people who r/collectpenciltoppers. What? No, no, I'll finish writing the comment later, I've just spotted that somebody is wrong on the Internet...

  3. LBJsPNS Silver badge

    Any data scraped from Reddit is pretty much trash at this point.

    1. Anonymous Coward

      If an AI engine is scraping Reddit to understand the human condition then we're all doomed.

      I had the misfortune to wander into Reddit lately, specifically subreddits around entrepreneurship and business startups. It was so sad. Just desperate people trying to sell desperate business ideas to other desperate people. If it had a physical equivalent it would be a shabby old betting shop full of red-eyed tired old people and the floor littered with ripped-up failed betting slips.

      There are desperate people in this world and we should treat them with compassion - I'm not here to make people's lives harder. But if you have a problem and you think Reddit is the answer, then you need lots more help than you think you do.

  4. Anonymous Coward

    dystopia

  5. This post has been deleted by its author

    1. Anonymous Coward

      Re: Test

      Publish and be damned!

  6. Jason Hindle Silver badge

    As much as I like the Perplexity Product

    Reddit is right. This needs to be consensual and equitable. Otherwise, we risk sources of training for AI models going bust/drying up.

  7. Blackjack Silver badge

    I don't like modern Reddit but I wish most companies did sue AI companies for pirating and stealing.

  8. FF22

    Reddit stole it first

    Reddit's argument is that it was the first to steal and monetize the user-generated content it didn't pay for, therefore Perplexity shouldn't have the same right. And Perplexity argues that it should also have the right to steal it and monetize it.

    In an ideal world both companies would be dissolved and their leaders jailed. Then again, we live in the opposite of an ideal world, led by the US.

    1. doublelayer Silver badge

      Re: Reddit stole it first

      If Reddit went around copying it from elsewhere, your argument would make sense. They didn't, so it doesn't. Reddit did not steal content. People chose to post it and grant Reddit the limited rights stated in their terms.

      In general, I'm a lot less concerned about Reddit being scraped than most of the other training data that LLM companies have been stealing because people did agree to make that public, whereas plenty of the remaining training data was collected with even less consent. However, that is still not enough to give people permission to use it for unlimited commercial use and Perplexity's use involves putting a lot of costs on Reddit's servers.

  9. cookiecutter Silver badge

    perplexity being scum?

    colour me shocked

  10. Anonymous Coward

    How dare someone steal the user content before Reddit does it themselves!
