Why (how?) do the LLMs crawl behind the paywalls, in the first place?
OpenAI pauses Bing search feature over paywall bypass abilities
OpenAI's experiment with allowing ChatGPT to search the web via Bing has been suspended because the feature inadvertently allowed users to bypass paywalls. The feature was first rolled out in May and limited to paying ChatGPT users. OpenAI updated its help page for the "Browse with Bing" feature yesterday to indicate that, as of July 3, …
COMMENTS
Wednesday 5th July 2023 19:08 GMT DS999
Some paywalled sites allow web crawlers (I'm guessing OpenAI uses a web crawler similar to Googlebot to collect its data) to access the full text, so that searches will return their site more often. The problem comes when you see that site in a search, click on it, and find nothing to do with your search terms in what you can see without paying.
Wednesday 5th July 2023 20:13 GMT FILE_ID.DIZ
Well... might not be that simple.
Google, Bing and DDG, for example, maintain lists of their bots' IP ranges. Google and Bing also seem to use FCrDNS (forward-confirmed reverse DNS) with specific domains (googlebot.com and search.msn.com, respectively), and DDG uses duckduckbot-X.duckduckgo.com, where X is an integer, it seems.
Bing bot IP ranges - https://www.bing.com/toolbox/bingbot.json
Google bot IP ranges - https://developers.google.com/static/search/apis/ipranges/googlebot.json
DDG bot IP ranges - https://help.duckduckgo.com/duckduckgo-help-pages/results/duckduckbot/
And as the saying goes - the "good guys" always have to be right. The bad guys (in this example, trying to circumvent a paywall) just have to be right once.
And I have to imagine that writing middleware to update whatever application(s) is/are responsible for allowing spiders in, based on third-party-provided IP data in formats that aren't RFC-standardized, might be harder than just looking for a UA string. At least until someone in the bean counter department notices.
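For anyone wondering what the stricter check might look like compared with trusting the UA string, here is a rough Python sketch, not anything a particular site is known to run. It assumes the published IP ranges have already been parsed into a plain list of CIDR strings, and the names `CRAWLER_DOMAINS`, `ip_in_ranges` and `verify_fcrdns` are made up for illustration. The lookup functions can be swapped out so the logic can be tried without touching real DNS.

```python
import ipaddress
import socket

# Hostname suffixes taken from the domains mentioned above.
# Treat the exact list as an assumption, not an official one.
CRAWLER_DOMAINS = (".googlebot.com", ".search.msn.com", ".duckduckgo.com")


def ip_in_ranges(ip, cidrs):
    """Return True if ip falls inside any of the given CIDR blocks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidrs)


def verify_fcrdns(ip, allowed_suffixes=CRAWLER_DOMAINS,
                  reverse_lookup=None, forward_lookup=None):
    """Forward-confirmed reverse DNS check for a claimed crawler IP.

    1. Reverse-resolve the IP to a hostname.
    2. Require the hostname to end in an allowed crawler domain.
    3. Forward-resolve that hostname and require the original IP
       to be among the results.
    """
    if reverse_lookup is None:
        reverse_lookup = lambda a: socket.gethostbyaddr(a)[0]
    if forward_lookup is None:
        forward_lookup = lambda h: {
            info[4][0] for info in socket.getaddrinfo(h, None)
        }

    try:
        host = reverse_lookup(ip).rstrip(".").lower()
    except OSError:
        # No PTR record or lookup failure: cannot confirm.
        return False

    if not host.endswith(tuple(allowed_suffixes)):
        return False

    try:
        forward_ips = {ipaddress.ip_address(a) for a in forward_lookup(host)}
    except (OSError, ValueError):
        return False

    return ipaddress.ip_address(ip) in forward_ips
```

A request claiming to be a search bot would only be let past the paywall if `verify_fcrdns(client_ip)` (or `ip_in_ranges(client_ip, published_cidrs)`) returns True. A UA-only check skips all of this, which is why a spoofed UA string is enough to get through.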
Wednesday 5th July 2023 19:11 GMT heyrick
Thank god
Sometimes Bing can find stuff that Google is convinced doesn't exist.
The use of AI completely buggered up using Bing on my phone. Pressing the Enter arrow would give a new line in the text rather than starting a search, and there was no obvious on-screen method to get it to actually begin searching instead of trying to get me to edit the text.
At least, for today, the Enter button on the on-screen keyboard is a magnifying glass and it does what it is supposed to.
Thursday 6th July 2023 13:01 GMT irrelevant
Copyright
"I am not able to display the full content of articles from [the site you requested] or any other publication that is protected by copyright,"
So that's pretty much every single thing on the Internet, then, given almost everything you can find is copyright somebody or other. Even if they make it freely accessible. And it's even more of a stupid thing to say given they trained their LLMs on (copyrighted) data in the first place, without asking, which many people are upset about.