Another Implementation
One AI application is too many for me and a thousand are not enough. OK, I guess I'm seen as a drinker with an AI problem.
Those spiffy AI systems that tech companies keep promising require mountains of training data, but high-quality sources may have already run out—unless enterprises can unlock the information trapped behind their firewalls, according to Goldman Sachs. Training data is the Achilles' heel of massive new AI models, as detailed by …
Goldman Sachs discovers model collapse and thus hurtles into the 2nd quarter of 2023. Just wait until they start reading 2024's research, because it doesn't get more optimistic with time outside of the marketing spiel.
How did people write dictionaries and encyclopedias? By paying people (and volunteers) to collect and write down the data. [1]
Now that the AI gangs have scraped all of Wikipedia several times over, and all of arXiv et al., they want to scrape the barrel by breaking into private data stores.
But enterprises that spend hundreds of billions of dollars on creating AI models could pay a few (tens of) million people some dollars to speak their minds and tell them what they know.
In the end, AI will have to be based on new written and spoken data to keep up with the times.
But to be able to do that, the AI gangs should get rid of the robber barons who want to rob and never give anything back.
[1] Read The Professor and the Madman: A Tale of Murder, Insanity, and the Making of the Oxford English Dictionary by Simon Winchester
Some players are already getting access to your data, and have been for several years.
Why do you think you get free cloud storage from the likes of Microsoft, Google et al., and why is Microsoft trying to get all Windows 10 and 11 users to sign up for mandatory cloud use by forcing Microsoft accounts and backup as part of OS setup and of registration for extended security updates?
Check the terms and conditions you signed up to (and the modifications it's your responsibility to keep up with) on all of these services. Oh, yes, they'll say that it won't be sold or used for marketing purposes, but I'll bet you that in the small print there will be something that allows the data to be scanned by that company's AI model.
Everyone who asks you to set up an account, including organisations like the BBC [1], is doing it. Next they'll be wanting Digital ID!
TANSTAAFL.
[1] This year, for the first time, you need to sign in to a BBC account to vote on Strictly Come Dancing. No telephone voting!
LinkedIn recently asked me if I wanted my data to be used to train AI. Obviously, no.
But what's interesting to me is that when I do look at that platform, I see a place where 50% of the posts are already AI-generated, so I can't imagine there's a huge amount of useful data there. Filtering out the AI noise would be hard.
It is all clear now !!!
We all see 'AI' failing to reach the heights of the marketing drones' 'pitch'.
Is it the models ??? ... [Make a bigger one !!!???]
Is it that you are not asking the right question in the right way ??? ... [Train the users to 'Prompt Engineer' !!!???]
Is the model going 'off piste' ??? ... [Improve the Guardrails !!!???]
NO NO NO NO !!!
The problem is the Data is 'No Good' !!!
Quelle Surprise ... NOT !!!
You 'hoover' up the interWebs ... scrape off the green mouldy bits ... and process it into your LLM ... what do you expect !!!???
I am sure someone said ... "This is not going to work !!!" at least once !!!
Do you believe them now !!!
Once you improve your data by curating it well, you will be able to use the LLMs to 'make up' even better and more convincing excuses why 'AI' does not work.
Hurrah ... all is well in the 'AI' world !!!
:)
Having more "data" than the entire staff of any university department (or whole university) have seen in their combined lives is apparently not enough to give reliable advice.
This holds for every and any subject matter.
This tells me something is completely wrong with the neural models they use.
"This tells me something is completely wrong with the neural models they use."
Therein lies the answer to your conundrum ... in full sight yet seen by no-one ....
'AI' is just 'modelling' and models can be wrong !!!
The assumption is that if you process enough data through the model during training it will get better, i.e. it will fool the public into thinking it is 'AI'.
Evidence to date is that the assumption is wrong ... BUT the tech/'AI' behemoths will not back down ... they want it to be true !!!
Science does not work that way ... what you want is not what you always get ... sometimes the assumption was WRONG !!!
Real scientists/engineers/etc learn from the failures and try again ... NOT keep selling the vision without any hope of success.
The marketing drones are inventing the future without it being supported by what can be done NOW or even SOON !!!
You can only get away with selling the vision for so long ... before the mug/punters ask for what they paid for !!!
To resurrect a slogan of old (which was/is known in the US of A, so I am told).... "Where's the beef !!!???"
:)
"Evidence to date is that the assumption is wrong ... BUT the tech/'AI' behemoths will not back down ... they want it to be true !!!"
That's not enough: they've bet everything they have on it being true. So they will lose everything, including their shirts, when it isn't.
Or specifically, their companies and customers will. The hype masters have their personal money in some safe place: these people aren't daft, they're liars.
When they do stupid rounding and math like this
https://www.theregister.com/2025/06/30/ai_models_favorite_number_27/#:~:text=True%20randomness%20is%20hard,or%20pseudo%2Drandom%20number%20generator.
That's not learning, and it's not intelligence - it's shitty statistics at best.
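For anyone wondering why a "favourite number" emerges at all: LLMs pick "random" numbers by sampling the next token from a skewed probability distribution, and lowering the temperature makes the skew worse. A toy sketch (every probability below is invented for illustration; `sample` is just stdlib `random.choices`, not a real model):

```python
import random
from collections import Counter

random.seed(0)  # for reproducibility only

# Invented next-token distribution for "pick a number between 1 and 50":
# training text mentions culturally popular numbers far more often.
probs = {27: 0.30, 7: 0.20, 42: 0.15, 13: 0.10}
rest = (1.0 - sum(probs.values())) / (50 - len(probs))
for n in range(1, 51):
    probs.setdefault(n, rest)

def sample(temperature=1.0):
    """Draw one 'random' number the way an LLM samples a token."""
    weights = [p ** (1.0 / temperature) for p in probs.values()]
    return random.choices(list(probs), weights=weights, k=1)[0]

counts = Counter(sample(temperature=0.7) for _ in range(10_000))
print(counts.most_common(3))  # the 'favourite' numbers dominate
```

Crank the temperature down further and it answers 27 nearly every time; true uniform randomness would need an actual RNG, which a next-token predictor isn't.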
The AI dog has eaten all its own dogfood and is now reduced to eating its own shit, which is not unknown in pet dogs but not very nutritious for either kind.
Why don't humans suffer from model collapse? (rhet.)
That is a simple question and the answer isn't solely that humans mostly don't eat shit (talk and write it – yes.)
The AI peddlers are keen to paper over the unbridgeable gulf between what the biggest and best LLMs are actually capable of and the simplest capabilities of even the dimmest of our species (and that really is pretty dim).
Undoubtedly, the technicians stuffing new AIs indiscriminately with 'content' don't attempt to question the provenance and quality of the materials they handle. Presumably, when the Internet is scraped, there is negligible human oversight in that regard.
Powers of discrimination once marked the educated man. No longer. It's deemed improper to categorise people, lifestyles, and, by extension, information, in terms of quality, desirability, usefulness, and worthiness for attention. Similarly, 'truth' (or better, William James's 'cash-value of an idea') is determined by the number of believers rather than by appeal to logical antecedents.
We inhabit a world of deliberately induced mediocrity (and worse still for the bulk of humanity). Peddling tawdry values via the entertainment industries and social media, is highly profitable; yet, the end-aim runs more deeply in the ambitions of the WEF.
Discrimination is still deemed perfectly proper. You just have to be discriminating in your criteria for discrimination. Typically, discrimination to the detriment of a person on grounds that a person cannot change about themselves is now reckoned to be of doubtful morality. Depending on how seriously it affects them, you are expected to show restraint. But intelligent and polite people still manage to differ on these points and have civilised arguments about their reasons.
It's just the loud-mouthed tossers you need to watch out for. But that was ever thus.
With all the scraping and pirated book downloads, it is highly unlikely that the LLMs didn't get to read all the "input" that we did. From "Topsy and Tim"[1] or "Ant and Bee" through to all the Uni textbooks, 'Horowitz and Sahni' or 'Winston and Horn'. Plus a lot more: instead of just wandering around town and hoping to remember where all the streets are, they had the Lonely Planet Guide to London, similarly for pretty much everything else we've learnt. Plus yet more: all the places we've not been to, all the study courses we didn't take, the languages we don't speak, the pictures we haven't seen (yet), the museums we haven't been to (yet). We read all the Asimov and Clarke, it had that plus Dan Brown[2], Hubbard and Meyer - on the brighter side, it also managed to finish wading through Dostoevsky, Thucydides and Lessing[3].
So it had everything we did, an awful lot more - and yet it isn't up to snuff.
But instead of looking into why[4] and how to do better, just blame the failings on a "lack of input"! And where can they get that input from? Why, from inside all of the companies and institutions they are trying to flog the LLMs to! "See, you are viciously hoarding *just* the thing needed to make this LLM useful to you and everyone else! It is all your fault."
Of course, if they do convince people to hand over the goods (or convince others to pressure for that data, because they've been dragged into the "must use LLM" mire and it's easier to join the blame game than to eat sunk costs) and the LLMs are *still* no damn use, what are they going to blame next?[5]
[1] well, probably not the classic "Wednesday Book" etc but the modern remakes that are available as ebooks
[2] oops, sorry, we're supposed to be talking about "high quality input"
[3] I know, they are classics, guess I'm just shallow
[4] and we know why, no need to dive into that *again*. Suffice it to say these things can't think, don't contain any ability to apply methods described in all those textbooks to the bit of data you provided in your query.
[5] maybe I've been watching too many YouTube videos, but I'm thinking it'll descend into paranoia (institutions are holding out on us), executives issuing orders demanding that AI companies be given access to ensure their countries dominate, etc. etc.
What's next? Hmm. Well once you have used up all the training data in the world you have two options:
1, the AI bubble bursts
2, you train on AI-generated data and keep going for another year before model collapse and then the AI bubble bursts
We're getting close. My guess is that the idiots will go for 2 and so it will be 2027 when the bubble bursts. Plan your pension investments accordingly.
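Option 2 can be sketched with a deliberately crude toy (everything here is invented; a "document" is just an integer): each generation's training set is a resample of the previous generation's output, and diversity can only ever shrink.

```python
import random

random.seed(42)  # for reproducibility only

# Generation 0: a 'corpus' of 1000 distinct documents (stand-ins, just ints).
corpus = list(range(1000))

for gen in range(1, 11):
    # A model trained on its own output can only reproduce what it saw, so
    # each new 'training set' is a resample (with replacement) of the last.
    corpus = random.choices(corpus, k=len(corpus))
    print(f"generation {gen}: {len(set(corpus))} distinct documents left")
```

Because each resample is drawn from the previous one, the set of surviving documents can never grow; after a handful of generations most of the original variety is gone. Real model collapse is subtler, but the one-way ratchet is the same.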
"it is highly unlikely that the LLMs didn't get to read all the "input" that we did."
... which is the problem, per se, as an LLM has no way to 'know' anything; any statement anywhere is as 'true' as any other, and the Internet is like the Bible: every statement has an exactly (or effectively) opposite counter-statement.
Adding more data makes it even more noise, not less.
But naturally these people don't and won't understand that.
Someone help me out here. I don't really understand how AI works. But if I'm trying to come up with a medication to conquer the heartbreak of psoriasis, why does my AI agent need to know all the wives of Henry VIII, the full text of Shakespeare's plays, the geography of the Malay Peninsula, the dietary needs of the Galapagos Tortoise, the full score (words and music) to "My Fair Lady", and all the junk ever posted on Facebook? Shouldn't all the available knowledge of physics, organic chemistry, and medicine suffice?
Ding Ding Ding ..... give that man a Kewpie Doll !!!
That is the problem they cannot solve ... because they do not know how it works, they feed in EVERYTHING they can get.
The hope is that the junk is ignored when getting your answer.
You want a new drug? Then the 'AI' will ignore the comments on El Reg BUT may include the comments on the latest paper on drugs that are effective against the most persistent 'Super-Bugs'; then again, it may not, BUT may action the recommendation to see the latest xkcd comic in a comment on an article discussing 'AI' & drug design and then try to process that !!!
'AI' is the ultimate manifestation of 'Self modifying code' ... as you feed in more data it changes the rules to how it works for certain inputs/questions (AKA Prompts).
What is there today is not there tomorrow, unless there has been NO data or prompts processed.
Each prompt, when processed, impacts the 'model', and this means the rules have changed in some way, BUT this change cannot be understood without totally dismantling the 'model' ... value by value, index by index, understanding each purpose within the model as a whole.
It is a 'clever pattern matcher' or an 'educated guessing machine' for an unknown value of 'Educated' & 'Guessing' !!!
Instead of having a 'man behind the curtain' ... we have elected to have the 'wind randomly blowing the curtain' do the job instead.
You can only tell the difference if you 'pull back the curtain' and see what is there !!!???
That is the 'AI' we are heading for ... at a rate of knots that is dangerous !!!
:)
"if I'm trying to come up with a medication to conquer the heartbreak of psoriasis"
I cannot see how AI/LLM is supposed to help here.
If I were commencing this task I would search the medical-scientific literature to gain a broad view of as many aspects of the condition and the various attempts at treatment as possible. Hopefully the (recent) literature could shed some light on the underlying causes and consequent pathological processes, as well as predisposing and aggravating factors and the rationales behind current treatments.
Fundamentally I am attempting to synthesize a broad understanding of the whole condition and its treatment in order to apply the better options to treating a sufferer, or to inform further research to discover more of the underlying pathology, which could be used to develop new treatments or fine-tune the existing ones.
It is difficult enough sorting through the chaff of low quality papers and studies without intercalating AI.
FWIW: both oatmeal baths and pine tar based treatments have their proponents.
"if I'm trying to come up with a medication to conquer the heartbreak of psoriasis"
I cannot see how AI/LLM is supposed to help here.
I do, and the OP is right: when you know what problem you're trying to solve using "AI", then you can feed it relevant data only. For example, if you want to triage the millions/billions of galaxies that modern telescopes can see into some relevant categories, your model won't need to know any Shakespeare play. It's enough to feed it thousands of examples of galaxies with their classification, and it can handle the rest.
The problem with the all-purpose generative LLM AIs is that they pretend to answer every possible question, and thus need to be fed with every possible subject.
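As a minimal sketch of the galaxy example (the features, numbers, and labels below are all invented for illustration), a nearest-centroid classifier "trained" on a handful of labelled galaxies needs no Shakespeare at all:

```python
import math

# Invented toy features per galaxy: (ellipticity, bulge/disk ratio).
training = [
    ((0.05, 0.9), "elliptical"), ((0.10, 0.8), "elliptical"),
    ((0.70, 0.2), "spiral"),     ((0.65, 0.3), "spiral"),
]

# 'Training' here is just averaging each class's features (nearest centroid).
centroids = {}
for label in {lbl for _, lbl in training}:
    pts = [feats for feats, lbl in training if lbl == label]
    centroids[label] = tuple(sum(vals) / len(pts) for vals in zip(*pts))

def classify(features):
    """Assign the class whose centroid is closest to the given features."""
    return min(centroids, key=lambda lbl: math.dist(features, centroids[lbl]))

print(classify((0.68, 0.25)))  # → spiral
```

A real pipeline would use a proper model and far richer features, but the point stands: a narrow, well-defined task needs only task-relevant training data.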
... to help AI companies sell services at a higher price, and also to competitors ...
If those data are "trapped behind firewalls" there's a reason, often a good one.
Moreover, in the EU a lot of those data - customer data, etc. - are protected by the GDPR, not copyright alone.
Even if they did give it up, what value is there to LLM training in that private data? The big databases of any customer-facing company are always reservoirs of dirty and inaccurate data - they certainly were at the various large companies I worked for. Where some inference is needed (e.g. creditworthiness) there are already established algorithms that do that, and if LLM developers want to include that, then they can easily buy data sets for it and combine them with other data (free or stolen) as they see fit.
The hand-wringing "we need more data" is bollocks - LLMs are just statistically driven word jumblers that predict a probable "answer" to a given request. Putting more data in at the front won't make any difference.
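For the avoidance of doubt about what "statistically driven word jumbler" means, here's a toy bigram predictor (the three-sentence corpus is invented; real models are vastly larger, but the principle of predicting the probable next token is the same):

```python
from collections import Counter, defaultdict

# Invented toy corpus; a real model trains on trillions of tokens.
corpus = ("the model needs more data . "
          "the model needs more data . "
          "the model jumbles words .").split()

# Bigram model: for each word, count which words follow it.
follows = defaultdict(Counter)
for a, b in zip(corpus, corpus[1:]):
    follows[a][b] += 1

def predict(word):
    """'Answer' with the statistically most probable next word."""
    return follows[word].most_common(1)[0][0]

print(predict("model"))  # → needs
print(predict("more"))   # → data
```

Notice that feeding in more copies of the same kind of text only sharpens the same probabilities; it adds no mechanism for knowing whether "data" is actually the *right* answer.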
“LLM developers want the world, but for free - if they have to pay for it, it ruins the business model.”
I dunno; because of LLMs, suddenly we all have Anna's Archive. They're trying to put together a "free" database including every single book in the ISBN database.
It's not only sales data - there is a lot of internal data that is of course part of the company "knowledge", which in turn is often what makes some companies successful - and others not. Can you see companies letting AI peruse their product blueprints and recipes, test/QA results, business plans, internal research and audits, market research and its results, etc. etc.?
Why should they put that knowledge in AIs accessible to others? For the benefit of mankind - where "mankind" means Altman/Musk/Ellison?
>.. The big databases of any customer facing company are always reservoirs of dirty and inaccurate data
Ah so you have seen it as well: "all our data is fragmented all over the place, we need to sort it so we can find stuff - this new database should sort it"...
.. one new humongous database later........they still can't find the data !
This is why MS is pushing AI in all of its products: because this way they can infiltrate enterprises and at one point they will start (or they already do) sucking data from "private" stashes to train their AI.
At this point in the rise and fall of LLMs, it's more important to have fresh data than GPU power.
And since the internet has been dead (as in the dead internet theory) for quite some time, there is no more good data to scrape in the public domain. The data that was worth having has already been scraped (multiple times, by multiple entities) and has already been diluted into an outflow of AI slop, so in the end it's useless.
90% of current web traffic is AI suckers (disguised as browsers) and 50% of web content is AI slop (disguised as articles and blog and social media posts)
"need humans to monitor them"
We tradesmen do that already with apprentices. Not sure why people don't see that parallel in the more white-collar fields.
A bit of Schadenfreude, we were told to 'get with the times' as the CNC machines slowly automated away our skillset for profit... Your turn.
I like this bit "Cleaning your data, normalizing it, having the semantics of the data understood, all of this stuff is what's going to allow enterprises to level up,"
Very true. If all our silos were neatly categorized, properly stored, and available, then we'd have all sorts of ways of analysing and extracting meaningful information from them - no LLM required. Of course, we'd have to lock it all down to a very few people; otherwise we would be leaking all the secret sauce right then and there, off the battered fish and right onto the surrounding newsprint.
"Trapped" enterprise data?
It's already starting to sound like oil and gas exploration, always having to find harder and harder to extract sources of hydrocarbons. Doesn't really matter if the AI fracking causes all the groundwater of actual, real human interaction in the exploited area to turn toxic; progress must be served
Soon all that will be left to "liberate" is human brain activity itself, perhaps stimulated with virtual reality simulations and harvested to nourish and repair the models that automatically run the world, to keep them from suffering lethal model collapse...
Ah, that is how the world of Matrix came to be.
Some even say this is what those ICE detention facilities are REALLY for... the XCOM/Matrix crossover no one asked for.
For anybody who thinks that allowing any supposed "AI" near real organisational data is a spiffing idea, I suggest they read the cautionary tale of how allowing an externally controlled "AI" agent (Salesloft Drift) complete, unlimited access to any and all data it could reach (presumably because the developers were too lazy/stupid to demand anything other than full access permissions and privileges - or is this just the "break things" part of "move fast and break things"?) resulted in a massive data breach.
https://www.wtwco.com/en-us/insights/2025/09/the-drift-oauth-breach-a-cybersecurity-wake-up-call has the "skinny", to use the vernacular.