Models trained on transcripts are going to be very interesting, because I'm not sure I've ever seen a transcript that aligns 1:1 with what the speaker is saying. Every transcript has at least one word or phrase rendered differently from what was actually said. I'm sure that with enough data the subtle differences won't amount to much deviation, but it'll be interesting to see whether any of it manages to throw a wrench into a model, since apparently you can just straight-up perfectly replicate things models are trained on.
If you think AI labs wouldn't stoop to using scraped YouTube subtitles for training, think again
FYI: It's not just Reddit posts, books, articles, webpages, code, music, images, and so forth being used by multi-billion-dollar businesses for training neural networks. AI labs have been teaching models using subtitles scraped from at least tens of thousands of YouTube videos, much to the surprise of the footage creators. …
COMMENTS
-
-
Wednesday 17th July 2024 11:24 GMT imanidiot
Depends on the source of the transcript. Some YouTubers (especially the larger ones) will have either volunteers or paid staff manually transcribing the video and time-coding the result, which is generally fairly accurate. The auto-generated subtitles from YouTube itself... utter garbage for anything other than the clearest mid-Atlantic-accented spoken English.
-
Friday 19th July 2024 20:47 GMT druck
They'll have fun if they use some of the Japanese cartoons my kids used to watch on YouTube. They were obviously dubbed into English and subtitled by two completely different firms, so the names and phrases in each didn't match up at all; you really had to struggle to see that they were telling the same story, even though each vaguely matched the animation.
-
-
Wednesday 17th July 2024 02:59 GMT Bendacious
Quality data
As mentioned above, anyone who has tried to use the subtitles on YT will know that they are almost useless. In my experience they have a 100% record for getting technical terms wrong and cannot cope with anything outside simple words spoken with a mild US accent.
As these subtitles are automatically generated, this is the situation often discussed in Reg comments: AI training data produced by AI. I can't believe the data is not poisoned in many places, which kinda serves them right.
On a separate topic, no one abuses YT creators as much as YT. They recently enabled adverts on people’s channels where the creators had chosen not to advertise, without any option to avoid it. Big channels used by children and even babies (Miss Rachel, for example) now show pre-roll and mid-roll adverts against the wishes of the channels. Evil bastards.
-
Wednesday 17th July 2024 03:55 GMT Anonymous Coward
Re: Quality data
I could be wrong, but I'm *pretty* sure this is talking about human-written subtitles, not the auto-generated ones. Everyone knows the auto subtitles are trash, even YouTube; that's why you can still supply your own. Some YouTubers even hire dedicated transcribers for their videos. Besides that, generated subtitles would just be a synthetic dataset anyway, so you may as well generate a real synthetic dataset of videos with subtitles instead of a half-real/half-wrong one.
-
Wednesday 17th July 2024 07:23 GMT Anonymous Coward
Re: Quality data
We have been scraping transcriptions from YouTube for about two years now. We filter them quite easily because auto-generated subtitles don't have punctuation, whilst human-edited ones do.
However, it is nearly 100% certain that auto-generated punctuation will come very soon, and the reason YouTube is holding back on it is that it will be Scrape City when that happens.
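Something like this, though the regex, threshold, and names here are only illustrative, not what we actually run:

import re

# Heuristic: YouTube's auto-generated captions currently arrive without
# sentence punctuation, while human-edited tracks usually have it.
SENTENCE_PUNCT = re.compile(r"[.!?,;:]")

def looks_human_edited(transcript: str, min_punct_per_100_words: float = 2.0) -> bool:
    # Count punctuation marks per 100 words; below the threshold we
    # assume the track is auto-generated and drop it.
    words = transcript.split()
    if not words:
        return False
    punct = len(SENTENCE_PUNCT.findall(transcript))
    return (punct / len(words)) * 100 >= min_punct_per_100_words

# Example: the unpunctuated auto-gen style fails, the edited style passes.
print(looks_human_edited("so today we look at the new release and see what changed"))    # False
print(looks_human_edited("So today, we'll look at the new release and see what changed."))  # True

Once auto-generated captions start arriving with punctuation, a check like this stops working, which is exactly the point above.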
"... generated subtitles would just be a synthetic dataset anyway, so you may as well generate a real synthetic dataset of videos with subtitles instead of a half-real/half-wrong one."
Had to ask Gemini to make a dunce's version to understand it. Not that I don't understand what you are saying, just that it is difficult to extract from the sentence. Even Gemini struggled, but agreed with you.
And the reason I pulled this one comment on my day off is that this is the first roundabout mention of synthetic intelligence (SI) on here that I have noted and one which, whilst not quite there, does capture the point of unfathomable computing.
Let the machines make their own 'language' - not the abstract processing of data that we currently instruct them to do, after which we await reports giving just an overview of how it was done.
The more work I do in this area the more I am reminded of the Improbability Drive.
-
Wednesday 17th July 2024 10:39 GMT Brewster's Angle Grinder
I agree this sounds like human-written subtitles.
But, in this case, using AI-generated ones wouldn't be a problem, because the purpose is to label audio with text. Google will have invested $billions in developing an AI that does that. You don't have those resources. But by scraping YouTube videos, you can get the results of their training - and use them to train your own model to a similar standard. Then you use your leftover resources to try and better Google's effort.
This situation is not like training a model to write articles based on LLM-generated text. In that case, the input is connected to the output, so you get feedback skewing the results with every iteration. But this is a translation from one form to another, where you are trying to piggyback on somebody else's work.
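As a rough sketch of what that piggybacking looks like in practice - pairing scraped caption files with the matching audio as pseudo-labels for your own speech-to-text model. The directory layout and file naming below are assumptions for illustration, not anyone's actual pipeline:

from pathlib import Path

# Assumed layout: one directory holding the downloaded audio (.m4a) and the
# scraped caption track (.vtt) for each video, both named by video ID.
# The caption text becomes the pseudo-label for training your own model
# on the back of Google's.
def build_pseudo_label_pairs(data_dir: str) -> list[tuple[Path, Path]]:
    root = Path(data_dir)
    captions = {p.stem: p for p in root.glob("*.vtt")}
    return [(audio, captions[audio.stem])
            for audio in root.glob("*.m4a")
            if audio.stem in captions]

pairs = build_pseudo_label_pairs("scraped_youtube")
print(f"{len(pairs)} (audio, caption) training pairs found")

From there you fine-tune whatever acoustic model you like on those pairs.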
-
-
-
-
Wednesday 17th July 2024 07:47 GMT Anonymous Coward
Re: "quantum chromodynamics" to "flat earth."
"they promised that the internet"
They? Using third-person pronouns in the presence of that third person, when they are within earshot, is considered poor form.
We did promise the sum of human knowledge. I personally promised the sum of human knowledge in your hand to my girlfriend in 1999 (talking about the mobile internet here).
And now, we, many on here who have built this modern world... we are working on providing hallucinations to you. LOL. Wait until you hear about self-referential paradoxes creating truths/untruths that expose the limitations of formal logic systems. You are going to love it.
-
-
Wednesday 17th July 2024 08:58 GMT Dan 55
EleutherAI
So this organisation that Apple, Salesforce, and Nvidia invest in says they generate "open source" training data which Apple, Salesforce, and Nvidia just so happen to use. Otherwise known as outsourcing blame and slapping an "open source" label on something so it sounds wholesome.
Anyway, if they scraped data from YouTube's automatically generated subtitles then good luck to them. They're already at the stage of ML training on ML-generated data derived from human-generated data, so gibberish is guaranteed.
-
Wednesday 17th July 2024 10:21 GMT Pascal Monett
"the internet giant puts a lot of effort into thwarting unauthorized scraping"
Oh sure it does. That's why there's already a 5+ GB dataset of scraped data.
Honestly guys, can't you see that you are fooling no one? Your words are worthless because your acts have already spoken for you.
YouTube is your site. You have no excuse not to be able to lock it down, especially when you are continually messing with YouTube downloader addons. Apparently, those things bother you a lot more than subtitle scraping, because those addons are constantly updating to cope with your messing around.
But I get it: a downloader addon cuts out your ads and thus impacts your bottom line, and we can't have that, now can we?
-
Wednesday 17th July 2024 11:56 GMT Howard Sway
The Pile includes data pulled from internal Enron emails
If our AI future is to be based on a mixture of emails from a huge corporate fraud and auto-generated subtitles from YouTube product unboxing videos, that future is going to be very bizarre indeed.
Although if you're wanting to generate a sales campaign for an IoT smart meter that cons millennials out of their cash, it should do the job perfectly.
-
Wednesday 17th July 2024 11:58 GMT vtcodger
Job security ... for some
AI = Job security for whole generations of lawyers. I'm not sure whether that's good or bad. On the positive side, it keeps the wretched creatures from alternate activities. On the other hand, it will likely encourage the breeding of even more of them. I'm not sure how they propagate. Spores, maybe.
-
-
-
Wednesday 17th July 2024 23:49 GMT Bebu
What next, nutrition labels on cartons?
I was thinking that feeding this monstrous technology the various EULAs that the twisted little minds of corporate lawyers have themselves hallucinated might be its Waterloo. ;)
The commentard who promised his girlfriend, in 1999, the sum of all human knowledge in her hand evoked quite a different image in my mind than the nascent mobile internet.
She gasping: "Oh! I was hoping it would be bigger."*
Although the sum of most humans' knowledge could be inscribed on a single grain of rice in blackletter and still leave space for footnotes.
* Why was I thinking of the late Frankie Howerd, I wonder?
-
Thursday 18th July 2024 20:44 GMT Stevie
Bah!
Good luck with that.
In every single yootoob video I've seen with subtitles, those subtitles are riddled with gibberish created by speech-to-text software that is almost, but not quite, good enough for primetime. The hallucination issue will fade into insignificance behind the nonsense-used-as-training-data issue.
Naturally, the "creatives" responsible for the content are too busy (ie bone idle) to proofread their visual word salad - which is a problem I foresee getting well-and-truly out of claw when AI software starts authoring everything from stereo instructions to software modules.
Once humans start automating things they forget the need for sanity checks or won't afford the staff to do them. "Yippee! It's Free" thinking goes all the way down to the Earth's core.
The noise-to-signal ratio will look like a tangent curve plotted over time, making for medical procedure documentation and legal contractese that could have been written by Donald Trump.
-