Re: Quality data
We have been scraping transcriptions from YouTube for about 2 years now. We filter them quite easily because auto-gen doesn't have punctuation, whilst human-edited does.
However, it is nearly 100% certain that auto-gen punctuation will come very soon, and the reason YouTube are holding back on it is that it will be Scrape City when that happens.
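For anyone wondering what the punctuation filter looks like, here is a rough sketch in Python. It is not our actual pipeline: the function names and the `min_density` threshold are made up for illustration, and the only assumption is the one stated above, that auto-generated transcripts carry essentially no sentence punctuation while human-edited ones do.

```python
import re

# Sentence-level punctuation that human editors add.
# Apostrophes are deliberately excluded, since contractions
# like "don't" can show up in auto-generated text too.
PUNCT = re.compile(r"[.,!?;:]")


def punctuation_density(text: str) -> float:
    """Return punctuation marks per whitespace-separated word (0.0 if empty)."""
    words = text.split()
    if not words:
        return 0.0
    return len(PUNCT.findall(text)) / len(words)


def looks_human_edited(text: str, min_density: float = 0.05) -> bool:
    """Guess whether a transcript was human edited.

    min_density is a hypothetical threshold, not a measured one;
    it would need tuning against transcripts of known origin.
    """
    return punctuation_density(text) >= min_density


if __name__ == "__main__":
    auto = "so today we look at how this works and then we move on"
    human = "So, today we look at how this works. Then we move on."
    print(looks_human_edited(auto))   # False: no punctuation at all
    print(looks_human_edited(human))  # True: 3 marks across 12 words
```

A density threshold rather than a plain "any punctuation at all" check is just a guard against stray marks such as a decimal point in a number. The day auto-gen gets punctuation, a filter like this stops working.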
"... generated subtitles would just be a synthetic dataset anyway, so you may as well generate a real synthetic dataset of videos with subtitles instead of a half-real/half-wrong one."
Had to ask Gemini to make a dunce's version to understand it. Not that I don't understand what you are saying, just that it is difficult to extract from the sentence. Even Gemini struggled but agreed with you.
And the reason I pulled this one comment on my day off is that this is the first roundabout mention of synthetic intelligence (SI) on here that I have noticed, and one which, whilst not quite there, does capture the point of unfathomable computing.
Let the machines make their own 'language' - not the abstract processing of data that we currently instruct them to do and then await the reports with just an overview of how they did it.
The more work I do in this area, the more I am reminded of the Improbability Drive.