
Checked the transcripts
Most of it looked good, but the conversations seemed to be suspiciously often about upgrading to Windows 10.
Microsoft on Tuesday said that its researchers have "made a major breakthrough in speech recognition." In a paper [PDF] published a day earlier, Microsoft machine learning researchers describe how they developed an automated system that can recognize recorded speech as well as a professional transcriptionist. Using the NIST …
Jolly good for M$, but I have my doubts about the numbers.
11.3% errors for a human transcriber is appalling! Who did the transcription? Were they using human court stenographers? Audio typists? Hansard transcribers? If any of them had an 11.3% error rate they'd be out on their ear.
I also suspect that in real world situations when people are in a 'formal' situation or know they need to speak clearly then the numbers would be better, both for the humans and the robots.
There are some good use cases where accurate audio transcription by software, ideally in real time, would be great - instant(-ish) translation of phone calls etc., ability to make audio files searchable. But it does need to be accurate - 99.5% and upwards, like decent OCR. Unfortunately it also means that our friends in NSA, GCHQ, the Kremlin etc can also do all these things, which is not so good!
I think 7/10 for effort.
11.3% errors for a human transcriber is appalling!
You can see it in real life, where transcribed text is shown to the public. For example, look at the details at rightmove or zoopla, transcribed by some numpty at an estate agent. Some agents seem to specialise in entertainment value.
Yes, it's actually garbage.
1) The real score on real world stuff will be lower.
2) Any competent Audio typist (that works with the same person) can beat a transcriber (remove source errors).
3) Perhaps they are comparing a real time stenographer? Even so it's a poor score.
Natural language parsing is the limit, it's simply nowhere near good enough to sport decent text to speech.
Dictation transcription (aka Audio typists), transcription not in real time of unknown source, speech/Film/TV/News subtitles in real time, and live stenography / shorthand with later transcription are all different activities. All rely on UNDERSTANDING the meaning as well as basic parsing.
This is shameless marketing.
>"Cortana evidently can benefit from further improvement. Last month, security firm Sophos advised against relying on Cortana for making emergency calls, based on an account of a UK woman who used the software to dial the local police in order to report an accident and was directed to authorities in the US."
What the hell is up with the Register shoe-horning in some tangentially (at best) related final paragraph into its articles?
Cortana transcribed the UK woman's speech perfectly. She said 'Barnstaple', and it transcribed that perfectly. The issue is that it supplied a telephone number for a police force in some USA town called Barnstaple, and not the town in North Devon*. There was an issue with Cortana in this case, but it wasn't in the area of speech recognition.
In any case, in an emergency you call 999 or 112 (works in Europe** and UK) - not the local plod's number. The emergency (999) switchboard are able to roughly triangulate your location if you are not able to describe it.
* If you are passing Barnstaple on the Atlantic Highway, a fifteen minute detour to East West Bakery on Butcher's Row (next to the town's covered market) will get you the finest pasties in the South West. They've won numerous awards, and as a bonus they'll annoy your Cornish mates!
**Please don't take my word for it. Check before you travel. I don't want you to blame me if you crash your car in Moldova and 112 doesn't work.
I have plenty of Scottish friends and relatives who use hold yer wheest to tell people to be quiet (well it sounds more like howl yer wheest but I translate it to hold in my head). I doubt there is just one dialect covering all the highlands and lowlands.
When the TV program Tomorrow's World was on, every few years since the early 1990's they said 'Speech recognition has been unreliable so far but now...'
Whilst it has been improving, it seems to still have a way to go and the improvements appear to be getting smaller. I think it's going to be quite a few years before it is good enough for most people.
I think (seriously) this is more of a milestone than a breakthrough. Speech in English often is ambiguous anyway. This result may be better than before and a three and fourpence of the case for peach cognition, but it isn't an out standing a dance oh for what has all ready bean a chief. :-)
Re: " based on an account of a UK woman who used the software to dial the local police in order to report an accident and was directed to authorities in the US."
To be fair, given the yanks seem to think they have worldwide jurisdiction over everyone and everything, this was probably by design, since "why would you want to talk to any other Police force other than the US Police ?"
Back in the nineties there was a PC gaming magazine called PC Zone. As far as I can remember, the only game they awarded a score of 0% to was called GloboCop: World Police.
Genuinely, I don't know what to make of my inability to find any mention of it online. It might have been a game that enjoyed only very limited release (and PC Zone only reviewed it to take the piss).
!!!! [Just seen on Wikipedia that] Charlie Brooker wrote for PC Zone from 1995. That explains a lot. Shit, that probably explains why I'm on the Reg. Heck. I blame the Dennises (plural of Dennis, not of Denise, sadly) who gave me the first issue. Nathan Barley was a work of prophecy.
I used to be an application trainer at Dictaphone on their speech recognition software. A 1 in 10 or 1 in 20 error rate is still very high for a transcript. We had a product that got rid of the transcriptionist but most doctors don't want to deal with another piece of software and correcting their mistakes and usually what they want goes. So transcriptionists turn into editors for the speech recognition program and most are not laid off.
My wife did medical transcription for quite a while. There's been continuous talk of moving to automation for it, but it runs up against two things: Doctors' voice recordings are frequently as incoherent as their penmanship is illegible, and seemingly for the same reason - they mostly appear to care not a bit how much effort is needed to sort it out.
For random phone conversations 5% or even 10% error rates may be adequate, but for medical purposes where lives are more likely to be at stake (setting aside calls to emergency services), I daresay it'll be a while till they're in the sub-1% realm. Or in other words, we're likely to automate the doctors themselves sooner than we automate the medical transcription the fleshbag versions require.
Thirty years ago, while reading the proof of a speech given at one of the local schools, I encountered "houndsman of a stiller hound". A moment of reflection yielded A.E. Housman's "townsman of a stiller town". The rest of the transcription was on a par with that, but mostly without expressions from which one could derive a better sense. To be fair, it was probably the work of a parent volunteer, rather than a trained stenographer. As I recall, the whole thing was replaced by the speaker's own copy.
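For anyone wondering how figures like the 11.3% quoted above are usually counted: the common measure is word error rate, i.e. the number of word substitutions, deletions and insertions needed to turn the transcript into the reference, divided by the number of words in the reference. The sketch below is only an illustration under that assumption. The function name `word_error_rate` is made up for this example, splitting on whitespace and lowercasing are simplifications, and nothing here is taken from Microsoft's paper or NIST's scoring tools.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Hypothetical helper for illustration: words are split on whitespace
    and compared case-insensitively, which real scoring tools refine.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into nothing
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # The Housman slip from the comment above: 2 wrong words out of 5.
    wer = word_error_rate("townsman of a stiller town",
                          "houndsman of a stiller hound")
    print(f"{wer:.0%}")
```

By this count the Housman line alone scores 40%, which gives some feel for how far off a transcript has to be before it reads like that proof, and how readable an 11.3% one might still be.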
I remember playing with Microsoft's Speech SDK in the early noughties.
SAPI 5.1 I think it was, and it was pretty good at Text-to-Speech, but the Speech Recognition Engine was.. umm... interesting... to work with at that point.
It was however possible to fudge it so that what it heard was recognised - even if it wasn't what you actually said!
Speech recognition still has some way to go. Place names and technical language are still not handled well and I suspect that 'just as good' as a pro transcriber means "about as good as a really poor transcriber".
"We should migrate our lemon flutes to a hard flea."
At my place of business, when someone leaves a voice message, it is automatically transcribed to text and placed in an email. Thankfully it also comes with an attached MP3 file, since after more than two years of getting messages like this (an actual voice mail transcription I received today), I'd say Microsoft's voice recognition has a long way to go (note the proud proclamation at the bottom of the message):
Voice Mail Preview:
Hey it's am cody's on 11:20 on Wednesday I need your help with something regarding this are for deposit I can't seem to find the final results of the thermal now access us hope you can help me track that down cause I mean that information to put into the star could help me out with that would be great my extension is [redacted] thanks.
Created by Microsoft Speech Technology.
"Hey it's am cody's on 11:20 on Wednesday I need your help..."
This is just brilliant. Thanks for sharing!
On the other hand this is also a very sobering account of the current state of speech recognition in noisy environments (telephone line, bandwidth limited). What year is it again? 2016? Oh well.
The type of conversation with error rate greater than 10%--family-based--must be among the more difficult ones to interpret. First, there would be no attempt on the part of the speakers to mask any local accents or dialects. Second, they could be speaking in code. For example, when a person says "uh-huh", are they clearing their throat, or is it a meaningful contribution to the discourse, a token for a paragraph's worth of words? Third, they can refer to people by name, or by nickname, or by relationship or by creative insult. The only conversation that I think could be more challenging, would be between teenage friends.
Some years ago I heard a CBC radio interview of a newspaper reporter who developed an RSI through typing, presumably at a computer terminal. So he switched to speech recognition software, best that money could buy at the time, one would assume (he was working for a top newspaper) but before long developed a vocal RSI, even more debilitating, because the software would not understand him unless he stopped briefly between each word. He took part in the interview only with some difficulty.
A final thought-sac: if they released very good OCR or speech recognition software, punters would reach a stage where they'd rarely be inspired to buy the next version or upgrade. It's a bit like Windows, where they're forever taking "one step forward, two steps back" to make your current User Experience on a par with Windows 2000 (taking into account that faster CPUs and gargantuan RAM should have improved your experience). At this point, one might well ask "so what's the excuse of [alternative family of OSes]?", but I'll put it in a more positive way, that I'm hoping they blow MS Windows out of the water on every level, before long.