This speech recognition code is 'just as good' as a pro transcriber

Microsoft on Tuesday said that its researchers have "made a major breakthrough in speech recognition." In a paper [PDF] published a day earlier, Microsoft machine learning researchers describe how they developed an automated system that can recognize recorded speech as well as a professional transcriptionist. Using the NIST …

  1. Anonymous Coward

    Checked the transcripts

    Most of it looked good, but the conversations seemed to be suspiciously often about upgrading to Windows 10.

    1. Anonymous Coward

      Re: Checked the transcripts

      They must be able to recognise swear words really well by now then.

  2. Anonymous Coward

    The real test will be captioning YouTube videos

    Especially the ones by people narrating in a non 1st language...

    I'd bet the human transcribers would be rather better than AI bots for some time to come...

    1. Charles 9

      Re: The real test will be captioning YouTube videos

      Given the errors I see in those efforts, don't bet the house on that statement.

  3. Black Rat

    Feed it the 'Fork Handles' sketch and let's see how it does ;)

    1. Captain DaFt

      @ Black Rat

      I'd be happy just to see it handle the phrase, "I'll see you in Aisle C, Elsie."

      1. Charles 9

        Re: @ Black Rat

        Oh? How about "Recognize speech that I want to wreck a nice beach."?

  4. Pen-y-gors

    Dodgy numbers?

    Jolly good for M$, but I have my doubts about the numbers.

    11.3% errors for a human transcriber is appalling! Who did the transcription? Were they using human court stenographers? Audio typists? Hansard transcribers? If any of them had an 11.3% error rate they'd be out on their ear.

    I also suspect that in real world situations when people are in a 'formal' situation or know they need to speak clearly then the numbers would be better, both for the humans and the robots.

    There are some good use cases where accurate audio transcription by software, ideally in real time, would be great - instant(-ish) translation of phone calls etc., ability to make audio files searchable. But it does need to be accurate - 99.5% and upwards, like decent OCR. Unfortunately it also means that our friends in NSA, GCHQ, the Kremlin etc can also do all these things, which is not so good!

    I think 7/10 for effort.
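
An aside on the percentages being argued over here: assuming they are word error rates (the usual metric in speech recognition benchmarks), the figure is the number of substituted, deleted and inserted words divided by the number of words in the reference transcript. That is not the same thing as "percentage of words wrong", and it can exceed 100%. The sketch below is only an illustration; the function name `word_error_rate` and the simple lowercase-and-split tokenisation are assumptions made for this example, not anything taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Tokenisation here is a plain lowercase whitespace split, which is
    an assumption for illustration; real scoring tools normalise text
    much more carefully.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Two substitutions plus two insertions against a two-word reference.
    print(word_error_rate("recognize speech", "wreck a nice beach"))
```

On this definition a 99.5% accuracy target corresponds to a word error rate of 0.5%, roughly one wrong word in every 200.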

    1. Nick Kew

      Re: Dodgy numbers?

      11.3% errors for a human transcriber is appalling!

      You can see it in real life, where transcribed text is shown to the public. For example, look at the details at rightmove or zoopla, transcribed by some numpty at an estate agent. Some agents seem to specialise in entertainment value.

    2. Mage

      Re: Dodgy numbers?

      Yes, it's actually garbage.

      1) The real score on real world stuff will be lower.

      2) Any competent Audio typist (that works with the same person) can beat a transcriber (remove source errors).

      3) Perhaps they are comparing a real time stenographer? Even so it's a poor score.

      Natural language parsing is the limit, it's simply nowhere near good enough to sport decent text to speech.

      Dictation transcription (aka Audio typists), transcription not in real time of unknown source, speech/Film/TV/News subtitles in real time, and live stenography / shorthand with later transcription are all different activities. All rely on UNDERSTANDING the meaning as well as basic parsing.

      This is shameless marketing.

      1. Charles 9

        Re: Dodgy numbers?

        What about live transcribing of a live event, like the closed captioning you see during sports events?

  5. Dave 126

    >"Cortana evidently can benefit from further improvement. Last month, security firm Sophos advised against relying on Cortana for making emergency calls, based on an account of a UK woman who used the software to dial the local police in order to report an accident and was directed to authorities in the US."

    What the hell is up with the Register shoe-horning in some tangentially (at best) related final paragraph into its articles?

    Cortana transcribed the UK woman's speech perfectly. She said 'Barnstaple', and it transcribed that perfectly. The issue is that it supplied a telephone number for a police force in some USA town called Barnstaple, and not the town in North Devon*. There was an issue with Cortana in this case, but it wasn't in the area of speech recognition.

    In any case, in an emergency you call 999 or 112 (works in Europe** and UK) - not the local plod's number. The emergency (999) switchboard are able to roughly triangulate your location if you are not able to describe it.

    * If you are passing Barnstaple on the Atlantic Highway, a fifteen minute detour to East West Bakery on Butcher's Row (next to the town's covered market) will get you the finest pasties in the South West. They've won numerous awards, and as a bonus they'll annoy your Cornish mates!

    **Please don't take my word for it. Check before you travel. I don't want you to blame me if you crash your car in Moldova and 112 doesn't work.

  6. Ole Juul

    now run that through an editor

    to get rid of the 20% consisting of um and ah.

  7. Dave 126

    Bit of fun:

    A collection of misheard song lyrics. The site's name, for you philistines, comes from Jimi Hendrix's Purple Haze - 'Scuse me whilst I kiss the sky'.

    1. Nick Ryan

      Re: Bit of fun:

      Thank you and I refuse to look at this site in case it also lists "Bill Odie, Bill Odie, put your hands over my body" and other gems. There's only so much that mind bleach can do.

  8. James 51

    Can it handle Rab in full flow? If not then they should hold their wheesht.

    1. xeroks

      This speech recognition code is 'just as good' as a pro transcriber

      "Can it handle Rab in full flow? If not then they should hold their wheesht."

      Looks like you need to upgrade your transcriber: The word is "haud".

      1. James 51

        Re: This speech recognition code is 'just as good' as a pro transcriber

        I have plenty of Scottish friends and relatives who use hold yer wheest to tell people to be quiet (well it sounds more like howl yer wheest but I translate it to hold in my head). I doubt there is just one dialect covering all the highlands and lowlands.

        1. A K Stiles

          Re: This speech recognition code is 'just as good' as a pro transcriber

          Ach, Awa' an' bile yer heid, ye lummock!

          1. James 51

            Re: This speech recognition code is 'just as good' as a pro transcriber

            See icon :)

      2. agurney

        Re: This speech recognition code is 'just as good' as a pro transcriber

  9. Anonymous Coward

    The clerical tasks of transcription will be automated more and more, but the simple fact remains that for important decisions sign-off by a human remains necessary.

  10. Anonymous Coward

    "Cortana evidently can benefit from further improvement"

    Like uninstalling it

  11. Shady

    Cortana on XBone

    Given the ONLY words my XB1 understands are "Hey Cortana" and then absolutely, steadfastly refuses to recognize *anything* after that (pre-Cortana update had circa 2/3 success rate), I'll take this claim with a chunk of salt.

  12. Anonymous Coward

    Forward planning: that will greatly help ..

    .. with Skype intercepts.

    Or did you really think this technology was developed for your benefit?

    1. Charles 9

      Re: Forward planning: that will greatly help ..

      Then I wish them luck trying to interpret when I call out LCEDIV4A8EPTBK.

  13. Chris Evans

    They've been saying this for the last twenty+ years!

    When the TV program Tomorrow's World was on, every few years since the early 1990s they said 'Speech recognition has been unreliable so far but now...'

    Whilst it has been improving, it seems to still have a way to go and the improvements appear to be getting smaller. I think it's going to be quite a few years before it is good enough for most people.

    1. Robert Carnegie

      Speaker-trained recognition works well with sufficient processing power and RAM - but,

      I think (seriously) this is more of a milestone than a breakthrough. Speech in English often is ambiguous anyway. This result may be better than before and a three and fourpence of the case for peach cognition, but it isn't an out standing a dance oh for what has all ready bean a chief. :-)

  14. Anonymous Coward

    To be fair to the woman calling local police getting the US ...

    Re: " based on an account of a UK woman who used the software to dial the local police in order to report an accident and was directed to authorities in the US."

    To be fair, given the yanks seem to think they have worldwide jurisdiction over everyone and everything, this was probably by design, since "why would you want to talk to any other Police force other than the US Police ?"

    1. Dave 126

      Re: To be fair to the woman calling local police getting the US ...

      Back in the nineties there was a PC gaming magazine called PC Zone. As far as I can remember, the only game they awarded a score of 0% to was called GloboCop: World Police.

      Genuinely, I don't know what to make of my inability to find any mention of it online. It might have been a game that enjoyed only very limited release (and PC Zone only reviewed it to take the piss).

      !!!! [Just seen on Wikipedia that] Charlie Brooker wrote for PC Zone from 1995. That explains a lot. Shit, that probably explains why I'm on the Reg. Heck. I blame the Dennises (plural of Dennis, not of Denise, sadly) who gave me the first issue. Nathan Barley was a work of prophecy.

  15. schlechtj

    No rash of unemployed transcriptionists

    I used to be an application trainer at Dictaphone on their speech recognition software. A 1 in 10 or 1 in 20 error rate is still very high for a transcript. We had a product that got rid of the transcriptionist, but most doctors don't want to deal with another piece of software and correcting their mistakes, and usually what they want goes. So transcriptionists turn into editors for the speech recognition program and most are not laid off.

  16. IglooDude

    I'm doubtful.

    My wife did medical transcription for quite a while. There's been continuous talk of moving to automation for it, but it runs up against two things: Doctors' voice recordings are frequently as incoherent as their penmanship is illegible, and seemingly for the same reason - they mostly appear to care not a bit how much effort is needed to sort it out.

    For random phone conversations 5% or even 10% error rates may be adequate, but for medical purposes where lives are more likely to be at stake (setting aside calls to emergency services), I daresay it'll be a while till they're in the sub-1% realm. Or in other words, we're likely to automate the doctors themselves sooner than we automate the medical transcription the fleshbag versions require.

  17. PapaD

    Dialling 911 in the UK

    If you dial 911 in the UK it's automatically redirected to 999 - this has been the case for a while now because of 'foreign tourists', 'kids who watch too much American tv' and 'idiots'

  18. Hud Dunlap

    Try a true southern accent

    Lewis Grizzard on southern accents.

  19. disgruntled yank


    Thirty years ago, while reading the proof of a speech given at one of the local schools, I encountered "houndsman of a stiller hound". A moment of reflection yielded A.E. Housman's "townsman of a stiller town". The rest of the transcription was on a par with that, but mostly without expressions from which one could derive a better sense. To be fair, it was probably the work of a parent volunteer, rather than a trained stenographer. As I recall, the whole thing was replaced by the speaker's own copy.

  20. Mutton Jeff

    I bet...

    my hovercraft is still full of eels!

  21. Alister

    I remember playing with Microsoft's Speech SDK in the early noughties.

    SAPI 5.1 I think it was, and it was pretty good at Text-to-Speech, but the Speech Recognition Engine was.. umm... interesting... to work with at that point.

    It was however possible to fudge it so that what it heard was recognised - even if it wasn't what you actually said!

  22. Lotaresco


    Speech recognition still has some way to go. Place names and technical language are still not handled well and I suspect that 'just as good' as a pro transcriber means "about as good as a really poor transcriber".

    "We should migrate our lemon flutes to a hard flea."

    Dilbert Monday 5th April 2010

  23. Patched Out

    Ha Ha Ha - No really.

    At my place of business, when someone leaves a voice message, it is automatically transcribed to text and placed in an email. Thankfully it also comes with an attached MP3 file, since after more than two years of getting messages like this (an actual voice mail transcription I received today), I'd say Microsoft's voice recognition has a long way to go (note the proud proclamation at the bottom of the message):

    Voice Mail Preview:

    Hey it's am cody's on 11:20 on Wednesday I need your help with something regarding this are for deposit I can't seem to find the final results of the thermal now access us hope you can help me track that down cause I mean that information to put into the star could help me out with that would be great my extension is [redacted] thanks.

    Created by Microsoft Speech Technology.

    1. Apprentice of Tokenism

      Re: Ha Ha Ha - No really.

      "Hey it's am cody's on 11:20 on Wednesday I need your help..."

      This is just brilliant. Thanks for sharing!

      On the other hand this is also a very sobering account of the current state of speech recognition in noisy environments (telephone line, bandwidth limited). What year is it again? 2016? Oh well.

  24. GrapeBunch

    yo homies sup?

    The type of conversation with error rate greater than 10%--family-based--must be among the more difficult ones to interpret. First, there would be no attempt on the part of the speakers to mask any local accents or dialects. Second, they could be speaking in code. For example, when a person says "uh-huh", are they clearing their throat, or is it a meaningful contribution to the discourse, a token for a paragraph's worth of words? Third, they can refer to people by name, or by nickname, or by relationship or by creative insult. The only conversation that I think could be more challenging, would be between teenage friends.

    Some years ago I heard a CBC radio interview of a newspaper reporter who developed an RSI through typing, presumably at a computer terminal. So he switched to speech recognition software, the best that money could buy at the time, one would assume (he was working for a top newspaper), but before long developed a vocal RSI, even more debilitating, because the software would not understand him unless he stopped briefly between each word. He took part in the interview only with some difficulty.

    A final thought-sac: if they released very good OCR or speech recognition software, punters would reach a stage where they'd rarely be inspired to buy the next version or upgrade. It's a bit like Windows, where they're forever taking "one step forward, two steps back" to make your current User Experience on a par with Windows 2000 (taking into account that faster CPUs and gargantuan RAM should have improved your experience). At this point, one might well ask "so what's the excuse of [alternative family of OSes]?", but I'll put it in a more positive way, that I'm hoping they blow MS Windows out of the water on every level, before long.
