
The WaveNet samples are good - but the enunciation feels too evenly spaced. Possibly a real person would be taking small breaths throughout that would vary the delivery.
Google has figured out how to use artificial intelligence to make robot sounds more human, according to a new paper. Using its “WaveNet” model, Google’s AI company in the UK, DeepMind, claims to have created a natural machine-to-human speech that halves “the gap with human performance." Machine babble often sounds emotionally …
On the other hand, it's still pretty impressive, especially compared to what we normally hear out in the real world from fairly poor TTS systems. I think I could quite comfortably listen to documents or books being read to me by that WaveNet voice.
I've tried converting ebooks to audiobooks for the car and it's listenable, but only just. You have to spend time looking for "odd" words and adding phonetic corrections, especially people's names or place names. Those corrections do build up the library, though, so I suppose if I convert more, then eventually only new names will be a problem. (Yes, it's mainly SF, so often "alien" names which I'm not even sure myself how they ought to be pronounced anyway :-))
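For what it's worth, the fix-up is nothing clever; just a look-up table of phonetic respellings applied to the text before it goes anywhere near the TTS engine. A rough sketch of the idea in Python (the names and respellings here are invented purely for illustration):

import re

# Hypothetical correction library: plain spelling -> phonetic respelling.
# You add entries as you hit "odd" words in each book.
phonetic = {
    "Cthulhu": "kuh-THOO-loo",
    "Leicester": "Lester",
    "Hermione": "her-MY-oh-nee",
}

def fix_pronunciations(text, corrections):
    """Swap known-troublesome words for phonetic respellings before TTS."""
    for word, respelling in corrections.items():
        text = re.sub(r"\b" + re.escape(word) + r"\b", respelling, text)
    return text

chapter = "Hermione drove through Leicester muttering about Cthulhu."
print(fix_pronunciations(chapter, phonetic))
# -> her-MY-oh-nee drove through Lester muttering about kuh-THOO-loo.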
If i woz gonna talk to an AI then i wood much prefer it if it sounded like a Cylon from the original Battlestar Galactica tv show rather than Robert Vaughn's Proteus IV in Demon Seed. An AI should sound like an AI and there should be laws to make these fuckers sound like Daleks and not Peter Mandelson. Its one thing to be talked down two bye a human being, its another to bee talked down too by a fuckin glorified 'Speak n Spell'.
This post has been deleted by its author
"Machine babble often sounds emotionally flat and robotic because it’s difficult to capture the natural nuances of human speech."
Absolute nonsense. The real issue is parsing and then understanding the text so as to decide how to nuance the speech; punctuation is only a weak clue to speed, pauses and pitch.
We have been able to create natural-sounding speech for maybe more than 40 years, IF there is specially marked-up text with meta tags for pitch, loudness, speed and pauses. Ordinary text has to be parsed and understood first.
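To make that concrete: SSML, the W3C Speech Synthesis Markup Language, is exactly that kind of meta-tagged text. A rough sketch of what it looks like (the element names are genuine SSML; the specific values are just made up for the example):

# SSML-style mark-up spells out the prosody a synthesiser would otherwise
# have to infer from plain text. <speak>, <prosody> and <break> are real
# SSML elements; the values below are invented for illustration.
ssml = """
<speak>
  <prosody rate="90%" pitch="+2st">We have been able to create natural speech</prosody>
  <break time="300ms"/>
  <prosody rate="80%" volume="soft">if the text is marked up like this.</prosody>
</speak>
"""

# A plain-text sentence carries none of these hints, so the engine has to
# parse and "understand" it to guess where the pauses and pitch shifts go.
print(ssml)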
Even actors etc, read text better if they are familiar with it and understand it. Otherwise even humans reading unfamiliar text can sound rubbish.
Google Translate is very "successful" in some respects, but it killed computer speech/language/translation development because it isn't "intelligent" at all: it's a brute-force "Rosetta Stone" / "code breaker" approach rather than translation by parsing an entire section, understanding it and paraphrasing.
"Even actors etc, read text better if they are familiar with it and understand it. Otherwise even humans reading unfamiliar text can sound rubbish."
I'm no actor, but I can read an unfamiliar piece and make it enjoyable for a listener. It is a bit of a skill and I am not the best, but I'm quite good: I read ahead of my speech and moderate my tone accordingly. I will make a few mistakes, but the overall effect is pretty good (if I do say so myself!). It's called story-telling. I can even manage to do it on the fly, without a script.
One day Goog et al will get good at this stuff, but not yet. I remember when CGI in films was frankly a bit wank, but nowadays it is getting to the point where it is really hard to spot the seams.
OTOH my router pings 8.8.8.8 and 8.8.4.4 on a regular basis to determine connectivity. It sends tiny ICMP packets out to do this every few seconds. Each packet is perfectly formed and contains a shit load of parameters - src/dst etc. It does this without complaining about its dodgy back, day in, day out. I can't do that.
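The trick itself is simple enough; a rough Python sketch of the idea (the real router firmware obviously does it its own way, and the interval and timeout here are made up):

import subprocess
import time

# Rough sketch: ping Google's public DNS servers every few seconds and
# treat any reply as "the link is up". Flags are the Linux ping options
# for one echo request (-c 1) with a two-second timeout (-W 2).
TARGETS = ["8.8.8.8", "8.8.4.4"]

def link_is_up():
    for host in TARGETS:
        result = subprocess.run(
            ["ping", "-c", "1", "-W", "2", host],
            stdout=subprocess.DEVNULL,
        )
        if result.returncode == 0:
            return True
    return False

while True:
    print("link up" if link_is_up() else "link down")
    time.sleep(5)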
Attempting to make silicon emulate a tiny function that the billions of neurons in my bonce can manage is bloody stupid. My head is also much smaller than any server-class system. They can fart around with quantum stuff if they like, but I suspect that ain't going to go much further than, say, prime factoring. You'll be needing a new technology to do something that makes me feel really inadequate.
"You'll be needing a new technology to do something that makes me feel really inadequate."
Well, it seems like sexbots are already under development.
http://www.theregister.co.uk/2016/09/06/should_humanity_hump_robots_serious_question_feature/
http://forums.theregister.co.uk/forum/1/2016/09/06/should_humanity_hump_robots_serious_question_feature/
Just like if I had an android I wouldn't want it to look indistinguishable from a human. Even if it were possible, I'd want it oddly colored like Data from Star Trek, or that Trump robot running for president, so you know the difference.
I'm sure Google wants machines that sound human, so they can get companies to outsource their call centers to Google's server farms - more income for Google, too bad about all the people who lose their jobs. They could even lie and say you have to wait for the next available operator to take your call, and sell ad space for that downtime instead of playing Muzak. I'll bet if you do a patent search, you'll find they've already patented that!
This post has been deleted by its author
I recently heard some Japanese-like speech samples generated by a deep learning algorithm. In fact, a Japanese colleague played me two samples, one of real Japanese from the training set and one of generated burble. I really couldn't tell which one was real speech. (This may have been partly because the training set was drawn from those weird manga cartoon voices.)
That is really impressive, especially given the use of neural networks, which will give it more room to evolve. As other commenters have pointed out, the rhythms and timing are still a bit off. I guess that's because these things are largely context-sensitive in natural human speech. Still, there are rules that can be followed (and perhaps they already try to) involving the cadences and timings of different types of words following each other (nouns / verbs etc.), along with sentence / paragraph structure and the amount of breath available before needing to pause - something like the toy sketch below.
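A crude illustration of what such rules might look like; all the numbers are invented, purely to show the shape of the idea:

# Toy illustration of rule-based prosody: pick pause lengths from punctuation
# and force a "breath" when too many words have gone by without one.
PAUSE_MS = {",": 150, ";": 250, ":": 250, ".": 400, "?": 400, "!": 400}
WORD_MS = 60           # rough per-word speaking time (made up)
BREATH_BUDGET_MS = 4000

def plan_pauses(text):
    """Return (word, pause_after_ms) pairs for a crude delivery plan."""
    plan, since_breath = [], 0
    for word in text.split():
        pause = PAUSE_MS.get(word[-1], 0)
        since_breath += WORD_MS
        if pause == 0 and since_breath > BREATH_BUDGET_MS:
            pause = 200    # sneak a small breath into a long run of words
        if pause:
            since_breath = 0
        plan.append((word, pause))
    return plan

for word, pause in plan_pauses("Still, there are rules that can be followed."):
    print(f"{word:12s} pause after: {pause} ms")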