Last year we reported on WaveNet, an AI research system developed by Google's DeepMind that could outperform existing text-to-speech systems and even generate realistic, human-sounding voices from scratch. Since then, other companies have built on this technology, and now Baidu says it has made the approach ready for the market.
Yesterday the Chinese company unveiled Deep Voice, a system built on top of Google's work that cuts the heavy computing requirements and needs only a few hours of training. While existing text-to-speech systems like Siri or Cortana rely on hours and hours of pre-recorded voice, Baidu's and Google's systems synthesize human-sounding voices from scratch.
The main limitation of Google's WaveNet was how computationally intensive it was, but Baidu's Deep Voice removes some of that burden from the machine-learning system. How? By using even more machine learning.
Deep Voice works by splitting text into graphemes, the smallest written units, translating those into phonemes, the smallest units of speech, and then rendering that information as sound. Each of these steps is handled by a machine-learning model, and those models must run fast enough to generate audio in real time. The Baidu researchers explain:
To perform inference at real-time, we must take great care to never recompute any results, store the entire model in the processor cache (as opposed to main memory), and optimally utilize the available computational units.
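To make that pipeline concrete, here is a minimal Python sketch of the three stages described above, with each stage standing in for a learned model. The model objects and their predict/synthesize methods are hypothetical placeholders for illustration, not Baidu's actual code.

```python
# Sketch of the pipeline the article describes: text -> graphemes ->
# phonemes -> audio, with each stage handled by its own learned model.
# The g2p_model and synthesis_model objects below are hypothetical
# placeholders, not Baidu's actual API.

from typing import List


def text_to_graphemes(text: str) -> List[str]:
    # Graphemes are the smallest written units; for English, a
    # character-level split is a reasonable stand-in.
    return list(text.lower())


def graphemes_to_phonemes(graphemes: List[str], g2p_model) -> List[str]:
    # A learned grapheme-to-phoneme model predicts the smallest units of speech.
    return g2p_model.predict(graphemes)


def phonemes_to_audio(phonemes: List[str], synthesis_model) -> bytes:
    # A learned synthesis model renders the phoneme sequence as waveform audio.
    return synthesis_model.synthesize(phonemes)


def speak(text: str, g2p_model, synthesis_model) -> bytes:
    """Run the full text-to-speech pipeline."""
    graphemes = text_to_graphemes(text)
    phonemes = graphemes_to_phonemes(graphemes, g2p_model)
    return phonemes_to_audio(phonemes, synthesis_model)
```

Replacing every hand-engineered stage of a traditional pipeline with a learned model like this is what the researchers mean by using even more machine learning.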
In a paper posted to the arXiv pre-print server, Baidu's researchers claim to have cracked that problem, saying their Deep Voice system can synthesize speech faster than real time and up to 400 times faster than some existing implementations.
Unfortunately, Baidu didn’t provide us with any samples to hear for ourselves, but the company does say it believes it now has a production-quality program. This might mean we’ll soon see, or rather hear, Deep Voice implementations in actual products.
Source: arXiv