In October 2016, Microsoft’s speech and dialog research group announced it had achieved human-like speech recognition, reaching a 6.3 percent word error rate (WER) on the Switchboard test. But five months later, in March 2017, IBM reported a 5.5 percent WER and announced it had also determined that human parity sits at an even lower threshold of 5.1 percent.
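For reference, word error rate is the fraction of words a recognizer gets wrong relative to a human reference transcript, counting substitutions, deletions and insertions found by an edit-distance alignment. The short Python sketch below is illustrative only (it is not the benchmark's official scoring tool, which also applies text normalization), but it shows the arithmetic behind the percentages quoted in this article.

```python
# Minimal WER computation: (substitutions + deletions + insertions) / reference words,
# found by aligning the reference and hypothesis transcripts with edit distance.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edit operations to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1   # substitution cost
            d[i][j] = min(d[i - 1][j] + 1,                # deletion
                          d[i][j - 1] + 1,                # insertion
                          d[i - 1][j - 1] + cost)         # match / substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution among six reference words gives a WER of about 16.7 percent.
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```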
Ten months after achieving the 6.3 percent WER mark and five months after IBM set the new threshold for human-like speech recognition, a Microsoft research team finally reached it on August 21st, also beating IBM's latest score.
To achieve the 5.1 percent WER, Microsoft used a series of improvements to its neural-net-based acoustic and language models, including:
- an additional CNN-BLSTM (convolutional neural network combined with bidirectional long short-term memory) model for improved acoustic modeling
- an updated approach to combining predictions from multiple acoustic models, now at both the frame/senone and word levels (see the sketch after this list)
- the update of the recognizer’s language model, now able to use the entire history of a dialog session to predict what is likely to come next, which allows it to adapt to the topic and local context of a conversation
- the use of Microsoft Cognitive Toolkit 2.1 (CNTK), scalable deep-learning software, which allowed the optimization of the models' hyperparameters
- the use of Azure GPUs, which decreased the time needed for training models and testing new ideas.
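As a rough illustration of the second improvement above, the Python sketch below combines predictions from several acoustic models at the frame/senone level by averaging their per-frame posterior distributions before decoding. This is only a sketch of the general technique, not Microsoft's actual implementation; the model names, weights and dimensions are hypothetical.

```python
import numpy as np

def combine_frame_posteriors(model_posteriors, weights=None):
    """model_posteriors: list of arrays, each shaped (num_frames, num_senones).
    weights: optional per-model interpolation weights that sum to 1."""
    stacked = np.stack(model_posteriors)                     # (num_models, T, S)
    if weights is None:
        weights = np.full(len(model_posteriors), 1.0 / len(model_posteriors))
    weights = np.asarray(weights, dtype=float).reshape(-1, 1, 1)
    combined = (weights * stacked).sum(axis=0)               # (T, S)
    # Renormalize so every frame is again a probability distribution over senones.
    return combined / combined.sum(axis=1, keepdims=True)

# Hypothetical example: three acoustic models (e.g. a BLSTM, a ResNet-style CNN
# and the CNN-BLSTM mentioned above), 200 frames, 9000 senones.
rng = np.random.default_rng(0)
posteriors = [rng.dirichlet(np.ones(9000), size=200) for _ in range(3)]
combined = combine_frame_posteriors(posteriors, weights=[0.4, 0.3, 0.3])
print(combined.shape)  # (200, 9000), one averaged distribution per frame
```

Word-level combination, the other level mentioned above, is typically applied after decoding, for example by voting over the word hypotheses produced by each system.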
Finally, Microsoft has acknowledged the work done on speech recognition by research groups in industry and academia, and its importance to the company's progress in the field. It has also highlighted challenges that still need to be addressed, such as reaching human levels of recognition in noisy environments with distant microphones, and recognizing accented speech, or speaking styles and languages for which only limited training data is available.
Source: Microsoft Research