Back in May, when OpenAI announced its new flagship frontier model GPT-4o ("o" for "omni"), the model's audio understanding capabilities were prominently highlighted. GPT-4o can respond to audio inputs with an average latency of 320 milliseconds, which is similar to human response time in a typical conversation. OpenAI also announced that ChatGPT's Voice Mode would leverage GPT-4o's audio capabilities to deliver a seamless voice conversation experience for users.
The OpenAI team wrote the following regarding the voice capabilities of GPT-4o:
"With GPT-4o, we trained a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network. Because GPT-4o is our first model combining all of these modalities, we are still just scratching the surface of exploring what the model can do and its limitations."
In June, OpenAI announced that the advanced Voice Mode, which had been planned for an alpha rollout to a small group of ChatGPT Plus users in late June, would be delayed by a month. OpenAI said it needed more time to improve the model's ability to detect and refuse certain content, and that it was also preparing its infrastructure to scale to millions of users while maintaining real-time responses.
Today, Sam Altman, CEO of OpenAI, confirmed via X that the Voice Mode alpha rollout will start next week for ChatGPT Plus subscribers.
The current Voice Mode in ChatGPT does not feel natural because of significant latencies, averaging 2.8 seconds with GPT-3.5 and 5.4 seconds with GPT-4. The upcoming advanced Voice Mode based on GPT-4o will allow ChatGPT subscribers to hold seamless conversations with little to no perceptible delay.
On a related note, OpenAI today revealed SearchGPT, its much-awaited take on web search. SearchGPT is a prototype for now, offering AI search features that give you fast and timely answers from clear and relevant sources. You can read more about it here.
Source: Sam Altman