Latency is a significant issue for most LLM-related use cases. For scenarios like code suggestions and editing long documents, it directly shapes the user experience. Imagine a user who wants to rewrite the last paragraph of a 2-page document: since only a single paragraph changes, the rewritten document should ideally appear almost instantly. With current LLM APIs, however, the entire document has to be regenerated, which introduces significant latency for users.
OpenAI is now trying to solve this issue with a new developer feature called Predicted Outputs. This feature can be used in cases where most of the output of the LLM is known ahead of time. Tasks like editing documents or refactoring code can be improved using this feature. Predicted Outputs uses speculative decoding to skip over known content, making iterations much faster.
Developers can cut latency significantly by passing the existing content in as their prediction, allowing the model to regenerate the full output much faster.
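Here is a minimal sketch of what that looks like with the OpenAI Python SDK, using the prediction parameter described in OpenAI's documentation (the file name and prompt are illustrative):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# The existing document the user wants to tweak (file name is illustrative)
with open("report.md") as f:
    document = f.read()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": "Rewrite only the last paragraph to be more concise "
                       "and return the complete document:\n\n" + document,
        }
    ],
    # Pass the current content as the prediction; output tokens that match
    # it can be skipped via speculative decoding instead of being generated.
    prediction={"type": "content", "content": document},
)

print(response.choices[0].message.content)
```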
OpenAI tested this feature with some external partners, and the results were hugely positive. For example, according to the Microsoft GitHub team's internal benchmarks, Predicted Outputs in Copilot Workspace workloads led to a 5.8x speedup.
Thank you @openaidevs! We benchmarked this on Copilot Workspace workloads and measured a 5.8x speedup! 🤯 https://t.co/FOCwYJheUc
— Eddie Aftandilian (@eaftandilian) November 4, 2024
Predicted Outputs are really fast. We had a ton of fun working with @openai to help test and improve the API. Sign up for early access to Exponent and try it yourself: https://t.co/eC3XD4F3Iw https://t.co/1jUzMEARCC
— Exponent (@exponent_run) November 4, 2024
Predicted Outputs does come with some limitations for developers. First, it is only supported with the GPT-4o and GPT-4o-mini series of models; the latest o1 models are not supported. In addition, the following API parameters are not supported when using Predicted Outputs:
- n values greater than 1
- logprobs
- presence_penalty greater than 0
- frequency_penalty greater than 0
- audio options
- modalities other than text
- max_completion_tokens
- tools (function calling is not supported)
When providing a prediction, any predicted tokens that are not part of the final completion are still charged at completion token rates; the usage object returned with each response breaks this down, as the sketch below shows. While these limitations exist, the potential benefits of the new Predicted Outputs feature are substantial, paving the way for more responsive and efficient LLM-powered tools.
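For developers who want to keep an eye on that cost, a quick sketch of inspecting the accepted and rejected prediction tokens, continuing the response object from the earlier example and assuming the field names from OpenAI's API reference (verify them against your SDK version):

```python
# Token accounting for the prediction, per OpenAI's Predicted Outputs docs
details = response.usage.completion_tokens_details

# Predicted tokens that matched the final output
print("Accepted prediction tokens:", details.accepted_prediction_tokens)

# Predicted tokens the model did not use -- these are still billed
# at completion token rates
print("Rejected prediction tokens:", details.rejected_prediction_tokens)
```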