Databricks has outlined a new fine-tuning method for large language models that it says can cut costs and development time for businesses. Test-time Adaptive Optimization, or TAO for short, fine-tunes models using unlabeled data, a significant time saver, yet delivers better performance than traditional methods. The cost savings come largely from businesses no longer having to pay people to sit and label data by hand.
The data intelligence company has been testing its new fine-tuning method with Meta’s Llama language models. Across three enterprise benchmarks (FinanceBench, DB Enterprise Arena, BIRD-SQL), it found that TAO-tuned models performed better, sometimes substantially so, and in some cases surpassed OpenAI’s GPT-4o and o3-mini.
The TAO fine-tuning method relies on test-time compute and reinforcement learning to improve models. Anyone who has used AI has experienced test-time compute, perhaps without knowing it: when you give a model a prompt and it generates an output, the computational cost of producing that output is called test-time compute.
Instead of relying on labeled data, the TAO method uses test-time compute to have a model explore plausible responses to a task, then applies reinforcement learning to update the LLM based on an evaluation of those responses. Because the approach is highly automated, no human has to sit there labeling data.
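The loop described above, sample several candidate responses per unlabeled prompt, score them, and keep the best-scoring ones as training signal for the update step, can be sketched roughly as follows. This is a toy illustration only, not Databricks' implementation: the `generate_candidates` and `reward` functions are hypothetical stand-ins for the model's sampler and TAO's evaluation step.

```python
# Toy sketch of a TAO-style loop: explore candidate responses at
# test time, score them, and keep the best as preference data for
# a later reinforcement-learning update. All names are illustrative.

def reward(prompt: str, response: str) -> float:
    """Hypothetical scorer: rewards prompt-relevant, fuller answers."""
    overlap = len(set(prompt.lower().split()) & set(response.lower().split()))
    return overlap + 0.1 * len(response.split())

def generate_candidates(prompt: str, n: int = 4) -> list[str]:
    """Stand-in for sampling n responses from the LLM at test time."""
    templates = [
        f"Answer about {prompt}",
        f"{prompt} explained in detail",
        "Short reply",
        f"Detailed discussion of {prompt} with examples",
    ]
    return templates[:n]

def tao_training_pairs(prompts: list[str]) -> list[tuple[str, str]]:
    """For each unlabeled prompt, keep the highest-reward candidate.

    These (prompt, best response) pairs would feed the RL update
    that adjusts the model's weights; no human labels are needed.
    """
    pairs = []
    for p in prompts:
        candidates = generate_candidates(p)
        best = max(candidates, key=lambda r: reward(p, r))
        pairs.append((p, best))
    return pairs

pairs = tao_training_pairs(["SQL generation", "document QA"])
for prompt, best in pairs:
    print(prompt, "->", best)
```

The key property the sketch captures is that all the supervision comes from the scoring step rather than from human-written labels, which is why the heavy compute happens during training while the final model is unchanged in size and serving cost.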
For cost-conscious businesses wondering about the ongoing expense of using TAO, Databricks says the only compute-intensive part is the initial training; the resulting model costs the same to run as the original.

TAO fine-tuned models were found to be better at document question answering and SQL generation. In the FinanceBench benchmark, models had to answer 7,200 synthetic questions about SEC documents. The TAO-tuned Llama 3.3 70B scored 85.1, compared with 82.7 without fine-tuning and 81.1 with traditional labeled fine-tuning. For comparison, o3-mini, OpenAI’s best-performing model here, scored 82.2.
In the BIRD-SQL benchmark, the TAO-tuned Llama 3.3 70B scored 56.1, compared with 54.9 for the labeled fine-tuned model and 58.1 for GPT-4o. These results show that TAO fine-tuning delivers a notable improvement over labeled fine-tuning, and that it is helping open models close the gap with the better-performing closed models from OpenAI.
Databricks emphasizes that TAO offers ongoing improvement potential: the more a model is used, the more outputs there are to train on in future fine-tuning rounds. That continuous feedback could make future fine-tuned models considerably more capable and useful.
Databricks customers have already started using TAO on Llama in a private preview. Those interested in taking part can apply through a sign-up form.