Microsoft has introduced the next iteration of its lightweight artificial intelligence (AI) model, called Phi-3. The updated family includes the 3.8-billion-parameter Phi-3 Mini, the 7-billion-parameter Phi-3 Small, and the 14-billion-parameter Phi-3 Medium.
This release comes after the Phi-2 model, introduced in December 2023, was surpassed in performance by models such as Meta's Llama 3 family. In the face of increased competition, Microsoft Research has applied newer techniques to its curriculum learning approach.
The new 3.8-billion-parameter model improves on Phi-2 while using significantly fewer resources than larger language models. According to Microsoft's own benchmarks, Phi-3 Mini outperforms Meta's 8-billion-parameter Llama 3 model and rivals OpenAI's GPT-3.5.
Microsoft's researchers summarize the results in their technical report: "We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone.
We also provide some initial parameter-scaling results with a 7B and 14B models trained for 4.8T tokens, called phi-3-small and phi-3-medium, both significantly more capable than phi-3-mini (e.g., respectively 75% and 78% on MMLU, and 8.7 and 8.9 on MT-bench)."
Because of its smaller size, the Phi-3 family is better suited to low-power devices than larger models are. Microsoft Vice President Eric Boyd said (via The Verge) that the new model is capable of advanced natural language processing directly on a smartphone. This makes Phi-3 Mini well-suited for novel applications that require AI assistance anywhere.
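To give a rough sense of why parameter count matters for on-device use, the sketch below estimates how much memory the model weights alone would occupy at a few numeric precisions. It is back-of-the-envelope arithmetic, not a figure from Microsoft: the function and variable names are made up for illustration, the bit widths (16, 8, and 4) are assumed examples, and the estimate ignores activations, the attention cache, and runtime overhead.

```python
# Rough estimate of weight storage for the Phi-3 family.
# Assumption: memory ~= parameter count * bits per parameter / 8.
# Names and bit widths here are illustrative, not from Microsoft.

PHI3_PARAMS = {
    "Phi-3 Mini": 3.8e9,     # parameter counts as reported in the article
    "Phi-3 Small": 7e9,
    "Phi-3 Medium": 14e9,
}


def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate size of the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def footprint_table(bit_widths=(16, 8, 4)):
    """Map each model name to {bit width: estimated weight size in GB}."""
    return {
        name: {bits: round(weight_memory_gb(n, bits), 2) for bits in bit_widths}
        for name, n in PHI3_PARAMS.items()
    }


if __name__ == "__main__":
    for name, row in footprint_table().items():
        sizes = ", ".join(f"{bits}-bit: {gb} GB" for bits, gb in row.items())
        print(f"{name}: {sizes}")
```

Under these assumptions, lowering the precision shrinks the footprint proportionally, which is one reason a model of Phi-3 Mini's size can plausibly fit within a phone's memory while the larger variants are a tighter fit.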
While Phi-3 Mini outperforms competitors in its weight class, it cannot match the breadth of knowledge of massive models trained on internet-scale data. However, Boyd notes that smaller, high-quality models often work better for companies' custom applications, because their internal datasets tend to be limited in scale anyway.