Microsoft announces Phi-4-multimodal and Phi-4-mini small language models

In December 2024, Microsoft introduced Phi-4, a small language model (SLM) with state-of-the-art performance in its class. Today, Microsoft is expanding the Phi-4 family with two new models: Phi-4-multimodal and Phi-4-mini.

The new Phi-4-multimodal model supports speech, vision, and text simultaneously, while Phi-4-mini is focused on text-based tasks.

Phi-4-multimodal is a 5.6B parameter model and Microsoft's first multimodal language model to integrate speech, vision, and text processing into a single, unified architecture. Compared to other state-of-the-art omni models, including Google's Gemini 2.0 Flash and Gemini 2.0 Flash Lite, it achieves better performance on multiple benchmarks, as you can see in the table below.

(Benchmark comparison table; image: Microsoft)

In speech-related tasks, Phi-4-multimodal outperforms specialized speech models like WhisperV3 and SeamlessM4T-v2-Large in both automatic speech recognition (ASR) and speech translation (ST). Microsoft states that this model has achieved the top position on the Hugging Face OpenASR leaderboard with an impressive word error rate of 6.14%.

(Speech benchmark results; image: Microsoft)

In vision-related tasks, Phi-4-multimodal delivers strong performance in mathematics and science reasoning. In common multimodal capabilities, such as document and chart understanding, OCR, and visual science reasoning, the new model matches or exceeds popular models like Gemini-2-Flash-lite-preview and Claude-3.5-Sonnet.
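
For developers curious what that looks like in code, below is a minimal sketch of a vision-plus-text query using the Hugging Face transformers library. The checkpoint name "microsoft/Phi-4-multimodal-instruct" and the <|user|>/<|image_1|> prompt format are assumptions based on the Phi family's conventions, so check the official model card for the exact identifiers and template.

```python
# Minimal sketch: image + text query with Phi-4-multimodal via transformers.
# The model ID, prompt template, and image URL are assumptions/placeholders;
# consult the official Hugging Face model card for the real values.
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-4-multimodal-instruct"  # assumed checkpoint name

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

# Load any test image; this URL is a placeholder.
image = Image.open(requests.get("https://example.com/chart.png", stream=True).raw)

# Phi-style chat prompt with an image placeholder token (assumed format).
prompt = "<|user|><|image_1|>What does this chart show?<|end|><|assistant|>"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=200)

# Decode only the newly generated tokens, skipping the prompt.
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```

The trust_remote_code flag is included because earlier multimodal Phi releases shipped custom processing code; whether Phi-4-multimodal requires it depends on the final packaging.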

Phi-4-mini is a 3.8B parameter model and outperforms several popular larger LLMs in text-based tasks, including reasoning, math, coding, instruction-following, and function-calling.
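
To illustrate the text-only model, here is a similar sketch that runs a chat-style prompt through Phi-4-mini. Again, the checkpoint name "microsoft/Phi-4-mini-instruct" is an assumption; verify it on the model's Hugging Face page.

```python
# Minimal sketch: text generation with Phi-4-mini via transformers.
# The model ID is assumed; check the official Hugging Face listing.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-4-mini-instruct"  # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Chat-style input; apply_chat_template formats it with the model's template.
messages = [
    {"role": "user", "content": "Write a Python function that checks if a number is prime."}
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)

# Print only the newly generated portion of the sequence.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```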

To ensure the security and safety of these new models, Microsoft conducted testing with internal and external security experts, employing strategies crafted by the Microsoft AI Red Team (AIRT). Both Phi-4-mini and Phi-4-multimodal models can be deployed on-device when further optimized with ONNX Runtime for cross-platform availability, making them suitable for low-cost and low-latency scenarios.
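
For the on-device path, the sketch below uses the onnxruntime-genai Python package against a locally downloaded ONNX export of Phi-4-mini. The local path is a placeholder, and the generator API shown here may differ slightly between package versions.

```python
# Minimal sketch: on-device inference with an ONNX-optimized Phi-4-mini
# using the onnxruntime-genai package. The model path is a placeholder,
# and the generator API may vary across package versions.
import onnxruntime_genai as og

model = og.Model("./phi-4-mini-onnx")  # placeholder path to a local ONNX export
tokenizer = og.Tokenizer(model)

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("<|user|>Hello!<|end|><|assistant|>"))

# Generate tokens one at a time until the model signals completion.
while not generator.is_done():
    generator.generate_next_token()

print(tokenizer.decode(generator.get_sequence(0)))
```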

Both Phi-4-multimodal and Phi-4-mini are now available to developers in Azure AI Foundry, on Hugging Face, and in the NVIDIA API Catalog. Developers can consult the technical paper for an outline of recommended model uses and their limitations.

These new Phi-4 models represent significant advancements in efficient AI, bringing powerful multimodal and text-based capabilities to a variety of AI applications.
