Microsoft announces Phi-4-multimodal and Phi-4-mini small language models

In December 2024, Microsoft introduced Phi-4, a small language model (SLM) with state-of-the-art performance in its class. Today, Microsoft is expanding the Phi-4 family with two new models: Phi-4-multimodal and Phi-4-mini.

The new Phi-4-multimodal model supports speech, vision, and text simultaneously, while Phi-4-mini is focused on text-based tasks.

Phi-4-multimodal is a 5.6B parameter model and Microsoft's first multimodal language model to integrate speech, vision, and text processing into a single, unified architecture. Compared to other state-of-the-art omni models, including Google's Gemini 2.0 Flash and Gemini 2.0 Flash Lite, it achieves better performance on multiple benchmarks, as you can see in the table below.

(Benchmark comparison table; image: Microsoft)

In speech-related tasks, Phi-4-multimodal outperforms specialized speech models like WhisperV3 and SeamlessM4T-v2-Large in both automatic speech recognition (ASR) and speech translation (ST). Microsoft states that this model has achieved the top position on the Hugging Face OpenASR leaderboard with an impressive word error rate of 6.14%.

(Speech benchmark results; image: Microsoft)

In vision-related tasks, Phi-4-multimodal delivers strong performance in mathematics and science reasoning. In common multimodal capabilities, such as document and chart understanding, OCR, and visual science reasoning, the new model matches or exceeds popular models like Gemini-2-Flash-lite-preview and Claude-3.5-Sonnet.
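
For developers curious what that looks like in code, below is a minimal sketch of a vision-plus-text query using the Hugging Face transformers library. The checkpoint name "microsoft/Phi-4-multimodal-instruct" and the <|user|>/<|image_1|> prompt format are assumptions based on the Phi family's conventions, so check the official model card for the exact identifiers and template.

```python
# Minimal sketch: image + text query with Phi-4-multimodal via transformers.
# The model ID, prompt template, and image URL are assumptions/placeholders;
# consult the official Hugging Face model card for the real values.
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-4-multimodal-instruct"  # assumed checkpoint name

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

# Load any test image; this URL is a placeholder.
image = Image.open(requests.get("https://example.com/chart.png", stream=True).raw)

# Phi-style chat prompt with an image placeholder token (assumed format).
prompt = "<|user|><|image_1|>What does this chart show?<|end|><|assistant|>"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=200)

# Decode only the newly generated tokens, skipping the prompt.
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```

The trust_remote_code flag is included because earlier multimodal Phi releases shipped custom processing code; whether Phi-4-multimodal requires it depends on the final packaging.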

Phi-4-mini is a 3.8B parameter model and outperforms several popular larger LLMs in text-based tasks, including reasoning, math, coding, instruction-following, and function-calling.
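
To illustrate the text-only model, here is a similar sketch that runs a chat-style prompt through Phi-4-mini. Again, the checkpoint name "microsoft/Phi-4-mini-instruct" is an assumption; verify it on the model's Hugging Face page.

```python
# Minimal sketch: text generation with Phi-4-mini via transformers.
# The model ID is assumed; check the official Hugging Face listing.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-4-mini-instruct"  # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Chat-style input; apply_chat_template formats it with the model's template.
messages = [
    {"role": "user", "content": "Write a Python function that checks if a number is prime."}
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)

# Print only the newly generated portion of the sequence.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```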

To ensure the security and safety of these new models, Microsoft conducted testing with internal and external security experts, employing strategies crafted by the Microsoft AI Red Team (AIRT). Both Phi-4-mini and Phi-4-multimodal models can be deployed on-device when further optimized with ONNX Runtime for cross-platform availability, making them suitable for low-cost and low-latency scenarios.
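
For the on-device path, the sketch below uses the onnxruntime-genai Python package against a locally downloaded ONNX export of Phi-4-mini. The local path is a placeholder, and the generator API shown here may differ slightly between package versions.

```python
# Minimal sketch: on-device inference with an ONNX-optimized Phi-4-mini
# using the onnxruntime-genai package. The model path is a placeholder,
# and the generator API may vary across package versions.
import onnxruntime_genai as og

model = og.Model("./phi-4-mini-onnx")  # placeholder path to a local ONNX export
tokenizer = og.Tokenizer(model)

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("<|user|>Hello!<|end|><|assistant|>"))

# Generate tokens one at a time until the model signals completion.
while not generator.is_done():
    generator.generate_next_token()

print(tokenizer.decode(generator.get_sequence(0)))
```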

Both Phi-4-multimodal and Phi-4-mini are now available to developers in Azure AI Foundry, on Hugging Face, and in the NVIDIA API Catalog. Developers can consult the technical paper for an outline of recommended model uses and their limitations.

These new Phi-4 models represent significant advancements in efficient AI, bringing powerful multimodal and text-based capabilities to a variety of AI applications.
