In April 2024, Microsoft announced the Phi-3 family of small language models (SLMs). The Phi-3 models significantly outperformed models of the same and larger sizes on key benchmarks. In fact, the smallest model, Phi-3-mini, outperforms models twice its size, while Phi-3-small and Phi-3-medium outperform larger models such as GPT-3.5 Turbo.
Recently, Apple's DataComp for Language Models (DCLM) team released a new open-source model called DCLM-7B under the Apple Sample Code License. DCLM-7B is a 7-billion-parameter language model trained on the DCLM-Baseline dataset. To make the model broadly useful for common tasks, including math and coding, Apple combined the 3.8-trillion-token DCLM-Baseline with the StarCoder and ProofPile2 data to arrive at a 4.1-trillion-token training dataset.
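Since the model is distributed through Hugging Face (see the source link below), readers who want to try it can likely load the checkpoint with the standard transformers API. The snippet below is a minimal sketch rather than Apple's official usage example: the repository identifier apple/DCLM-7B and the plain AutoModelForCausalLM loading path are assumptions, and the actual release may require additional dependencies (for example, Apple's open_lm package).

```python
# Minimal sketch for trying DCLM-7B locally.
# Assumptions: the checkpoint is hosted on the Hugging Face Hub as "apple/DCLM-7B"
# and can be loaded with the standard transformers classes; the real release may
# need extra packages beyond transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "apple/DCLM-7B"  # assumed repository name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Systematic data curation improves language models because"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate a short continuation as a quick sanity check.
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```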
Apple created this model to highlight the effectiveness of systematic data curation techniques for improving the performance of language models. The company also published the evaluation results of DCLM-7B along with comparisons to other similarly sized models, which you can see below.
As you can see from the benchmark comparison table above, Microsoft's Phi-3 outperforms Apple's DCLM-7B in all three categories, including MMLU. Surprisingly, Apple didn't mention which specific Phi-3 model was used for this comparison. Based on the MMLU score, we can infer that the figure belongs to Phi-3-mini, a 3.8B language model. It's unclear why Apple compared its 7B model with a 3.8B model from Microsoft; ideally, it should have compared DCLM-7B against Phi-3-small, a 7B model with an impressive MMLU score of 75.6.
The race to develop high-performing small language models is clearly accelerating. While Microsoft's Phi-3 has set a high bar, Apple's DCLM-7B demonstrates the potential of focused data curation for model improvement. It remains to be seen how these small language models will evolve and impact the wider AI landscape.
Source: HuggingFace