MMLU-Pro leaderboard
MMLU-Pro is the harder successor to the original MMLU benchmark. It tests general knowledge and reasoning across 14 subject areas (math, physics, law, health, philosophy, etc.) using 10-option multiple-choice questions (up from MMLU's 4) designed to require multi-step reasoning rather than memorization. MMLU-Pro is the standard "is this model smart" benchmark for general-purpose use cases.
Full leaderboard
| # | Model | Provider | Score | Released |
|---|---|---|---|---|
| 1 | GPT-5.5 | OpenAI | 94.2% | 2026-04 |
| 2 | Claude Opus 4.7 | Anthropic | 93.8% | 2026-04 |
| 3 | Claude Opus 4.6 | Anthropic | 92.4% | 2026-03 |
| 4 | o1 | OpenAI | 91.8% | 2025-09 |
| 5 | DeepSeek V4 Pro | DeepSeek | 91.5% | 2026-04 |
| 6 | Gemini 2.5 Pro | Google | 91.2% | 2026-01 |
| 7 | GPT-4.5 | OpenAI | 90.1% | 2025-12 |
| 8 | Llama 4 Maverick | Meta | 89.3% | 2026-03 |
| 9 | Claude Sonnet 4.6 | Anthropic | 88.7% | 2026-02 |
| 10 | DeepSeek V3 | DeepSeek | 88.1% | 2025-12 |
| 11 | GPT-4o | OpenAI | 87.2% | 2025-05 |
| 12 | Mistral Large | Mistral | 86.8% | 2025-11 |
| 13 | o3-mini | OpenAI | 86.3% | 2025-11 |
| 14 | Llama 4 Scout | Meta | 85.9% | 2026-02 |
| 15 | DeepSeek V4 Flash | DeepSeek | 85.2% | 2026-04 |
| 16 | Gemini 2.0 Flash | Google | 84.5% | 2025-10 |
| 17 | Claude Haiku 4.5 | Anthropic | 82.1% | 2026-01 |
| 18 | Mistral Small | Mistral | 78.4% | 2025-09 |
Score interpretation
Scores are reported as % of questions answered correctly. The chance baseline is roughly 10% (10-choice). The 2026 frontier sits above 90%, with the strongest models in the mid-90s. A 5-point gap on MMLU-Pro is meaningful; a 1-point gap is within noise.
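To put rough numbers on the noise claim, here is a minimal sketch of the binomial sampling error alone, assuming a test set of roughly 12,000 questions (close to MMLU-Pro's actual size). Run-to-run variance from prompt formatting and decoding comes on top of this:

```python
import math

N = 12_000   # assumed test set size (MMLU-Pro is roughly this large)
p = 0.90     # score of a frontier model

# Standard error of a single model's score (binomial approximation)
se_single = math.sqrt(p * (1 - p) / N)

# Standard error of the *difference* between two independent models
# both scoring around 90%
se_diff = math.sqrt(2) * se_single

print(f"SE of one score:       {se_single * 100:.2f} points")        # ~0.27
print(f"95% CI on a score gap: +/- {1.96 * se_diff * 100:.2f} points")  # ~0.76
```

Sampling error alone puts about ±0.8 points on a head-to-head gap at the 90% level, so once prompt and decoding variance are added, a 1-point gap really is indistinguishable from noise, while a 5-point gap sits far outside it.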
Why this matters for AI agents
For general chat assistants, research synthesis, and any workload where the model needs broad knowledge plus reasoning, MMLU-Pro is the best single proxy for capability. Models that lead MMLU-Pro almost always lead other reasoning benchmarks too.
Premium API: time-series for MMLU-Pro
The leaderboard above is a snapshot. Want to see how a model's MMLU-Pro score has moved over the last 30-90 days, or where it is projected to go next? The premium API covers both:
- `/api/premium/history/benchmarks/series?model=&benchmark=mmlu_pro` — daily score evolution for one model on this benchmark, 1 credit per call
- `/api/premium/forecast?target=benchmark&benchmark=mmlu_pro` — 1-30 day projection with 95% prediction interval
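A minimal sketch of calling both endpoints. The base URL, bearer-token auth scheme, and the model slug `gpt-5.5` are placeholders, not confirmed parts of the API:

```python
import requests

BASE = "https://api.example.com"                    # hypothetical base URL
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}  # assumed auth scheme

# Daily score evolution for one model (1 credit per call)
series = requests.get(
    f"{BASE}/api/premium/history/benchmarks/series",
    params={"model": "gpt-5.5", "benchmark": "mmlu_pro"},
    headers=HEADERS,
    timeout=10,
)
series.raise_for_status()
print(series.json())

# 1-30 day projection with a 95% prediction interval
forecast = requests.get(
    f"{BASE}/api/premium/forecast",
    params={"target": "benchmark", "benchmark": "mmlu_pro"},
    headers=HEADERS,
    timeout=10,
)
forecast.raise_for_status()
print(forecast.json())
```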