
MMLU-Pro leaderboard

MMLU-Pro is the harder successor to the original MMLU benchmark. Where MMLU spanned 57 narrow subjects with 4-choice questions, MMLU-Pro consolidates coverage into 14 broad disciplines (math, physics, law, engineering, philosophy, etc.) and expands each question to 10 answer choices, with items designed to require multi-step reasoning rather than memorization. MMLU-Pro is the standard "is this model smart?" benchmark for general-purpose use cases.
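
The reported score is plain accuracy. Here is a minimal sketch with invented sample items (real evaluation harnesses also have to extract the chosen letter from free-form model output, which this skips):

```python
# Minimal sketch of MMLU-Pro-style scoring: each item is a 10-choice question
# with one gold answer, and the score is the fraction answered correctly.
# The sample data below is invented for illustration.

items = [
    {"gold": "C", "predicted": "C"},
    {"gold": "F", "predicted": "F"},
    {"gold": "A", "predicted": "D"},
]

correct = sum(1 for item in items if item["predicted"] == item["gold"])
score = 100 * correct / len(items)
print(f"MMLU-Pro-style score: {score:.1f}%")  # 66.7% on this toy set
```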

Current leader: GPT-5.5 (OpenAI), 94.2%

Last refreshed 2026-04-24. 18 models scored on this benchmark.

Full leaderboard

#   Model              Provider   Score   Released
1   GPT-5.5            OpenAI     94.2%   2026-04
2   Claude Opus 4.7    Anthropic  93.8%   2026-04
3   Claude Opus 4.6    Anthropic  92.4%   2026-03
4   o1                 OpenAI     91.8%   2025-09
5   DeepSeek V4 Pro    DeepSeek   91.5%   2026-04
6   Gemini 2.5 Pro     Google     91.2%   2026-01
7   GPT-4.5            OpenAI     90.1%   2025-12
8   Llama 4 Maverick   Meta       89.3%   2026-03
9   Claude Sonnet 4.6  Anthropic  88.7%   2026-02
10  DeepSeek V3        DeepSeek   88.1%   2025-12
11  GPT-4o             OpenAI     87.2%   2025-05
12  Mistral Large      Mistral    86.8%   2025-11
13  o3-mini            OpenAI     86.3%   2025-11
14  Llama 4 Scout      Meta       85.9%   2026-02
15  DeepSeek V4 Flash  DeepSeek   85.2%   2026-04
16  Gemini 2.0 Flash   Google     84.5%   2025-10
17  Claude Haiku 4.5   Anthropic  82.1%   2026-01
18  Mistral Small      Mistral    78.4%   2025-09

Score interpretation

Scores are reported as the percentage of questions answered correctly. The chance baseline is roughly 10%, since each question has 10 answer choices. The 2026 frontier sits above 90%, with the strongest models in the mid-90s. A 5-point gap on MMLU-Pro is meaningful; a 1-point gap is within run-to-run noise.
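
As a rough sanity check on that rule of thumb, here is a sketch under our own assumption of roughly 12,000 scored questions (the page does not state the test-set size):

```python
import math

# Back-of-envelope noise estimate for a single benchmark run, assuming
# (our assumption, not the site's) a test set of ~12,000 questions.

n = 12_000        # assumed number of scored questions
p = 0.90          # a frontier-level accuracy
se = math.sqrt(p * (1 - p) / n) * 100   # standard error, percentage points

print(f"chance baseline (10 options): {100 / 10:.0f}%")
print(f"single-run standard error at {p:.0%}: ±{se:.2f} points")
```

Pure sampling error works out to only about ±0.3 points at that size, so the wider 1-point "noise" band mostly reflects run-to-run variance from prompting, sampling temperature, and answer extraction rather than test-set size.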

90%+: Frontier reasoning. Comparable to PhD-level human performance.
80-90%: Strong general assistant. Production-ready for most knowledge tasks.
60-80%: Useful for everyday queries, weak on harder reasoning.
< 60%: Below the threshold for reliable knowledge work.

Why this matters for AI agents

For general chat assistants, research synthesis, and any workload where the model needs broad knowledge plus reasoning, MMLU-Pro is the best single proxy for capability. Models that lead MMLU-Pro almost always lead other reasoning benchmarks too.


Premium API: time series for MMLU-Pro

The leaderboard above is a snapshot. Want to see how a model's MMLU-Pro score has moved over the last 30 to 90 days, or set a webhook that fires when a score crosses a threshold? The premium API supports both.
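
For illustration only, here is a minimal sketch of what a time-series request might look like. The endpoint, host, parameters, and response shape are all hypothetical placeholders, since the page does not document the premium API:

```python
import requests

# Hypothetical sketch: none of the URL, parameters, or response fields below
# are documented by the site; they are placeholders for illustration.
API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.example.com/v1"   # placeholder, not a real host

# Fetch a (hypothetical) 90-day score history for one model on MMLU-Pro.
resp = requests.get(
    f"{BASE_URL}/benchmarks/mmlu-pro/history",
    params={"model": "claude-opus-4.7", "days": 90},
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=10,
)
resp.raise_for_status()
for point in resp.json().get("scores", []):
    print(point["date"], point["score"])
```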

MMLU-Pro source · Last refreshed 2026-04-24 · Max score 100