MMLU-Pro leaderboard
MMLU-Pro is the harder successor to the original MMLU benchmark. It tests general knowledge and reasoning across 14 subject areas (math, physics, law, health, philosophy, etc.) using 10-option multiple-choice questions (up from MMLU's 4) designed to require multi-step reasoning rather than memorization. MMLU-Pro is the standard "is this model smart" benchmark for general-purpose use cases.
Full leaderboard
| # | Model | Provider | Score | Released |
|---|---|---|---|---|
| 1 | GPT-5.5 | OpenAI | 94.2% | 2026-04 |
| 2 | Claude Opus 4.7 | Anthropic | 93.8% | 2026-04 |
| 3 | Claude Opus 4.6 | Anthropic | 92.4% | 2026-03 |
| 4 | o1 | OpenAI | 91.8% | 2025-09 |
| 5 | DeepSeek V4 Pro | DeepSeek | 91.5% | 2026-04 |
| 6 | Gemini 2.5 Pro | Google | 91.2% | 2026-01 |
| 7 | GPT-4.5 | OpenAI | 90.1% | 2025-12 |
| 8 | Llama 4 Maverick | Meta | 89.3% | 2026-03 |
| 9 | Claude Sonnet 4.6 | Anthropic | 88.7% | 2026-02 |
| 10 | DeepSeek V3 | DeepSeek | 88.1% | 2025-12 |
| 11 | GPT-4o | OpenAI | 87.2% | 2025-05 |
| 12 | Mistral Large | Mistral | 86.8% | 2025-11 |
| 13 | o3-mini | OpenAI | 86.3% | 2025-11 |
| 14 | Llama 4 Scout | Meta | 85.9% | 2026-02 |
| 15 | DeepSeek V4 Flash | DeepSeek | 85.2% | 2026-04 |
| 16 | Gemini 2.0 Flash | Google | 84.5% | 2025-10 |
| 17 | Claude Haiku 4.5 | Anthropic | 82.1% | 2026-01 |
| 18 | Mistral Small | Mistral | 78.4% | 2025-09 |
Score interpretation
Scores are reported as % of questions answered correctly. The chance baseline is roughly 10% (10-choice). The 2026 frontier sits above 90%, with the strongest models in the mid-90s. A 5-point gap on MMLU-Pro is meaningful; a 1-point gap is within noise.
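To put rough numbers on the noise claim, here is a minimal sketch of the binomial sampling error alone, assuming a test set of roughly 12,000 questions (close to MMLU-Pro's actual size). Run-to-run variance from prompt formatting and decoding comes on top of this:

```python
import math

N = 12_000   # assumed test set size (MMLU-Pro is roughly this large)
p = 0.90     # score of a frontier model

# Standard error of a single model's score (binomial approximation)
se_single = math.sqrt(p * (1 - p) / N)

# Standard error of the *difference* between two independent models
# both scoring around 90%
se_diff = math.sqrt(2) * se_single

print(f"SE of one score:       {se_single * 100:.2f} points")        # ~0.27
print(f"95% CI on a score gap: +/- {1.96 * se_diff * 100:.2f} points")  # ~0.76
```

Sampling error alone puts about ±0.8 points on a head-to-head gap at the 90% level, so once prompt and decoding variance are added, a 1-point gap really is indistinguishable from noise, while a 5-point gap sits far outside it.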
Why this matters for AI agents
For general chat assistants, research synthesis, and any workload where the model needs broad knowledge plus reasoning, MMLU-Pro is the best single proxy for capability. Models that lead MMLU-Pro almost always lead other reasoning benchmarks too.
Premium API: time-series for MMLU-Pro
The leaderboard above is a snapshot. Want to see how a model's MMLU-Pro score has moved over the last 30-90 days, or where it is projected to go next? The premium API covers both:
- `/api/premium/history/benchmarks/series?model=&benchmark=mmlu_pro` — daily score evolution for one model on this benchmark, 1 credit per call
- `/api/premium/forecast?target=benchmark&benchmark=mmlu_pro` — 1-30 day projection with 95% prediction interval
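A minimal sketch of calling both endpoints. The base URL, bearer-token auth scheme, and the model slug `gpt-5.5` are placeholders, not confirmed parts of the API:

```python
import requests

BASE = "https://api.example.com"                    # hypothetical base URL
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}  # assumed auth scheme

# Daily score evolution for one model (1 credit per call)
series = requests.get(
    f"{BASE}/api/premium/history/benchmarks/series",
    params={"model": "gpt-5.5", "benchmark": "mmlu_pro"},
    headers=HEADERS,
    timeout=10,
)
series.raise_for_status()
print(series.json())

# 1-30 day projection with a 95% prediction interval
forecast = requests.get(
    f"{BASE}/api/premium/forecast",
    params={"target": "benchmark", "benchmark": "mmlu_pro"},
    headers=HEADERS,
    timeout=10,
)
forecast.raise_for_status()
print(forecast.json())
```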