LIVE
ANTHROPIC: Opus 4.7 benchmarks published · 2m ago
CLAUDE: OK · 142ms
OPUS 4.7: $15 / $75 per Mtok
CHATGPT: OK · 89ms
HACKER NEWS: Why hasn't AI improved design quality the way it improved dev speed? · 14m ago
MMLU-PRO: leader Opus 4.7 · 88.4
GEMINI: DEGRADED · 312ms
MISTRAL: Mistral Medium 3 released · 6m ago
GPT-4o: $5 / $15 per Mtok
ARXIV: Compositional reasoning in LRMs · 22m ago
BEDROCK: OK · 178ms
GEMINI 2.5: $3.50 / $10.50 per Mtok
THE VERGE: Frontier Model Forum expansion announced · 38m ago
SWE-BENCH: leader Claude Opus 4.7 · 72.1%
MISTRAL: OK · 104ms

GPQA Diamond leaderboard

GPQA Diamond is the hardest subset of GPQA, the Graduate-Level Google-Proof Q&A benchmark of expert-written questions in biology, physics, and chemistry. The 198 questions in the Diamond subset are those that PhD-level domain experts answer correctly but skilled non-experts get wrong even with unrestricted web access. Scoring well on GPQA Diamond therefore requires multi-step scientific reasoning, not just memorization or retrieval.
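The reported figure is plain accuracy over those 198 multiple-choice items, which is how GPQA results are typically reported. A minimal scoring sketch in Python, assuming you already have model predictions and gold answers as parallel lists (the function name and data layout are ours, not part of the benchmark):

def gpqa_diamond_score(predictions: list[str], gold: list[str]) -> float:
    """Accuracy (as a percentage) over the 198-question Diamond subset."""
    assert len(predictions) == len(gold) == 198, "Diamond has 198 questions"
    correct = sum(pred == answer for pred, answer in zip(predictions, gold))
    return 100.0 * correct / len(gold)

# A model that answers 153 of the 198 items correctly scores ~77.3%,
# roughly where the 2026 frontier sits on this page.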

Current leader
GPT-5.5 (OpenAI) · 78.3%

Last refreshed 2026-04-24. 18 models scored on this benchmark.

Full leaderboard

#    Model               Provider    Score   Released
1    GPT-5.5             OpenAI      78.3%   2026-04
2    Claude Opus 4.7     Anthropic   76.5%   2026-04
3    Claude Opus 4.6     Anthropic   74.2%   2026-03
4    DeepSeek V4 Pro     DeepSeek    73.1%   2026-04
5    o1                  OpenAI      72.5%   2025-09
6    Gemini 2.5 Pro      Google      71.9%   2026-01
7    GPT-4.5             OpenAI      68.7%   2025-12
8    Claude Sonnet 4.6   Anthropic   65.8%   2026-02
9    Llama 4 Maverick    Meta        64.1%   2026-03
10   DeepSeek V3         DeepSeek    63.5%   2025-12
11   o3-mini             OpenAI      60.3%   2025-11
12   GPT-4o              OpenAI      59.1%   2025-05
13   DeepSeek V4 Flash   DeepSeek    58.7%   2026-04
14   Mistral Large       Mistral     57.3%   2025-11
15   Llama 4 Scout       Meta        56.2%   2026-02
16   Gemini 2.0 Flash    Google      54.8%   2025-10
17   Claude Haiku 4.5    Anthropic   52.4%   2026-01
18   Mistral Small       Mistral     44.6%   2025-09

Score interpretation

Each question has four answer choices, so random guessing scores ~25%. Expert non-specialists score ~34%, expert specialists ~65%, and the strongest 2026 reasoning models land around 80%. The gap between a model's GPQA Diamond score and its MMLU-Pro score is the gap between "knows facts" and "can reason from facts under pressure."

70%+: Above expert-specialist level; frontier reasoning.
50-70%: Strong scientific reasoning, at or near expert-specialist level.
30-50%: Better than chance, roughly where expert non-specialists land, but weak on multi-step inference.
< 30%: At or near the 25% chance baseline for the benchmark.
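For anyone consuming these scores programmatically, the bands above reduce to a simple threshold check. A minimal sketch; the band labels mirror this page, and the thresholds are a reading guide rather than an official taxonomy:

def interpret_gpqa_diamond(score: float) -> str:
    """Map a GPQA Diamond score (0-100) to the bands used on this page."""
    if score >= 70:
        return "above expert-specialist level; frontier reasoning"
    if score >= 50:
        return "strong scientific reasoning, near expert-specialist level"
    if score >= 30:
        return "better than chance, weak on multi-step inference"
    return "at or near the 25% chance baseline"

# Example: interpret_gpqa_diamond(78.3) -> "above expert-specialist level; frontier reasoning"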

Why this matters for AI agents

GPQA Diamond is the benchmark that separates models that have memorized scientific content from models that can reason scientifically. For research agents, technical writing, and any workload involving multi-step inference over unfamiliar domains, GPQA Diamond predicts capability better than MMLU-Pro.


Premium API: time-series for GPQA Diamond

The leaderboard above is a snapshot. Want to see how a model's GPQA Diamond score has moved over the last 30-90 days, or set a webhook that fires when a score crosses a threshold? The premium API supports both.
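As a rough sketch of what that could look like from Python: the endpoint paths, parameter names, and auth scheme below are illustrative placeholders, not the documented premium API; consult the API reference for the real contract.

import requests  # third-party HTTP client: pip install requests

BASE = "https://api.example.com/v1"                 # placeholder base URL
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}  # placeholder auth scheme

# Hypothetical: pull the last 90 days of GPQA Diamond scores for one model.
history = requests.get(
    f"{BASE}/benchmarks/gpqa-diamond/history",
    params={"model": "claude-opus-4.7", "days": 90},
    headers=HEADERS,
    timeout=10,
).json()

# Hypothetical: register a webhook that fires when any model crosses 80%.
requests.post(
    f"{BASE}/webhooks",
    json={
        "benchmark": "gpqa-diamond",
        "condition": {"metric": "score", "op": ">=", "value": 80.0},
        "url": "https://your-service.example.com/hooks/gpqa",
    },
    headers=HEADERS,
    timeout=10,
)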

GPQA Diamond source · Last refreshed 2026-04-24 · Max score 100