
GPQA Diamond leaderboard

GPQA Diamond is the hardest subset of GPQA (Graduate-Level Google-Proof Q&A), a multiple-choice benchmark of graduate-level questions in biology, physics, and chemistry. The 198 questions in the Diamond subset have been verified by domain experts and are difficult even for PhDs in the relevant field. Scoring well on GPQA Diamond requires multi-step scientific reasoning, not just memorization.

Current leader
Claude Opus 4.8 (Anthropic): 93.6%

Last refreshed 2026-06-11. 19 models scored on this benchmark.

Full leaderboard

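| # | Model | Provider | Score | Released |
| --- | --- | --- | --- | --- |
| 1 | Claude Opus 4.8 | Anthropic | 93.6% | 2026-05 |
| 2 | GPT-5.5 | OpenAI | 78.3% | 2026-04 |
| 3 | Claude Opus 4.7 | Anthropic | 76.5% | 2026-04 |
| 4 | Claude Opus 4.6 | Anthropic | 74.2% | 2026-03 |
| 5 | DeepSeek V4 Pro | DeepSeek | 73.1% | 2026-04 |
| 6 | o1 | OpenAI | 72.5% | 2025-09 |
| 7 | Gemini 2.5 Pro | Google | 71.9% | 2026-01 |
| 8 | GPT-4.5 | OpenAI | 68.7% | 2025-12 |
| 9 | Claude Sonnet 4.6 | Anthropic | 65.8% | 2026-02 |
| 10 | Llama 4 Maverick | Meta | 64.1% | 2026-03 |
| 11 | DeepSeek V3 | DeepSeek | 63.5% | 2025-12 |
| 12 | o3-mini | OpenAI | 60.3% | 2025-11 |
| 13 | GPT-4o | OpenAI | 59.1% | 2025-05 |
| 14 | DeepSeek V4 Flash | DeepSeek | 58.7% | 2026-04 |
| 15 | Mistral Large | Mistral | 57.3% | 2025-11 |
| 16 | Llama 4 Scout | Meta | 56.2% | 2026-02 |
| 17 | Gemini 2.0 Flash | Google | 54.8% | 2025-10 |
| 18 | Claude Haiku 4.5 | Anthropic | 52.4% | 2026-01 |
| 19 | Mistral Small | Mistral | 44.6% | 2025-09 |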
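
For quick comparisons, the leaderboard can be loaded as plain data. The following is a minimal Python sketch, assuming the rows are copied by hand from the table above; the `Entry` type and the helper names `leader_gap` and `best_per_provider` are illustrative and not part of any API this site provides.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    model: str
    provider: str
    score: float   # percent correct on GPQA Diamond
    released: str  # YYYY-MM


# Rows copied from the leaderboard snapshot (last refreshed 2026-06-11).
LEADERBOARD = [
    Entry("Claude Opus 4.8", "Anthropic", 93.6, "2026-05"),
    Entry("GPT-5.5", "OpenAI", 78.3, "2026-04"),
    Entry("Claude Opus 4.7", "Anthropic", 76.5, "2026-04"),
    Entry("Claude Opus 4.6", "Anthropic", 74.2, "2026-03"),
    Entry("DeepSeek V4 Pro", "DeepSeek", 73.1, "2026-04"),
    Entry("o1", "OpenAI", 72.5, "2025-09"),
    Entry("Gemini 2.5 Pro", "Google", 71.9, "2026-01"),
    Entry("GPT-4.5", "OpenAI", 68.7, "2025-12"),
    Entry("Claude Sonnet 4.6", "Anthropic", 65.8, "2026-02"),
    Entry("Llama 4 Maverick", "Meta", 64.1, "2026-03"),
    Entry("DeepSeek V3", "DeepSeek", 63.5, "2025-12"),
    Entry("o3-mini", "OpenAI", 60.3, "2025-11"),
    Entry("GPT-4o", "OpenAI", 59.1, "2025-05"),
    Entry("DeepSeek V4 Flash", "DeepSeek", 58.7, "2026-04"),
    Entry("Mistral Large", "Mistral", 57.3, "2025-11"),
    Entry("Llama 4 Scout", "Meta", 56.2, "2026-02"),
    Entry("Gemini 2.0 Flash", "Google", 54.8, "2025-10"),
    Entry("Claude Haiku 4.5", "Anthropic", 52.4, "2026-01"),
    Entry("Mistral Small", "Mistral", 44.6, "2025-09"),
]


def leader_gap(entries):
    """Percentage-point gap between the top two scores."""
    ranked = sorted(entries, key=lambda e: e.score, reverse=True)
    return round(ranked[0].score - ranked[1].score, 1)


def best_per_provider(entries):
    """Map each provider to its highest-scoring entry."""
    best = {}
    for e in entries:
        if e.provider not in best or e.score > best[e.provider].score:
            best[e.provider] = e
    return best


if __name__ == "__main__":
    print("Leader gap (points):", leader_gap(LEADERBOARD))
    for provider, entry in best_per_provider(LEADERBOARD).items():
        print(f"{provider}: {entry.model} ({entry.score}%)")
```

Keeping the rows as typed tuples rather than raw strings makes it easy to re-sort by release date or filter by provider when the snapshot is refreshed.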

Score interpretation

Each question has four answer choices, so random guessing scores about 25%. Skilled non-specialists score ~34% and expert specialists score ~65%. On this leaderboard, most of the strongest 2026 reasoning models score in the 70s, and the current leader reaches 93.6%. The gap between a model's GPQA Diamond score and its MMLU-Pro score is the gap between "knows facts" and "can reason from facts under pressure."
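
Because a quarter of the scale is reachable by guessing, it can help to rescale a raw score so that chance maps to 0 and a perfect score maps to 1. The sketch below shows one common normalization; it is an illustration rather than how this leaderboard reports scores, and the function name `chance_adjusted` is hypothetical.

```python
CHANCE_PCT = 25.0  # four answer choices, so guessing scores about 25%


def chance_adjusted(score_pct: float, chance_pct: float = CHANCE_PCT) -> float:
    """Rescale a percent score so chance -> 0.0 and a perfect score -> 1.0."""
    if not 0.0 <= score_pct <= 100.0:
        raise ValueError("score_pct must be between 0 and 100")
    return (score_pct - chance_pct) / (100.0 - chance_pct)


if __name__ == "__main__":
    # Reference points quoted in the text above.
    for label, score in [("chance", 25.0), ("non-specialist", 34.0),
                         ("specialist", 65.0), ("leader", 93.6)]:
        print(f"{label}: {chance_adjusted(score):.2f}")
```

On this scale the ~34% non-specialist figure sits only about 0.12 of the way from chance to perfect, while the ~65% specialist figure sits at about 0.53.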

70%+
Above expert-specialist level. Frontier reasoning.
50-70%
Strong scientific reasoning, approaching expert-specialist level (~65%).
30-50%
Better than chance and around non-specialist level (~34%), but weak on multi-step inference.
< 30%
At or near chance baseline for the benchmark.
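
The bands above can be applied mechanically to any score. This is a minimal sketch; the function name `score_band` is illustrative, and assigning a boundary score (exactly 30, 50, or 70) to the higher band is an assumption, since the ranges as written overlap at their edges.

```python
def score_band(score_pct: float) -> str:
    """Return the interpretation band for a GPQA Diamond percent score.

    Assumption: boundary values (30, 50, 70) fall into the higher band.
    """
    if not 0.0 <= score_pct <= 100.0:
        raise ValueError("score_pct must be between 0 and 100")
    if score_pct >= 70.0:
        return "70%+"
    if score_pct >= 50.0:
        return "50-70%"
    if score_pct >= 30.0:
        return "30-50%"
    return "< 30%"


if __name__ == "__main__":
    # Scores taken from the leaderboard: leader, a mid-table model, last place.
    for score in (93.6, 65.8, 44.6):
        print(score, "->", score_band(score))
```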

Why this matters for AI agents

GPQA Diamond is the benchmark that separates models that have memorized scientific content from models that can reason scientifically. For research agents, technical writing, and any workload involving multi-step inference over unfamiliar domains, GPQA Diamond predicts capability better than MMLU-Pro.

Premium API: time-series for GPQA Diamond

The leaderboard above is a snapshot. Want to see how a model's GPQA Diamond score has moved over the last 30 to 90 days, or set a webhook that fires when a score crosses a threshold? The premium API offers both.

GPQA Diamond source · Last refreshed 2026-06-11 · Max score 100