GPQA Diamond leaderboard
GPQA Diamond is the hardest subset of GPQA, the Graduate-Level Google-Proof Q&A benchmark, which covers biology, physics, and chemistry. The 198 questions in the Diamond subset have been verified by domain experts to be difficult even for PhDs in the relevant field. Scoring well on GPQA Diamond requires multi-step scientific reasoning, not just memorization.
Full leaderboard
| # | Model | Provider | Score | Released |
|---|---|---|---|---|
| 1 | GPT-5.5 | OpenAI | 78.3% | 2026-04 |
| 2 | Claude Opus 4.7 | Anthropic | 76.5% | 2026-04 |
| 3 | Claude Opus 4.6 | Anthropic | 74.2% | 2026-03 |
| 4 | DeepSeek V4 Pro | DeepSeek | 73.1% | 2026-04 |
| 5 | o1 | OpenAI | 72.5% | 2025-09 |
| 6 | Gemini 2.5 Pro | Google | 71.9% | 2026-01 |
| 7 | GPT-4.5 | OpenAI | 68.7% | 2025-12 |
| 8 | Claude Sonnet 4.6 | Anthropic | 65.8% | 2026-02 |
| 9 | Llama 4 Maverick | Meta | 64.1% | 2026-03 |
| 10 | DeepSeek V3 | DeepSeek | 63.5% | 2025-12 |
| 11 | o3-mini | OpenAI | 60.3% | 2025-11 |
| 12 | GPT-4o | OpenAI | 59.1% | 2025-05 |
| 13 | DeepSeek V4 Flash | DeepSeek | 58.7% | 2026-04 |
| 14 | Mistral Large | Mistral | 57.3% | 2025-11 |
| 15 | Llama 4 Scout | Meta | 56.2% | 2026-02 |
| 16 | Gemini 2.0 Flash | Google | 54.8% | 2025-10 |
| 17 | Claude Haiku 4.5 | Anthropic | 52.4% | 2026-01 |
| 18 | Mistral Small | Mistral | 44.6% | 2025-09 |
Score interpretation
Every question is 4-choice, so random guessing scores ~25%. Experts answering outside their own field score ~34%, and experts answering within their field score ~65%. The 2026 frontier approaches 80% on the strongest reasoning models (78.3% at the top of this leaderboard). The gap between this and MMLU-Pro is the gap between "knows facts" and "can reason from facts under pressure."
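With only 198 questions, a single question is worth about half a percentage point, so small gaps between neighbouring rows deserve some caution. The sketch below is one rough way to put numbers on that. It assumes each score comes from a single pass over all 198 questions and treats questions as independent trials, neither of which this page confirms. The helper names `standard_error` and `above_chance` are illustrative, not part of any published tooling.

```python
import math

N_QUESTIONS = 198   # size of the Diamond subset, as stated on this page
CHANCE = 0.25       # 4-choice guessing baseline, as stated on this page


def standard_error(score: float, n: int = N_QUESTIONS) -> float:
    """Binomial standard error of an accuracy measured on n questions.

    Assumption: one run, questions treated as independent trials.
    """
    return math.sqrt(score * (1.0 - score) / n)


def above_chance(score: float, chance: float = CHANCE) -> float:
    """Rescale accuracy so that chance maps to 0 and a perfect score to 1."""
    return (score - chance) / (1.0 - chance)


# Top two rows of the leaderboard, expressed as fractions.
for name, score in [("GPT-5.5", 0.783), ("Claude Opus 4.7", 0.765)]:
    half_width = 1.96 * standard_error(score)  # normal-approximation 95% interval
    print(f"{name}: {score:.1%} +/- {half_width:.1%}, "
          f"chance-corrected {above_chance(score):.2f}")
```

Under these assumptions the 95% interval around a score near 78% is roughly plus or minus 6 points, which is wider than the 1.8-point gap between the top two rows.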
Why this matters for AI agents
GPQA Diamond is the benchmark that separates models that have memorized scientific content from models that can reason scientifically. For research agents, technical writing, and any workload involving multi-step inference over unfamiliar domains, GPQA Diamond predicts capability better than MMLU-Pro.
Premium API: time-series for GPQA Diamond
The leaderboard above is a snapshot. Want to see how a model's GPQA Diamond score has moved over the last 30-90 days, or set a webhook that fires when a score crosses a threshold? The premium API has both:
/api/premium/history/benchmarks/series?model=&benchmark=gpqa_diamond - daily score evolution for one model on this benchmark, 1 credit per call

/api/premium/forecast?target=benchmark&benchmark=gpqa_diamond - 1-30 day projection with 95% prediction interval
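As a minimal sketch, the two request URLs could be assembled like this. The paths and query parameter names come from the endpoints listed above. Everything else is an assumption: `BASE_URL` is a placeholder host, `example-model` is a made-up model identifier, and the page does not say how requests are authenticated or what the responses look like, so the sketch stops at building the URL.

```python
from urllib.parse import urlencode

# Hypothetical host: the page lists only the paths, not the domain.
BASE_URL = "https://example.com"


def history_url(model: str, benchmark: str = "gpqa_diamond") -> str:
    """URL for the daily score series of one model on one benchmark."""
    query = urlencode({"model": model, "benchmark": benchmark})
    return f"{BASE_URL}/api/premium/history/benchmarks/series?{query}"


def forecast_url(benchmark: str = "gpqa_diamond") -> str:
    """URL for the 1-30 day projection on one benchmark."""
    query = urlencode({"target": "benchmark", "benchmark": benchmark})
    return f"{BASE_URL}/api/premium/forecast?{query}"


# "example-model" is a placeholder; use the identifier the API expects.
print(history_url("example-model"))
print(forecast_url())
```

Using `urlencode` rather than string concatenation keeps model identifiers containing spaces or punctuation correctly escaped.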