Question 1

What is GPQA Diamond?

Accepted Answer

GPQA Diamond is the hardest subset of GPQA, the Graduate-Level Google-Proof Q&A benchmark. Its 198 multiple-choice questions cover biology, physics, and chemistry and have been verified by domain experts to be difficult even for PhDs in the relevant field. Scoring well on GPQA Diamond requires multi-step scientific reasoning, not just memorization.

Question 2

Which AI model leads the GPQA Diamond leaderboard?

Accepted Answer

As of 2026-04-24, GPT-5.5 from OpenAI leads the GPQA Diamond leaderboard with a score of 78.3%. The full ranked list of 18 models is on this page, updated as we ingest new scores.

Question 3

How is GPQA Diamond scored?

Accepted Answer

The score is accuracy: the percentage of the 198 four-choice questions answered correctly. The chance baseline is therefore 25%, which is what random guessing scores on average. Expert non-specialists score ~34% and expert specialists score ~65%. In 2026, the strongest frontier reasoning models reach ~80%. The gap between a model's GPQA Diamond score and its MMLU-Pro score is the gap between "knows facts" and "can reason from facts under pressure."
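The scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not the leaderboard's actual pipeline: the function names `accuracy` and `standard_error` are hypothetical, and the binomial standard error is an added assumption used only to show how much a score can vary on 198 questions.

```python
import math

N_QUESTIONS = 198  # size of the Diamond subset (from the text above)
N_CHOICES = 4      # answer options per question

# Expected accuracy of uniform random guessing.
CHANCE_BASELINE = 1 / N_CHOICES


def accuracy(predictions, answers):
    """Return the fraction of questions answered correctly (hypothetical helper)."""
    if not answers or len(predictions) != len(answers):
        raise ValueError("predictions and answers must be non-empty and the same length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def standard_error(p, n=N_QUESTIONS):
    """Binomial standard error of an accuracy p measured on n questions.

    Assumption: each question is treated as an independent trial.
    """
    return math.sqrt(p * (1 - p) / n)
```

For a random guesser, `standard_error(CHANCE_BASELINE)` is about 0.031, or roughly 3 percentage points, so a score only a few points above 25% is hard to distinguish from guessing.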

Question 4

Why does GPQA Diamond matter for AI agents?

Accepted Answer

GPQA Diamond is the benchmark that separates models that have memorized scientific content from models that can reason scientifically. For research agents, technical writing, and any workload involving multi-step inference over unfamiliar domains, GPQA Diamond predicts capability better than MMLU-Pro.

Question 5

What does a GPQA Diamond score of 70%+ mean?

Accepted Answer

Above expert-specialist level. Frontier reasoning.

Question 6

What does a GPQA Diamond score of 50-70% mean?

Accepted Answer

Strong scientific reasoning, approaching or matching expert specialists (~65%).

Question 7

What does a GPQA Diamond score of 30-50% mean?

Accepted Answer

Better than chance but weak on multi-step inference. Expert non-specialists (~34%) fall in this range.

Question 8

What does a GPQA Diamond score of < 30% mean?

Accepted Answer

At or near chance baseline for the benchmark.
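The four score bands above can be expressed as a small lookup. This is a sketch under stated assumptions: `interpret_score` is a hypothetical helper, and because the bands as written share their endpoints (50% and 70% each appear in two ranges), the code assumes each boundary value belongs to the higher band.

```python
def interpret_score(score_pct):
    """Map a GPQA Diamond score (in percent) to a (band, description) pair.

    Assumption: a score exactly on a boundary (30, 50, 70) is placed
    in the higher band; the source text does not specify this.
    """
    if not 0 <= score_pct <= 100:
        raise ValueError("score must be a percentage between 0 and 100")
    if score_pct >= 70:
        return ("70%+", "Above expert-specialist level; frontier reasoning")
    if score_pct >= 50:
        return ("50-70%", "Strong scientific reasoning")
    if score_pct >= 30:
        return ("30-50%", "Better than chance, weak on multi-step inference")
    return ("< 30%", "At or near chance baseline")


# Example: band for the current leader's score and for the lowest listed score.
top_band, _ = interpret_score(78.3)
low_band, _ = interpret_score(44.6)
```

With the scores listed on this page, 78.3% lands in the "70%+" band and 44.6% in the "30-50%" band.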

#	Model	Provider	Score	Released
1	GPT-5.5	OpenAI	78.3%	2026-04
2	Claude Opus 4.7	Anthropic	76.5%	2026-04
3	Claude Opus 4.6	Anthropic	74.2%	2026-03
4	DeepSeek V4 Pro	DeepSeek	73.1%	2026-04
5	o1	OpenAI	72.5%	2025-09
6	Gemini 2.5 Pro	Google	71.9%	2026-01
7	GPT-4.5	OpenAI	68.7%	2025-12
8	Claude Sonnet 4.6	Anthropic	65.8%	2026-02
9	Llama 4 Maverick	Meta	64.1%	2026-03
10	DeepSeek V3	DeepSeek	63.5%	2025-12
11	o3-mini	OpenAI	60.3%	2025-11
12	GPT-4o	OpenAI	59.1%	2025-05
13	DeepSeek V4 Flash	DeepSeek	58.7%	2026-04
14	Mistral Large	Mistral	57.3%	2025-11
15	Llama 4 Scout	Meta	56.2%	2026-02
16	Gemini 2.0 Flash	Google	54.8%	2025-10
17	Claude Haiku 4.5	Anthropic	52.4%	2026-01
18	Mistral Small	Mistral	44.6%	2025-09
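Because the leaderboard rows are tab-separated, they can be loaded with Python's standard `csv` module. The sketch below assumes the column layout shown in the table; `load_leaderboard` and `best_per_provider` are hypothetical helper names, and `SAMPLE` copies only the first three rows for illustration.

```python
import csv
import io

# First three rows of the table above, tab-separated with the same header.
SAMPLE = "\n".join([
    "#\tModel\tProvider\tScore\tReleased",
    "1\tGPT-5.5\tOpenAI\t78.3%\t2026-04",
    "2\tClaude Opus 4.7\tAnthropic\t76.5%\t2026-04",
    "3\tClaude Opus 4.6\tAnthropic\t74.2%\t2026-03",
])


def load_leaderboard(text):
    """Parse tab-separated leaderboard text into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = []
    for row in reader:
        rows.append({
            "rank": int(row["#"]),
            "model": row["Model"],
            "provider": row["Provider"],
            "score": float(row["Score"].rstrip("%")),  # "78.3%" -> 78.3
            "released": row["Released"],
        })
    return rows


def best_per_provider(rows):
    """Return the highest-scoring row for each provider."""
    best = {}
    for row in rows:
        current = best.get(row["provider"])
        if current is None or row["score"] > current["score"]:
            best[row["provider"]] = row
    return best


rows = load_leaderboard(SAMPLE)
leaders = best_per_provider(rows)
```

Keeping scores as floats rather than strings makes it straightforward to sort, filter by release month, or compare providers.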
