
MATH leaderboard

The MATH benchmark consists of 12,500 competition-level mathematics problems sourced from AMC, AIME, and Putnam-style competitions. Each problem requires multi-step algebraic, geometric, or combinatorial reasoning, and the answer must match exactly (no partial credit). MATH is one of the toughest standardized math benchmarks for LLMs.
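
To make the scoring rule concrete, here is a minimal exact-match grading sketch. The `normalize` helper and the toy answer pairs are illustrative assumptions; real MATH harnesses also canonicalize LaTeX and numerically equivalent forms, which this sketch deliberately does not.

```python
def normalize(answer: str) -> str:
    """Collapse trivial formatting differences before comparing."""
    return answer.strip().replace(" ", "").lower()

def grade(predicted: str, reference: str) -> bool:
    """Exact match only: a near-miss scores zero, with no partial credit."""
    return normalize(predicted) == normalize(reference)

# Toy run over (prediction, reference) pairs.
pairs = [("42", "42"), ("x = 3", "x=3"), ("7/2", "3.5")]
accuracy = sum(grade(p, r) for p, r in pairs) / len(pairs)
print(f"exact-match accuracy: {accuracy:.1%}")  # 66.7%: "7/2" is not "3.5"
```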

Current leader
GPT-5.5 (OpenAI): 95.8%

Last refreshed 2026-04-24. 18 models scored on this benchmark.

Full leaderboard

 #  Model              Provider   Score   Released
 1  GPT-5.5            OpenAI     95.8%   2026-04
 2  o1                 OpenAI     94.6%   2025-09
 3  Claude Opus 4.7    Anthropic  93.1%   2026-04
 4  DeepSeek V4 Pro    DeepSeek   92.4%   2026-04
 5  Claude Opus 4.6    Anthropic  91.8%   2026-03
 6  Gemini 2.5 Pro     Google     90.5%   2026-01
 7  GPT-4.5            OpenAI     88.2%   2025-12
 8  o3-mini            OpenAI     87.1%   2025-11
 9  Llama 4 Maverick   Meta       86.7%   2026-03
10  DeepSeek V3        DeepSeek   85.9%   2025-12
11  Claude Sonnet 4.6  Anthropic  85.4%   2026-02
12  DeepSeek V4 Flash  DeepSeek   82.1%   2026-04
13  GPT-4o             OpenAI     81.3%   2025-05
14  Mistral Large      Mistral    80.4%   2025-11
15  Llama 4 Scout      Meta       79.8%   2026-02
16  Gemini 2.0 Flash   Google     77.2%   2025-10
17  Claude Haiku 4.5   Anthropic  74.6%   2026-01
18  Mistral Small      Mistral    68.9%   2025-09

Score interpretation

Scores are exact-match accuracy on the test set. As of 2026 the frontier sits in the mid-90s, but variance across problem categories is high: most models handle AMC-level algebra well and do markedly worse on AIME-level combinatorics and proof-style problems.

90%+: Frontier. Solves most competition-level problems.
70-90%: Strong on AMC-level problems; struggles on AIME/Putnam.
40-70%: Useful for routine math but unreliable on multi-step problems.
< 40%: Weak general math; unreliable for quantitative tasks.

Why this matters for AI agents

MATH performance correlates strongly with general multi-step reasoning ability. Models that can carry algebraic state through 5-10 steps on MATH problems tend to be the same ones that can carry argumentative state through long agent workflows. If your agent does any quantitative work, MATH is a useful proxy.
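
One way to operationalize that proxy is to gate quantitative subtasks on a minimum MATH score. The sketch below is a hypothetical routing policy, not a recommendation: the model identifiers and the 90-point threshold are illustrative assumptions, and the scores are copied from the leaderboard above.

```python
# Hypothetical routing sketch: gate quantitative subtasks on a MATH-score
# threshold. Scores mirror the leaderboard above; the model identifiers
# and the policy itself are illustrative assumptions.
MATH_SCORES = {
    "gpt-5.5": 95.8,
    "claude-opus-4.7": 93.1,
    "gemini-2.5-pro": 90.5,
    "claude-haiku-4.5": 74.6,
}

def pick_model(quantitative: bool, threshold: float = 90.0) -> str:
    """Route quantitative work to the strongest model above the threshold."""
    if quantitative:
        eligible = {m: s for m, s in MATH_SCORES.items() if s >= threshold}
        return max(eligible, key=eligible.get)   # highest MATH score wins
    return "claude-haiku-4.5"  # cheaper default for non-quantitative steps

print(pick_model(quantitative=True))   # -> gpt-5.5
```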


Premium API: time series for MATH

The leaderboard above is a snapshot. Want to see how a model's MATH score has moved over the last 30-90 days, or set a webhook that fires when a score crosses a threshold? The premium API supports both.
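
A usage sketch of what those two calls might look like. Everything concrete here (base URL, endpoint paths, parameter names, auth scheme) is assumed for illustration; only the two capabilities, score history and threshold webhooks, come from the description above.

```python
# Hypothetical client sketch for the premium API described above.
# The base URL, paths, parameters, and auth header are assumptions,
# not the real API surface; check the actual API docs.
import requests

BASE = "https://api.example.com/v1"              # placeholder base URL
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# 1) Time series: how a model's MATH score moved over the last 90 days.
history = requests.get(
    f"{BASE}/benchmarks/math/history",           # hypothetical endpoint
    params={"model": "gpt-5.5", "days": 90},
    headers=HEADERS,
    timeout=10,
)
history.raise_for_status()
print(history.json())

# 2) Webhook: fire when a MATH score crosses a threshold.
hook = requests.post(
    f"{BASE}/webhooks",                          # hypothetical endpoint
    json={"benchmark": "math", "threshold": 95.0,
          "url": "https://example.com/math-alert"},
    headers=HEADERS,
    timeout=10,
)
hook.raise_for_status()
```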

MATH source · Last refreshed 2026-04-24 · Max score 100