MATH leaderboard
The MATH benchmark consists of 12,500 competition-level mathematics problems sourced from AMC, AIME, and Putnam-style competitions. Each problem requires multi-step algebraic, geometric, or combinatorial reasoning, and the answer must match exactly (no partial credit). MATH is one of the toughest standardized math benchmarks for LLMs.
Full leaderboard
| # | Model | Provider | Score | Released |
|---|---|---|---|---|
| 1 | GPT-5.5 | OpenAI | 95.8% | 2026-04 |
| 2 | o1 | OpenAI | 94.6% | 2025-09 |
| 3 | Claude Opus 4.7 | Anthropic | 93.1% | 2026-04 |
| 4 | DeepSeek V4 Pro | DeepSeek | 92.4% | 2026-04 |
| 5 | Claude Opus 4.6 | Anthropic | 91.8% | 2026-03 |
| 6 | Gemini 2.5 Pro | Google | 90.5% | 2026-01 |
| 7 | GPT-4.5 | OpenAI | 88.2% | 2025-12 |
| 8 | o3-mini | OpenAI | 87.1% | 2025-11 |
| 9 | Llama 4 Maverick | Meta | 86.7% | 2026-03 |
| 10 | DeepSeek V3 | DeepSeek | 85.9% | 2025-12 |
| 11 | Claude Sonnet 4.6 | Anthropic | 85.4% | 2026-02 |
| 12 | DeepSeek V4 Flash | DeepSeek | 82.1% | 2026-04 |
| 13 | GPT-4o | OpenAI | 81.3% | 2025-05 |
| 14 | Mistral Large | Mistral | 80.4% | 2025-11 |
| 15 | Llama 4 Scout | Meta | 79.8% | 2026-02 |
| 16 | Gemini 2.0 Flash | Google | 77.2% | 2025-10 |
| 17 | Claude Haiku 4.5 | Anthropic | 74.6% | 2026-01 |
| 18 | Mistral Small | Mistral | 68.9% | 2025-09 |
Score interpretation
Scores are exact-match accuracy on the test set. As of 2026 the frontier sits in the mid-90s, but variance across problem categories is high: most models do well on AMC-level algebra and noticeably worse on AIME-level combinatorics and proof-style problems.
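To make the grading rule concrete, here is a minimal sketch of exact-match scoring, assuming a crude `normalize()` step that strips `\boxed{}` wrappers and whitespace. Real evaluation harnesses differ in their normalization details, so treat this as illustrative rather than the official grader.

```python
# Minimal sketch of exact-match scoring with no partial credit.
# The normalization rules below are an assumption: actual MATH graders
# strip LaTeX wrappers and equivalent forms in provider-specific ways.

def normalize(answer: str) -> str:
    """Crude canonicalization of a final answer string."""
    answer = answer.strip()
    if answer.startswith(r"\boxed{") and answer.endswith("}"):
        answer = answer[len(r"\boxed{"):-1]
    return answer.replace(" ", "")

def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of problems where normalized answers match exactly.

    A near-miss such as '1/2' vs '0.5' counts as wrong unless the
    normalizer maps both to the same canonical form.
    """
    correct = sum(
        normalize(p) == normalize(r) for p, r in zip(predictions, references)
    )
    return correct / len(references)

# Example: 2 of 3 answers match after normalization -> ~66.7%
print(exact_match_accuracy([r"\boxed{42}", "1/2", "x+1"], ["42", "0.5", "x+1"]))
```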
Why this matters for AI agents
MATH performance correlates strongly with multi-step reasoning capability in general. Models that can carry algebraic state through 5-10 steps on MATH problems tend to be the same models that can carry argumentative state through long agent workflows. If your agent does any quantitative work, MATH is a useful proxy.
Premium API: time-series for MATH
The leaderboard above is a snapshot. Want to see how a model's MATH score has moved over the last 30-90 days, or set a webhook that fires when a score crosses a threshold? The premium API has both:
- `/api/premium/history/benchmarks/series?model=&benchmark=math`: daily score evolution for one model on this benchmark, 1 credit per call
- `/api/premium/forecast?target=benchmark&benchmark=math`: 1-30 day projection with 95% prediction interval
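If you want to try the endpoints programmatically, the sketch below shows one way to call them with Python's `requests` library. The base URL, the bearer-token auth header, and the `gpt-5.5` model identifier are assumptions for illustration; only the paths and query parameters come from the listing above.

```python
# Hedged sketch of calling the two premium endpoints listed above.
# Assumptions: the API host, bearer-token auth, and JSON responses are
# not confirmed by this page; the paths and query parameters are.
import requests

BASE_URL = "https://example-leaderboard.com"        # assumed host
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}  # assumed auth scheme

# Daily score evolution for one model on MATH (1 credit per call).
history = requests.get(
    f"{BASE_URL}/api/premium/history/benchmarks/series",
    params={"model": "gpt-5.5", "benchmark": "math"},  # model id is illustrative
    headers=HEADERS,
    timeout=30,
)
history.raise_for_status()
print(history.json())

# 1-30 day projection with a 95% prediction interval.
forecast = requests.get(
    f"{BASE_URL}/api/premium/forecast",
    params={"target": "benchmark", "benchmark": "math"},
    headers=HEADERS,
    timeout=30,
)
forecast.raise_for_status()
print(forecast.json())
```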