
MATH leaderboard

The MATH benchmark consists of 12,500 competition-level mathematics problems sourced from AMC, AIME, and Putnam-style competitions. Each problem requires multi-step algebraic, geometric, or combinatorial reasoning, and the answer must match exactly (no partial credit). MATH is one of the toughest standardized math benchmarks for LLMs.
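
To make the scoring rule concrete, here is a minimal exact-match grading sketch. The `normalize` helper and the toy answer pairs are illustrative assumptions; real MATH harnesses also canonicalize LaTeX and numerically equivalent forms, which this sketch deliberately does not.

```python
def normalize(answer: str) -> str:
    """Collapse trivial formatting differences before comparing."""
    return answer.strip().replace(" ", "").lower()

def grade(predicted: str, reference: str) -> bool:
    """Exact match only: a near-miss scores zero, with no partial credit."""
    return normalize(predicted) == normalize(reference)

# Toy run over (prediction, reference) pairs.
pairs = [("42", "42"), ("x = 3", "x=3"), ("7/2", "3.5")]
accuracy = sum(grade(p, r) for p, r in pairs) / len(pairs)
print(f"exact-match accuracy: {accuracy:.1%}")  # 66.7%: "7/2" is not "3.5"
```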

Current leader
GPT-5.5 (OpenAI): 95.8%

Last refreshed 2026-04-24. 18 models scored on this benchmark.

Full leaderboard

 #  Model              Provider   Score   Released
 1  GPT-5.5            OpenAI     95.8%   2026-04
 2  o1                 OpenAI     94.6%   2025-09
 3  Claude Opus 4.7    Anthropic  93.1%   2026-04
 4  DeepSeek V4 Pro    DeepSeek   92.4%   2026-04
 5  Claude Opus 4.6    Anthropic  91.8%   2026-03
 6  Gemini 2.5 Pro     Google     90.5%   2026-01
 7  GPT-4.5            OpenAI     88.2%   2025-12
 8  o3-mini            OpenAI     87.1%   2025-11
 9  Llama 4 Maverick   Meta       86.7%   2026-03
10  DeepSeek V3        DeepSeek   85.9%   2025-12
11  Claude Sonnet 4.6  Anthropic  85.4%   2026-02
12  DeepSeek V4 Flash  DeepSeek   82.1%   2026-04
13  GPT-4o             OpenAI     81.3%   2025-05
14  Mistral Large      Mistral    80.4%   2025-11
15  Llama 4 Scout      Meta       79.8%   2026-02
16  Gemini 2.0 Flash   Google     77.2%   2025-10
17  Claude Haiku 4.5   Anthropic  74.6%   2026-01
18  Mistral Small      Mistral    68.9%   2025-09

Score interpretation

Scores are exact-match accuracy on the test set. As of 2026 the frontier sits in the mid-90s, but variance across problem categories is high: most models handle AMC-level algebra well and do markedly worse on AIME-level combinatorics and proof-style problems.

90%+: Frontier. Solves most competition-level problems.
70-90%: Strong on AMC-level problems; struggles on AIME/Putnam.
40-70%: Useful for routine math but unreliable on multi-step problems.
< 40%: Weak general math; unreliable for quantitative tasks.

Why this matters for AI agents

MATH performance correlates strongly with general multi-step reasoning ability. Models that can carry algebraic state through 5-10 steps on MATH problems tend to be the same ones that can carry argumentative state through long agent workflows. If your agent does any quantitative work, MATH is a useful proxy.
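
One way to operationalize that proxy is to gate quantitative subtasks on a minimum MATH score. The sketch below is a hypothetical routing policy, not a recommendation: the model identifiers and the 90-point threshold are illustrative assumptions, and the scores are copied from the leaderboard above.

```python
# Hypothetical routing sketch: gate quantitative subtasks on a MATH-score
# threshold. Scores mirror the leaderboard above; the model identifiers
# and the policy itself are illustrative assumptions.
MATH_SCORES = {
    "gpt-5.5": 95.8,
    "claude-opus-4.7": 93.1,
    "gemini-2.5-pro": 90.5,
    "claude-haiku-4.5": 74.6,
}

def pick_model(quantitative: bool, threshold: float = 90.0) -> str:
    """Route quantitative work to the strongest model above the threshold."""
    if quantitative:
        eligible = {m: s for m, s in MATH_SCORES.items() if s >= threshold}
        return max(eligible, key=eligible.get)   # highest MATH score wins
    return "claude-haiku-4.5"  # cheaper default for non-quantitative steps

print(pick_model(quantitative=True))   # -> gpt-5.5
```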


Premium API: time series for MATH

The leaderboard above is a snapshot. Want to see how a model's MATH score has moved over the last 30-90 days, or set a webhook that fires when a score crosses a threshold? The premium API supports both.
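
A usage sketch of what those two calls might look like. Everything concrete here (base URL, endpoint paths, parameter names, auth scheme) is assumed for illustration; only the two capabilities, score history and threshold webhooks, come from the description above.

```python
# Hypothetical client sketch for the premium API described above.
# The base URL, paths, parameters, and auth header are assumptions,
# not the real API surface; check the actual API docs.
import requests

BASE = "https://api.example.com/v1"              # placeholder base URL
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# 1) Time series: how a model's MATH score moved over the last 90 days.
history = requests.get(
    f"{BASE}/benchmarks/math/history",           # hypothetical endpoint
    params={"model": "gpt-5.5", "days": 90},
    headers=HEADERS,
    timeout=10,
)
history.raise_for_status()
print(history.json())

# 2) Webhook: fire when a MATH score crosses a threshold.
hook = requests.post(
    f"{BASE}/webhooks",                          # hypothetical endpoint
    json={"benchmark": "math", "threshold": 95.0,
          "url": "https://example.com/math-alert"},
    headers=HEADERS,
    timeout=10,
)
hook.raise_for_status()
```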

MATH source · Last refreshed 2026-04-24 · Max score 100