
HumanEval leaderboard

HumanEval is OpenAI's original code generation benchmark: 164 hand-written Python programming problems, each with a function signature, docstring, and unit tests. The model must produce a function body that passes all the tests. HumanEval is the simplest, most-cited code benchmark and remains a useful capability floor.
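
To make the task format concrete, here is a made-up problem in the HumanEval style (an illustration, not an actual benchmark item). The model sees only the signature and docstring and must produce a body that passes the held-out unit tests.

def count_vowels(word: str) -> int:
    """Return the number of vowels (a, e, i, o, u) in word, case-insensitively."""
    # Everything below the docstring is what the model is asked to generate.
    return sum(1 for ch in word.lower() if ch in "aeiou")

# Held-out unit tests of the kind the harness runs against the generated body.
assert count_vowels("Benchmark") == 2
assert count_vowels("HUMANEVAL") == 4
assert count_vowels("xyz") == 0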

Current leader
GPT-5.5 (OpenAI): 97.1%

Last refreshed 2026-04-24. 18 models scored on this benchmark.

Full leaderboard

#    Model               Provider    Score    Released
1    GPT-5.5             OpenAI      97.1%    2026-04
2    Claude Opus 4.7     Anthropic   96.2%    2026-04
3    Claude Opus 4.6     Anthropic   95.1%    2026-03
4    DeepSeek V4 Pro     DeepSeek    94.8%    2026-04
5    o1                  OpenAI      94.2%    2025-09
6    Gemini 2.5 Pro      Google      93.8%    2026-01
7    GPT-4.5             OpenAI      93.4%    2025-12
8    Claude Sonnet 4.6   Anthropic   92.0%    2026-02
9    Llama 4 Maverick    Meta        91.7%    2026-03
10   DeepSeek V3         DeepSeek    91.2%    2025-12
11   GPT-4o              OpenAI      90.2%    2025-05
12   o3-mini             OpenAI      89.7%    2025-11
13   DeepSeek V4 Flash   DeepSeek    89.4%    2026-04
14   Mistral Large       Mistral     89.1%    2025-11
15   Llama 4 Scout       Meta        88.4%    2026-02
16   Gemini 2.0 Flash    Google      87.6%    2025-10
17   Claude Haiku 4.5    Anthropic   86.3%    2026-01
18   Mistral Small       Mistral     82.5%    2025-09

Score interpretation

Scores are pass@1: the percentage of problems where the model's first attempt passes all tests. The 2026 frontier is above 95%, which means the benchmark is approaching saturation; a 1-point gap at the top is within noise, and the more meaningful signal is now SWE-bench.

95%+: saturation; the model essentially solves the benchmark.
85-95%: strong code generation across common patterns.
70-85%: useful for assisted coding, but makes more mistakes.
< 70%: not recommended for production code work.
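
For a single sample per problem, pass@1 is simply the fraction of problems whose generated solution passes its tests; with several samples per problem, the usual approach is the unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021). A minimal sketch in Python:

import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem with n samples, c of which are correct.

    Equivalent to 1 - C(n - c, k) / C(n, k), written in a numerically stable form.
    """
    if n - c < k:
        return 1.0
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# With one sample per problem, pass@1 reduces to pass/fail.
assert pass_at_k(n=1, c=1, k=1) == 1.0
assert pass_at_k(n=1, c=0, k=1) == 0.0

# The reported benchmark score is the mean of the per-problem estimates.
per_problem = [pass_at_k(n=10, c=c, k=1) for c in (10, 9, 7, 0)]
print(sum(per_problem) / len(per_problem))  # mean pass@1 across four problems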

Why this matters for AI agents

HumanEval is a fast, cheap proxy for "can the model generate correct Python from a docstring." It is no longer a frontier-level differentiator (most strong models score above 90%), but it is still the easiest sanity check for whether a model is even in the conversation for code work.
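
One straightforward way to run that sanity check is OpenAI's open-source human-eval harness (github.com/openai/human-eval). A rough sketch, assuming generate_completion is your own hypothetical wrapper around the model under test and that the package layout matches the current repo:

from human_eval.data import read_problems, write_jsonl

def generate_completion(prompt: str) -> str:
    # Hypothetical: call the model under test and return only the function body.
    raise NotImplementedError

problems = read_problems()  # the 164 tasks, keyed by task_id
samples = [
    {"task_id": task_id, "completion": generate_completion(problems[task_id]["prompt"])}
    for task_id in problems
]
write_jsonl("samples.jsonl", samples)

# Scoring runs the unit tests in sandboxed subprocesses:
#   evaluate_functional_correctness samples.jsonl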


Premium API: time-series for HumanEval

The leaderboard above is a snapshot. Want to see how a model's HumanEval score has moved over the last 30-90 days, or set a webhook that fires when a score crosses a threshold? The premium API offers both.
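
A hypothetical client sketch; the endpoint path, query parameters, and response shape below are illustrative assumptions, not documented API details:

import json
import urllib.request

# Hypothetical endpoint and parameters, purely for illustration.
URL = ("https://api.example.com/v1/benchmarks/humaneval/timeseries"
       "?model=gpt-5.5&days=90")

req = urllib.request.Request(URL, headers={"Authorization": "Bearer YOUR_API_KEY"})
with urllib.request.urlopen(req) as resp:
    points = json.load(resp)  # assumed shape: [{"date": "...", "score": ...}, ...]

for point in points:
    print(point["date"], point["score"])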

HumanEval source · Last refreshed 2026-04-24 · Max score 100