HumanEval leaderboard
HumanEval is OpenAI's original code generation benchmark: 164 hand-written Python programming problems, each with a function signature, docstring, and unit tests. The model must produce a function body that passes all the tests. HumanEval is the simplest, most-cited code benchmark and remains a useful capability floor.
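For a concrete picture of the task format, the sketch below is modeled on the widely quoted first problem in the set; the exact benchmark wording differs, and the completion and tests shown here are illustrative:

```python
# Prompt as presented to the model: signature plus docstring, body omitted.
def has_close_elements(numbers: list[float], threshold: float) -> bool:
    """Return True if any two numbers in the list are closer to each other
    than the given threshold."""
    # --- everything below is what the model must generate ---
    for i, a in enumerate(numbers):
        for b in numbers[i + 1:]:
            if abs(a - b) < threshold:
                return True
    return False

# The harness then runs hidden unit tests against the completion, e.g.:
assert has_close_elements([1.0, 2.0, 3.9, 4.0, 5.0], 0.3) is True
assert has_close_elements([1.0, 2.0, 3.0], 0.5) is False
```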
Full leaderboard
| # | Model | Provider | Score | Released |
|---|---|---|---|---|
| 1 | GPT-5.5 | OpenAI | 97.1% | 2026-04 |
| 2 | Claude Opus 4.7 | Anthropic | 96.2% | 2026-04 |
| 3 | Claude Opus 4.6 | Anthropic | 95.1% | 2026-03 |
| 4 | DeepSeek V4 Pro | DeepSeek | 94.8% | 2026-04 |
| 5 | o1 | OpenAI | 94.2% | 2025-09 |
| 6 | Gemini 2.5 Pro | Google | 93.8% | 2026-01 |
| 7 | GPT-4.5 | OpenAI | 93.4% | 2025-12 |
| 8 | Claude Sonnet 4.6 | Anthropic | 92.0% | 2026-02 |
| 9 | Llama 4 Maverick | Meta | 91.7% | 2026-03 |
| 10 | DeepSeek V3 | DeepSeek | 91.2% | 2025-12 |
| 11 | GPT-4o | OpenAI | 90.2% | 2025-05 |
| 12 | o3-mini | OpenAI | 89.7% | 2025-11 |
| 13 | DeepSeek V4 Flash | DeepSeek | 89.4% | 2026-04 |
| 14 | Mistral Large | Mistral | 89.1% | 2025-11 |
| 15 | Llama 4 Scout | Meta | 88.4% | 2026-02 |
| 16 | Gemini 2.0 Flash | Google | 87.6% | 2025-10 |
| 17 | Claude Haiku 4.5 | Anthropic | 86.3% | 2026-01 |
| 18 | Mistral Small | Mistral | 82.5% | 2025-09 |
Score interpretation
Scores are pass@1: percentage of problems where the model's first attempt passes all tests. The 2026 frontier is above 95%, which means the benchmark is approaching saturation. A 1-point gap at the top is within noise; the more meaningful signal is now SWE-bench.
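For reference, pass@k is usually computed with the unbiased estimator from the original Codex paper (Chen et al., 2021); with a single sample per problem, pass@1 reduces to the plain fraction of problems solved on the first try. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021):
    n = completions sampled per problem, c = completions that passed,
    k = evaluation budget (k = 1 for the leaderboard above)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one sample per problem, pass@1 is just whether that attempt passed:
print(pass_at_k(1, 1, 1))  # 1.0
print(pass_at_k(1, 0, 1))  # 0.0
# Averaging this over the 164 problems gives the scores in the table.
```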
Why this matters for AI agents
HumanEval is a fast, cheap proxy for "can the model generate correct Python from a docstring." It is no longer a frontier-level differentiator (most strong models score above 90%) but it is still the easiest sanity check for whether a model is even in the conversation for code work.
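One common way to run that sanity check is OpenAI's open-source human-eval harness. The sketch below assumes that package (pip install human-eval); generate_one_completion is a placeholder for whatever model call you want to benchmark, not part of the harness:

```python
# Sketch of the sanity check using OpenAI's open-source human-eval harness.
# generate_one_completion is a placeholder you supply; it should return
# only the function body for the given prompt.
from human_eval.data import read_problems, write_jsonl

def generate_one_completion(prompt: str) -> str:
    raise NotImplementedError("call the model under test here")

problems = read_problems()  # dict of the 164 tasks, keyed by task_id
samples = [
    {"task_id": task_id,
     "completion": generate_one_completion(problems[task_id]["prompt"])}
    for task_id in problems
]
write_jsonl("samples.jsonl", samples)

# Score with the CLI bundled in the package:
#   evaluate_functional_correctness samples.jsonl
```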
Premium API: time-series for HumanEval
The leaderboard above is a snapshot. Want to see how a model's HumanEval score has moved over the last 30-90 days, or set a webhook that fires when a score crosses a threshold? The premium API has both:
- /api/premium/history/benchmarks/series?model=&benchmark=human_eval — daily score evolution for one model on this benchmark, 1 credit per call
- /api/premium/forecast?target=benchmark&benchmark=human_eval — 1-30 day projection with 95% prediction interval
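A call to the history endpoint might look like the sketch below. Only the path and query parameters come from the listing above; the host, authentication header, model slug, and response shape are assumptions:

```python
import requests

# Assumptions: BASE_URL, the auth header, and the JSON shape are illustrative;
# the endpoint path and query parameters come from the listing above.
BASE_URL = "https://api.example.com"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

resp = requests.get(
    f"{BASE_URL}/api/premium/history/benchmarks/series",
    params={"model": "gpt-4o", "benchmark": "human_eval"},  # model slug is a guess
    headers=HEADERS,
    timeout=30,
)
resp.raise_for_status()
for point in resp.json().get("points", []):  # assumed daily {date, score} records
    print(point["date"], point["score"])
```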