HumanEval leaderboard
HumanEval is OpenAI's original code generation benchmark: 164 hand-written Python programming problems, each with a function signature, docstring, and unit tests. The model must produce a function body that passes all the tests. HumanEval is the simplest, most-cited code benchmark and remains a useful capability floor.
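For a concrete picture of the task format, the sketch below is modeled on the widely quoted first problem in the set; the exact benchmark wording differs, and the completion and tests shown here are illustrative:

```python
# Prompt as presented to the model: signature plus docstring, body omitted.
def has_close_elements(numbers: list[float], threshold: float) -> bool:
    """Return True if any two numbers in the list are closer to each other
    than the given threshold."""
    # --- everything below is what the model must generate ---
    for i, a in enumerate(numbers):
        for b in numbers[i + 1:]:
            if abs(a - b) < threshold:
                return True
    return False

# The harness then runs hidden unit tests against the completion, e.g.:
assert has_close_elements([1.0, 2.0, 3.9, 4.0, 5.0], 0.3) is True
assert has_close_elements([1.0, 2.0, 3.0], 0.5) is False
```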
Full leaderboard
| # | Model | Provider | Score | Released |
|---|---|---|---|---|
| 1 | GPT-5.5 | OpenAI | 97.1% | 2026-04 |
| 2 | Claude Opus 4.7 | Anthropic | 96.2% | 2026-04 |
| 3 | Claude Opus 4.6 | Anthropic | 95.1% | 2026-03 |
| 4 | DeepSeek V4 Pro | DeepSeek | 94.8% | 2026-04 |
| 5 | o1 | OpenAI | 94.2% | 2025-09 |
| 6 | Gemini 2.5 Pro | Google | 93.8% | 2026-01 |
| 7 | GPT-4.5 | OpenAI | 93.4% | 2025-12 |
| 8 | Claude Sonnet 4.6 | Anthropic | 92.0% | 2026-02 |
| 9 | Llama 4 Maverick | Meta | 91.7% | 2026-03 |
| 10 | DeepSeek V3 | DeepSeek | 91.2% | 2025-12 |
| 11 | GPT-4o | OpenAI | 90.2% | 2025-05 |
| 12 | o3-mini | OpenAI | 89.7% | 2025-11 |
| 13 | DeepSeek V4 Flash | DeepSeek | 89.4% | 2026-04 |
| 14 | Mistral Large | Mistral | 89.1% | 2025-11 |
| 15 | Llama 4 Scout | Meta | 88.4% | 2026-02 |
| 16 | Gemini 2.0 Flash | Google | 87.6% | 2025-10 |
| 17 | Claude Haiku 4.5 | Anthropic | 86.3% | 2026-01 |
| 18 | Mistral Small | Mistral | 82.5% | 2025-09 |
Score interpretation
Scores are pass@1: percentage of problems where the model's first attempt passes all tests. The 2026 frontier is above 95%, which means the benchmark is approaching saturation. A 1-point gap at the top is within noise; the more meaningful signal is now SWE-bench.
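For reference, pass@k is usually computed with the unbiased estimator from the original Codex paper (Chen et al., 2021); with a single sample per problem, pass@1 reduces to the plain fraction of problems solved on the first try. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021):
    n = completions sampled per problem, c = completions that passed,
    k = evaluation budget (k = 1 for the leaderboard above)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one sample per problem, pass@1 is just whether that attempt passed:
print(pass_at_k(1, 1, 1))  # 1.0
print(pass_at_k(1, 0, 1))  # 0.0
# Averaging this over the 164 problems gives the scores in the table.
```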
Why this matters for AI agents
HumanEval is a fast, cheap proxy for "can the model generate correct Python from a docstring." It is no longer a frontier-level differentiator (most strong models score above 90%) but it is still the easiest sanity check for whether a model is even in the conversation for code work.
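One common way to run that sanity check is OpenAI's open-source human-eval harness. The sketch below assumes that package (pip install human-eval); generate_one_completion is a placeholder for whatever model call you want to benchmark, not part of the harness:

```python
# Sketch of the sanity check using OpenAI's open-source human-eval harness.
# generate_one_completion is a placeholder you supply; it should return
# only the function body for the given prompt.
from human_eval.data import read_problems, write_jsonl

def generate_one_completion(prompt: str) -> str:
    raise NotImplementedError("call the model under test here")

problems = read_problems()  # dict of the 164 tasks, keyed by task_id
samples = [
    {"task_id": task_id,
     "completion": generate_one_completion(problems[task_id]["prompt"])}
    for task_id in problems
]
write_jsonl("samples.jsonl", samples)

# Score with the CLI bundled in the package:
#   evaluate_functional_correctness samples.jsonl
```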
Premium API: time-series for HumanEval
The leaderboard above is a snapshot. Want to see how a model's HumanEval score has moved over the last 30-90 days, or set a webhook that fires when a score crosses a threshold? The premium API has both:
- /api/premium/history/benchmarks/series?model=&benchmark=human_eval — daily score evolution for one model on this benchmark, 1 credit per call
- /api/premium/forecast?target=benchmark&benchmark=human_eval — 1-30 day projection with 95% prediction interval
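A call to the history endpoint might look like the sketch below. Only the path and query parameters come from the listing above; the host, authentication header, model slug, and response shape are assumptions:

```python
import requests

# Assumptions: BASE_URL, the auth header, and the JSON shape are illustrative;
# the endpoint path and query parameters come from the listing above.
BASE_URL = "https://api.example.com"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

resp = requests.get(
    f"{BASE_URL}/api/premium/history/benchmarks/series",
    params={"model": "gpt-4o", "benchmark": "human_eval"},  # model slug is a guess
    headers=HEADERS,
    timeout=30,
)
resp.raise_for_status()
for point in resp.json().get("points", []):  # assumed daily {date, score} records
    print(point["date"], point["score"])
```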