SWE-bench leaderboard
SWE-bench evaluates language models on their ability to resolve real GitHub issues from popular Python repositories. The model is given an issue description and the repository state, and must produce a patch that resolves the issue and passes the project's existing test suite. SWE-bench is the benchmark that most closely tracks "useful for autonomous coding agents": the tasks are not toy problems, the success criterion is the project's actual tests, and the inputs are full repositories, forcing the model to reason over real-world code at scale.
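For concreteness, here is a minimal sketch of the resolve-and-verify loop. It is illustrative only, not the official harness: the real evaluation pins dependencies, runs in isolated containers, and checks specific fail-to-pass tests rather than a single test command.

```python
import subprocess

def patch_resolves_issue(repo_dir: str, patch: str, test_cmd: list[str]) -> bool:
    """Apply a model-generated patch, then run the project's test suite.

    Illustrative sketch: repo_dir, patch, and test_cmd stand in for the
    checked-out repository state, the model's diff, and the project's
    test invocation (e.g. ["pytest", "-x"]).
    """
    # Apply the model's patch to the repository state it was given.
    apply = subprocess.run(
        ["git", "apply", "-"], input=patch, text=True, cwd=repo_dir
    )
    if apply.returncode != 0:
        return False  # the patch did not even apply cleanly

    # The issue counts as resolved only if the project's tests pass.
    tests = subprocess.run(test_cmd, cwd=repo_dir)
    return tests.returncode == 0
```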
Full leaderboard
| # | Model | Provider | Score | Released |
|---|---|---|---|---|
| 1 | GPT-5.5 | OpenAI | 68.7% | 2026-04 |
| 2 | Claude Opus 4.7 | Anthropic | 65.4% | 2026-04 |
| 3 | DeepSeek V4 Pro | DeepSeek | 63.8% | 2026-04 |
| 4 | Claude Opus 4.6 | Anthropic | 62.3% | 2026-03 |
| 5 | Gemini 2.5 Pro | Google | 59.4% | 2026-01 |
| 6 | o1 | OpenAI | 58.9% | 2025-09 |
| 7 | GPT-4.5 | OpenAI | 56.1% | 2025-12 |
| 8 | Claude Sonnet 4.6 | Anthropic | 55.7% | 2026-02 |
| 9 | Llama 4 Maverick | Meta | 52.8% | 2026-03 |
| 10 | DeepSeek V3 | DeepSeek | 51.4% | 2025-12 |
| 11 | o3-mini | OpenAI | 49.3% | 2025-11 |
| 12 | DeepSeek V4 Flash | DeepSeek | 48.9% | 2026-04 |
| 13 | GPT-4o | OpenAI | 48.5% | 2025-05 |
| 14 | Mistral Large | Mistral | 46.2% | 2025-11 |
| 15 | Llama 4 Scout | Meta | 44.6% | 2026-02 |
| 16 | Gemini 2.0 Flash | Google | 43.1% | 2025-10 |
| 17 | Claude Haiku 4.5 | Anthropic | 41.2% | 2026-01 |
| 18 | Mistral Small | Mistral | 34.7% | 2025-09 |
Score interpretation
Scores are reported as resolution rate (% of issues correctly patched). The headline number on TensorFeed is the SWE-bench Verified subset: the human-validated tasks (500 in the original release) where the test suite has been confirmed to be a fair signal. As of 2026, anything above 60% represents a genuinely useful coding agent; the very top of the leaderboard currently sits just under 70%.
Why this matters for AI agents
If you are building a coding agent, this is the benchmark that matters most. Models with high SWE-bench scores produce patches that compile, pass tests, and respect existing patterns in the codebase. Models with low SWE-bench scores produce code that looks plausible but breaks the build.
Premium API: time-series for SWE-bench
The leaderboard above is a snapshot. Want to see how a model's SWE-bench score has moved over the last 30-90 days, or set a webhook that fires when a score crosses a threshold? The premium API has you covered:
- `/api/premium/history/benchmarks/series?model=&benchmark=swe_bench`: daily score evolution for one model on this benchmark, 1 credit per call
- `/api/premium/forecast?target=benchmark&benchmark=swe_bench`: 1-30 day projection with 95% prediction interval
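A minimal client sketch for the series endpoint. The path and query parameters come from the listing above; the base URL, the bearer-token auth header, the model slug, and the response shape are assumptions, so check them against the API docs before relying on this.

```python
import requests

BASE_URL = "https://api.tensorfeed.example"  # assumed base URL
API_KEY = "YOUR_PREMIUM_KEY"                 # assumed bearer-token auth

def swe_bench_series(model: str) -> list[dict]:
    """Fetch daily SWE-bench score history for one model (1 credit per call)."""
    resp = requests.get(
        f"{BASE_URL}/api/premium/history/benchmarks/series",
        params={"model": model, "benchmark": "swe_bench"},
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=10,
    )
    resp.raise_for_status()
    # Assumed response shape: [{"date": "2026-04-01", "score": 65.4}, ...]
    return resp.json()

# Hypothetical model slug for illustration:
series = swe_bench_series("claude-opus-4.7")
```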