SWE-bench leaderboard

SWE-bench evaluates language models on their ability to resolve real GitHub issues from popular Python repositories. The model is given an issue description and the repository state, and must produce a patch that resolves the issue and passes the project's existing test suite. SWE-bench is the benchmark that most closely tracks "useful for autonomous coding agents" because the tasks are not toy problems, the success criterion is the project's actual tests, and the input footprint forces the model to reason over real-world code at scale.
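
The pass/fail rule described above can be sketched in a few lines. This is a minimal illustration, not the official harness: it assumes each task lists the tests the fix should turn from failing to passing and the tests that must keep passing, and the names `TaskInstance` and `is_resolved` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskInstance:
    """Hypothetical container for one benchmark task (not the harness's own type)."""
    fail_to_pass: frozenset[str]  # tests the patch should turn from failing to passing
    pass_to_pass: frozenset[str]  # tests that passed before and must still pass


def is_resolved(task: TaskInstance, passed_tests: set[str]) -> bool:
    """A task counts as resolved only if every required test passed after patching."""
    required = task.fail_to_pass | task.pass_to_pass
    return required <= passed_tests  # subset check: nothing required is missing


task = TaskInstance(
    fail_to_pass=frozenset({"test_issue_fixed"}),
    pass_to_pass=frozenset({"test_existing_a", "test_existing_b"}),
)

# The patch fixes the issue and breaks nothing.
good_run = {"test_issue_fixed", "test_existing_a", "test_existing_b"}
# The patch fixes the issue but breaks an existing test.
regressed_run = {"test_issue_fixed", "test_existing_a"}

print(is_resolved(task, good_run))       # True
print(is_resolved(task, regressed_run))  # False
```

The subset check makes the criterion all-or-nothing: a patch that fixes the issue but regresses one existing test does not count.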

Current leader
Claude Fable 5 (Anthropic): 95%

Last refreshed 2026-06-11. 17 models scored on this benchmark.

Full leaderboard

# | Model | Provider | Score | Released
1 | Claude Fable 5 | Anthropic | 95% | 2026-06
2 | Claude Opus 4.8 | Anthropic | 88.6% | 2026-05
3 | Claude Opus 4.7 | Anthropic | 87.6% | 2026-04
4 | GPT-5.5 | OpenAI | 82.6% | 2026-04
5 | Claude Opus 4.6 | Anthropic | 80.8% | 2026-03
6 | DeepSeek V4 Pro | DeepSeek | 80.6% | 2026-04
7 | Claude Sonnet 4.6 | Anthropic | 79.6% | 2026-02
8 | DeepSeek V4 Flash | DeepSeek | 79% | 2026-04
9 | Claude Haiku 4.5 | Anthropic | 73.3% | 2026-01
10 | Gemini 2.5 Pro | Google | 63.8% | 2026-01
11 | o3-mini | OpenAI | 49.3% | 2025-11
12 | o1 | OpenAI | 48.9% | 2025-09
13 | Mistral Large | Mistral | 47.2% | 2025-11
14 | DeepSeek V3 | DeepSeek | 42% | 2025-12
15 | GPT-4.5 | OpenAI | 38% | 2025-12
16 | GPT-4o | OpenAI | 33.2% | 2025-05
17 | Llama 4 Maverick | Meta | 24% | 2026-03

Score interpretation

Scores are reported as resolution rate (the percentage of issues correctly patched). The headline number on TensorFeed is for the SWE-bench Verified subset, the human-validated tasks where the test suite has been confirmed to be a fair signal. Anything above 60% as of 2026 represents a genuinely useful coding agent; the top entry on the leaderboard above stands at 95%.
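
As a worked example of the resolution rate, here is a minimal sketch that assumes each task yields a single resolved/unresolved outcome. The function name `resolution_rate` and the sample outcomes are illustrative, not taken from the benchmark tooling.

```python
def resolution_rate(outcomes: list[bool]) -> float:
    """Return the percentage of tasks whose patch resolved the issue."""
    if not outcomes:
        raise ValueError("no task outcomes to score")
    # True counts as 1 and False as 0, so sum() is the number of resolved tasks.
    return 100.0 * sum(outcomes) / len(outcomes)


# Illustrative run: 3 of 4 tasks resolved.
outcomes = [True, True, True, False]
rate = resolution_rate(outcomes)
print(f"{rate:.1f}%")  # 75.0%
```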

70%+: Frontier-class. Genuinely useful coding agent territory.
50-70%: Production-ready for assisted coding workflows.
30-50%: Useful for narrow tasks but not autonomous agents.
< 30%: Plausible-looking code that often does not work.
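
The bands above can be turned into a small lookup. This is a sketch under one stated assumption: the bands share their boundary values (50% and 70% each appear in two ranges), so the code assigns a boundary score to the higher band. The name `score_tier` is hypothetical.

```python
def score_tier(score: float) -> str:
    """Map a resolution rate (0-100) to the interpretation bands listed above.

    Assumption: a score exactly on a boundary (30, 50, 70) goes to the higher band.
    """
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    if score >= 70:
        return "Frontier-class"
    if score >= 50:
        return "Production-ready for assisted coding workflows"
    if score >= 30:
        return "Useful for narrow tasks but not autonomous agents"
    return "Plausible-looking code that often does not work"


# Scores taken from the leaderboard rows for ranks 1, 10, and 17.
for score in (95, 63.8, 24):
    print(score, "->", score_tier(score))
```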

Why this matters for AI agents

If you are building a coding agent, this is the benchmark that matters most. Models with high SWE-bench scores produce patches that compile, pass tests, and respect existing patterns in the codebase. Models with low SWE-bench scores produce code that looks plausible but breaks the build.

Premium API: time-series for SWE-bench

The leaderboard above is a snapshot. Want to see how a model's SWE-bench score has moved over the last 30-90 days, or set a webhook that fires when a score crosses a threshold? The premium API has both.
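
The premium API itself is not documented on this page, so the following is only a local sketch of the threshold-crossing idea, with no API calls. It assumes a score history is available as (date, score) pairs; the function name `upward_crossings` and the sample history are made up for illustration and are not real leaderboard data.

```python
from datetime import date


def upward_crossings(series: list[tuple[date, float]], threshold: float) -> list[date]:
    """Return the dates on which the score moved from below the threshold to at or above it."""
    crossings = []
    # Compare each observation with the one immediately before it.
    for (_, previous), (day, current) in zip(series, series[1:]):
        if previous < threshold <= current:
            crossings.append(day)
    return crossings


# Made-up history for illustration only.
history = [
    (date(2026, 1, 1), 68.0),
    (date(2026, 2, 1), 69.5),
    (date(2026, 3, 1), 71.0),
    (date(2026, 4, 1), 70.2),
]

crossed_on = upward_crossings(history, threshold=70.0)
print(crossed_on)  # [datetime.date(2026, 3, 1)]
```

Comparing consecutive observations, rather than checking only the latest value, means an alert would fire once at the moment of crossing instead of on every poll while the score stays above the threshold.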

SWE-bench source · Last refreshed 2026-06-11 · Max score 100