SWE-bench leaderboard
SWE-bench evaluates language models on their ability to resolve real GitHub issues from popular Python repositories. The model is given an issue description and the repository state, and must produce a patch that resolves the issue and passes the project's existing test suite. SWE-bench is the benchmark that most closely tracks "useful for autonomous coding agents": the tasks are not toy problems, the success criterion is the project's actual tests, and the inputs are full repositories, forcing the model to reason over real-world code at scale.
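For concreteness, here is a minimal sketch of the resolve-and-verify loop. It is illustrative only, not the official harness: the real evaluation pins dependencies, runs in isolated containers, and checks specific fail-to-pass tests rather than a single test command.

```python
import subprocess

def patch_resolves_issue(repo_dir: str, patch: str, test_cmd: list[str]) -> bool:
    """Apply a model-generated patch, then run the project's test suite.

    Illustrative sketch: repo_dir, patch, and test_cmd stand in for the
    checked-out repository state, the model's diff, and the project's
    test invocation (e.g. ["pytest", "-x"]).
    """
    # Apply the model's patch to the repository state it was given.
    apply = subprocess.run(
        ["git", "apply", "-"], input=patch, text=True, cwd=repo_dir
    )
    if apply.returncode != 0:
        return False  # the patch did not even apply cleanly

    # The issue counts as resolved only if the project's tests pass.
    tests = subprocess.run(test_cmd, cwd=repo_dir)
    return tests.returncode == 0
```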
Full leaderboard
| # | Model | Provider | Score | Released |
|---|---|---|---|---|
| 1 | GPT-5.5 | OpenAI | 68.7% | 2026-04 |
| 2 | Claude Opus 4.7 | Anthropic | 65.4% | 2026-04 |
| 3 | DeepSeek V4 Pro | DeepSeek | 63.8% | 2026-04 |
| 4 | Claude Opus 4.6 | Anthropic | 62.3% | 2026-03 |
| 5 | Gemini 2.5 Pro | Google | 59.4% | 2026-01 |
| 6 | o1 | OpenAI | 58.9% | 2025-09 |
| 7 | GPT-4.5 | OpenAI | 56.1% | 2025-12 |
| 8 | Claude Sonnet 4.6 | Anthropic | 55.7% | 2026-02 |
| 9 | Llama 4 Maverick | Meta | 52.8% | 2026-03 |
| 10 | DeepSeek V3 | DeepSeek | 51.4% | 2025-12 |
| 11 | o3-mini | OpenAI | 49.3% | 2025-11 |
| 12 | DeepSeek V4 Flash | DeepSeek | 48.9% | 2026-04 |
| 13 | GPT-4o | OpenAI | 48.5% | 2025-05 |
| 14 | Mistral Large | Mistral | 46.2% | 2025-11 |
| 15 | Llama 4 Scout | Meta | 44.6% | 2026-02 |
| 16 | Gemini 2.0 Flash | Google | 43.1% | 2025-10 |
| 17 | Claude Haiku 4.5 | Anthropic | 41.2% | 2026-01 |
| 18 | Mistral Small | Mistral | 34.7% | 2025-09 |
Score interpretation
Scores are reported as resolution rate (% of issues correctly patched). The headline number on TensorFeed is the SWE-bench Verified subset: the human-validated tasks (500 in the original release) where the test suite has been confirmed to be a fair signal. As of 2026, anything above 60% represents a genuinely useful coding agent; the very top of the leaderboard currently sits just under 70%.
Why this matters for AI agents
If you are building a coding agent, this is the benchmark that matters most. Models with high SWE-bench scores produce patches that compile, pass tests, and respect existing patterns in the codebase. Models with low SWE-bench scores produce code that looks plausible but breaks the build.
Premium API: time-series for SWE-bench
The leaderboard above is a snapshot. Want to see how a model's SWE-bench score has moved over the last 30-90 days, or set a webhook that fires when a score crosses a threshold? The premium API has you covered:
- `/api/premium/history/benchmarks/series?model=&benchmark=swe_bench`: daily score evolution for one model on this benchmark, 1 credit per call
- `/api/premium/forecast?target=benchmark&benchmark=swe_bench`: 1-30 day projection with 95% prediction interval
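A minimal client sketch for the series endpoint. The path and query parameters come from the listing above; the base URL, the bearer-token auth header, the model slug, and the response shape are assumptions, so check them against the API docs before relying on this.

```python
import requests

BASE_URL = "https://api.tensorfeed.example"  # assumed base URL
API_KEY = "YOUR_PREMIUM_KEY"                 # assumed bearer-token auth

def swe_bench_series(model: str) -> list[dict]:
    """Fetch daily SWE-bench score history for one model (1 credit per call)."""
    resp = requests.get(
        f"{BASE_URL}/api/premium/history/benchmarks/series",
        params={"model": model, "benchmark": "swe_bench"},
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=10,
    )
    resp.raise_for_status()
    # Assumed response shape: [{"date": "2026-04-01", "score": 65.4}, ...]
    return resp.json()

# Hypothetical model slug for illustration:
series = swe_bench_series("claude-opus-4.7")
```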