
SWE-bench leaderboard

SWE-bench evaluates language models on their ability to resolve real GitHub issues from popular Python repositories. The model is given an issue description and the repository state, and must produce a patch that resolves the issue and passes the project's existing test suite. SWE-bench is the benchmark that most closely tracks "useful for autonomous coding agents": the tasks are not toy problems, the success criterion is the project's actual tests, and the repository-scale input forces the model to reason over real-world code.
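
As a rough illustration of that loop, here is a minimal sketch of a SWE-bench-style evaluation step. It is not the official harness; `generate_patch` is a hypothetical stand-in for whatever model call produces the diff, and the task shape is an assumption.

```python
# Illustrative sketch of a SWE-bench-style evaluation step (not the official harness).
import subprocess
from pathlib import Path


def generate_patch(issue_text: str, repo_dir: Path) -> str:
    """Placeholder for the model under evaluation; should return a unified diff."""
    raise NotImplementedError("plug in your model's patch generation here")


def resolve_issue(repo_dir: Path, issue_text: str, test_cmd: list[str]) -> bool:
    """Apply the model's patch, then check it against the project's own tests."""
    patch = generate_patch(issue_text, repo_dir)
    applied = subprocess.run(
        ["git", "apply", "--whitespace=nowarn", "-"],
        cwd=repo_dir, input=patch, text=True, capture_output=True,
    )
    if applied.returncode != 0:
        return False  # the diff did not even apply cleanly
    # Success criterion: the repository's existing test suite passes.
    tests = subprocess.run(test_cmd, cwd=repo_dir, capture_output=True)
    return tests.returncode == 0
```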

Current leader
GPT-5.5 (OpenAI): 68.7%

Last refreshed 2026-04-24. 18 models scored on this benchmark.

Full leaderboard

#    Model              Provider   Score   Released
1    GPT-5.5            OpenAI     68.7%   2026-04
2    Claude Opus 4.7    Anthropic  65.4%   2026-04
3    DeepSeek V4 Pro    DeepSeek   63.8%   2026-04
4    Claude Opus 4.6    Anthropic  62.3%   2026-03
5    Gemini 2.5 Pro     Google     59.4%   2026-01
6    o1                 OpenAI     58.9%   2025-09
7    GPT-4.5            OpenAI     56.1%   2025-12
8    Claude Sonnet 4.6  Anthropic  55.7%   2026-02
9    Llama 4 Maverick   Meta       52.8%   2026-03
10   DeepSeek V3        DeepSeek   51.4%   2025-12
11   o3-mini            OpenAI     49.3%   2025-11
12   DeepSeek V4 Flash  DeepSeek   48.9%   2026-04
13   GPT-4o             OpenAI     48.5%   2025-05
14   Mistral Large      Mistral    46.2%   2025-11
15   Llama 4 Scout      Meta       44.6%   2026-02
16   Gemini 2.0 Flash   Google     43.1%   2025-10
17   Claude Haiku 4.5   Anthropic  41.2%   2026-01
18   Mistral Small      Mistral    34.7%   2025-09

Score interpretation

Scores are reported as resolution rate (the percentage of issues correctly patched). The headline number on TensorFeed is the SWE-bench Verified subset, the human-validated tasks where the test suite has been confirmed to be a fair signal. Anything above 60% as of 2026 represents a genuinely useful coding agent; the very top of the leaderboard is approaching 70%.

70%+     Frontier-class. Genuinely useful coding agent territory.
50-70%   Production-ready for assisted coding workflows.
30-50%   Useful for narrow tasks but not autonomous agents.
< 30%    Plausible-looking code that often does not work.
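
For readers wiring scores into tooling, a tiny helper that maps a resolution rate onto the bands above might look like this; the thresholds mirror the table, and the function name is just illustrative.

```python
# Maps a SWE-bench Verified resolution rate to the interpretation bands above.
def interpret_score(resolution_rate: float) -> str:
    if resolution_rate >= 70:
        return "Frontier-class: genuinely useful coding agent territory"
    if resolution_rate >= 50:
        return "Production-ready for assisted coding workflows"
    if resolution_rate >= 30:
        return "Useful for narrow tasks but not autonomous agents"
    return "Plausible-looking code that often does not work"


print(interpret_score(68.7))  # -> "Production-ready for assisted coding workflows"
```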

Why this matters for AI agents

If you are building a coding agent, this is the benchmark that matters most. Models with high SWE-bench scores produce patches that compile, pass tests, and respect existing patterns in the codebase. Models with low SWE-bench scores produce code that looks plausible but breaks the build.
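
As a toy example of putting the table to work when choosing an agent backbone, the snippet below filters a few rows copied from the leaderboard above by a resolution-rate floor; the 60% cut-off is arbitrary.

```python
# A few rows from the leaderboard above, filtered by a minimum resolution rate.
leaderboard = [
    ("GPT-5.5", "OpenAI", 68.7),
    ("Claude Opus 4.7", "Anthropic", 65.4),
    ("DeepSeek V4 Pro", "DeepSeek", 63.8),
    ("Claude Opus 4.6", "Anthropic", 62.3),
    ("Gemini 2.5 Pro", "Google", 59.4),
]

candidates = [row for row in leaderboard if row[2] >= 60.0]
for model, provider, score in candidates:
    print(f"{model} ({provider}): {score}%")
```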


Premium API: time-series for SWE-bench

The leaderboard above is a snapshot. Want to see how a model's SWE-bench score has moved over the last 30-90 days, or set a webhook that fires when a score crosses a threshold? The premium API supports both.
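
A sketch of what a client might look like, assuming hypothetical endpoint shapes: none of the URLs, paths, or fields below are documented here, so treat every one of them as an assumption.

```python
# Hypothetical sketch only: the base URL, paths, and fields are assumptions,
# not documented TensorFeed premium API endpoints.
import requests

BASE = "https://api.tensorfeed.example/v1"   # placeholder base URL
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# 30-day score history for one model on SWE-bench (assumed endpoint shape).
history = requests.get(
    f"{BASE}/benchmarks/swe-bench/timeseries",
    params={"model": "gpt-5.5", "days": 30},
    headers=HEADERS,
    timeout=10,
).json()

# Webhook that fires when any model crosses a 70% resolution rate (assumed shape).
requests.post(
    f"{BASE}/webhooks",
    json={"benchmark": "swe-bench", "threshold": 70.0, "url": "https://example.com/hook"},
    headers=HEADERS,
    timeout=10,
)
```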

SWE-bench source · Last refreshed 2026-04-24 · Max score 100