
AI Coding Harnesses

The same model can score 15 points apart on the same benchmark depending on which agent harness wraps it. This page tracks how the major coding harnesses (Claude Code, Cursor, Codex CLI, Aider, OpenHands, Devin, Cline, Windsurf, Amp, Continue, Roo Code) perform across SWE-bench Verified, Terminal-Bench, Aider Polyglot, and SWE-Lancer. Last updated 2026-04-30.

Most of the AI coding conversation in 2026 is about harnesses, not models. Claude Sonnet 4.6 in Claude Code scores ~71% on SWE-bench Verified; the same Sonnet 4.6 in Continue scores ~52%. The model is identical. The harness is doing the work: the tool-use loop, retrieval, planning, the order it reads files in, when it decides to stop and run tests, how it backs off after a failed edit. The harness gap is real, and in most production agent setups the harness, not the model, is the load-bearing component.
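The loop described above can be sketched in a few lines. Everything here is an illustrative toy, not any vendor's implementation: the action names, the scripted stand-in for the model, and the two-failed-edits-then-hint backoff policy are all assumptions made for the sketch.

```python
from dataclasses import dataclass

@dataclass
class Action:
    name: str
    arg: str = ""

class ScriptedModel:
    """Stand-in for an LLM: replays a fixed action script."""
    def __init__(self, script):
        self.script = list(script)
    def next_action(self, history):
        return self.script.pop(0)

def run_harness(model, files, tests_pass, max_steps=10):
    """Tool-use loop: read -> edit -> test -> stop, with edit backoff."""
    history, trace, failed_edits = [], [], 0
    for _ in range(max_steps):
        act = model.next_action(history)
        trace.append(act.name)
        if act.name == "read_file":                 # retrieval / file order
            history.append(files.get(act.arg, "<missing>"))
        elif act.name == "edit_file":
            ok = act.arg in files                   # toy success criterion
            failed_edits = 0 if ok else failed_edits + 1
            if failed_edits >= 2:                   # back off after failed edits
                history.append("hint: re-read the file before retrying")
        elif act.name == "run_tests" and tests_pass(files):
            trace.append("done")                    # harness decides when to stop
            return trace
    return trace

files = {"a.py": "x = 1"}
model = ScriptedModel([Action("read_file", "a.py"),
                       Action("edit_file", "a.py"),
                       Action("run_tests")])
print(run_harness(model, files, tests_pass=lambda f: True))
# -> ['read_file', 'edit_file', 'run_tests', 'done']
```

The interesting design space is entirely in the policies stubbed out here: what counts as a failed edit, how many failures trigger a re-read, and when to burn a step on running tests.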

The matrix below collects the best vendor-published score for each harness × base-model combination across four benchmarks. Tabs above the table switch which benchmark drives the ranked leaderboard view. The full matrix is below that, and each harness name links to a detail page with the harness architecture, model story, and pricing model.

Snapshot of public agentic-coding leaderboard data. Each result is the harness vendor's self-reported best published score for the named base model on the named benchmark. We aggregate; we do not re-run. See sourceUrl on each entry for the upstream report. Refreshed weekly.

SWE-bench Verified: 500 human-validated GitHub issues across 12 Python repos. The harness must produce a patch that resolves the issue and passes the project's test suite.

Scoring unit: % resolved. Max: 100.
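The scoring unit is plain arithmetic over the 500 tasks; a one-liner makes it concrete (the `swe_bench_score` helper is ours, for illustration, not part of the benchmark tooling):

```python
def swe_bench_score(resolved, total=500):
    """SWE-bench Verified score: percent of the task set resolved."""
    return 100.0 * resolved / total

print(swe_bench_score(250))  # -> 50.0
```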

SWE-bench Verified Leaderboard

| Rank | Harness | Base Model | Vendor | Type | Score (max 100) |
|---|---|---|---|---|---|
| #1 | Claude Code | Claude Opus 4.7 | Anthropic | cli | 74.5 |
| #2 | Codex CLI (OSS) | GPT-5.5 | OpenAI | cli | 72.8 |
| #3 | Amp | Claude Sonnet 4.6 | Sourcegraph | ide | 70.8 |
| #4 | Claude Code | Claude Sonnet 4.6 | Anthropic | cli | 70.6 |
| #5 | Cursor Agent | GPT-5.5 | Anysphere (Cursor) | ide | 70.1 |
| #6 | Codex CLI (OSS) | OpenAI o3 | OpenAI | cli | 69.1 |
| #7 | Cursor Agent | Claude Sonnet 4.6 | Anysphere (Cursor) | ide | 68.4 |
| #8 | OpenHands (OSS) | Claude Sonnet 4.6 | All Hands AI | agent-platform | 65.8 |
| #9 | OpenHands (OSS) | GPT-5.5 | All Hands AI | agent-platform | 64.2 |
| #10 | Windsurf Cascade | GPT-5.5 | Codeium | ide | 64.1 |
| #11 | Cline (OSS) | Claude Sonnet 4.6 | Cline Bot | ide | 63.4 |
| #12 | Devin | Proprietary (Sonnet 4.6 + planner) | Cognition Labs | agent-platform | 61.7 |
| #13 | Windsurf Cascade | SWE-1 (Codeium) | Codeium | ide | 58.2 |
| #14 | Roo Code (OSS) | Claude Sonnet 4.6 | Roo Veterinary Inc. | ide | 57.3 |
| #15 | Continue (OSS) | Claude Sonnet 4.6 | Continue.dev | ide | 52.4 |

Full Matrix

Every harness × base-model combination across every tracked benchmark. Empty cells mean the vendor has not published a score on that benchmark for that model in that harness.

| Harness | Base Model | SWE-bench Verified | Terminal-Bench | Aider Polyglot | SWE-Lancer |
|---|---|---|---|---|---|
| Claude Code | Claude Opus 4.7 | 74.5 | 52.3 | 84.2 | 41.8 |
| Claude Code | Claude Sonnet 4.6 | 70.6 | 47.1 | 78.4 | 36.2 |
| Cursor Agent | Claude Sonnet 4.6 | 68.4 | 42.0 | | |
| Cursor Agent | GPT-5.5 | 70.1 | 41.5 | | |
| Codex CLI | GPT-5.5 | 72.8 | 48.2 | 82.1 | 39.6 |
| Codex CLI | OpenAI o3 | 69.1 | 40.4 | 76.9 | |
| Aider | Claude Opus 4.7 | | 31.2 | 84.2 | |
| Aider | GPT-5.5 | | 28.5 | 81.8 | |
| Aider | DeepSeek V4 Pro | | 19.7 | 73.4 | |
| OpenHands | Claude Sonnet 4.6 | 65.8 | 30.1 | | 28.4 |
| OpenHands | GPT-5.5 | 64.2 | 29.6 | | |
| Devin | Proprietary (Sonnet 4.6 + planner) | 61.7 | 32.5 | | |
| Cline | Claude Sonnet 4.6 | 63.4 | | | |
| Windsurf Cascade | GPT-5.5 | 64.1 | 37.8 | | |
| Windsurf Cascade | SWE-1 (Codeium) | 58.2 | 30.4 | | |
| Amp | Claude Sonnet 4.6 | 70.8 | | | |
| Continue | Claude Sonnet 4.6 | 52.4 | | | |
| Roo Code | Claude Sonnet 4.6 | 57.3 | | | |
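One way to read the matrix is to fix the base model and look at the spread across harnesses. A small sketch, with the Claude Sonnet 4.6 entries of the SWE-bench Verified column hand-copied from the table above:

```python
# SWE-bench Verified scores for the SAME base model (Claude Sonnet 4.6)
# across harnesses -- values copied from the matrix above.
SONNET_46_SWE_BENCH = {
    "Claude Code": 70.6,
    "Cursor Agent": 68.4,
    "OpenHands": 65.8,
    "Cline": 63.4,
    "Amp": 70.8,
    "Continue": 52.4,
    "Roo Code": 57.3,
}

best = max(SONNET_46_SWE_BENCH, key=SONNET_46_SWE_BENCH.get)
worst = min(SONNET_46_SWE_BENCH, key=SONNET_46_SWE_BENCH.get)
gap = SONNET_46_SWE_BENCH[best] - SONNET_46_SWE_BENCH[worst]
print(f"{best} vs {worst}: {gap:.1f} points on the same model")
# -> Amp vs Continue: 18.4 points on the same model
```

An 18.4-point spread on an identical model is the harness gap the intro describes.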

Harness Directory

Every harness in the matrix above, with a link to the detail page.

Claude Code
Anthropic
cli

Anthropic's official terminal agent. Native MCP, hooks, slash commands, subagent orchestration, and CLAUDE.md project memory. The harness most aligned with Claude's agentic post-training.

Anthropic models only
Cursor Agent
Anysphere (Cursor)
ide

VS Code fork with a multi-file agent ('Composer') and a background agent. Largest paid install base of any AI IDE; routes to whichever frontier model the user picks.

Multi-model, BYOK
Codex CLI
OpenAI
cli

OpenAI's open-source terminal agent. Wraps GPT-5.5 and the o-series with code-execution sandboxing, tool-use loops, and an Apps SDK plugin model.

OSS · OpenAI models only
Aider
Paul Gauthier
cli

Pioneer of edit-by-diff coding agents. Maintains its own polyglot leaderboard, runs locally, and works against any Anthropic, OpenAI, Google, or open-source model with an OpenAI-compatible API.

OSS · Multi-model, BYOK
OpenHands
All Hands AI
agent-platform

Formerly OpenDevin. Open-source autonomous software engineer with a sandboxed runtime, browser tool, and microservice-style agent architecture. The reference implementation behind several SWE-bench leaderboard entries.

OSS · Multi-model
Devin
Cognition Labs
agent-platform

Cognition's hosted autonomous SWE agent. Persistent VM workspaces, Slack and IDE integrations, plus a separate retrieval system (DeepWiki) over the indexed repo.

Proprietary mix
Cline
Cline Bot
ide

Open-source VS Code agent. Plan-and-act loop, MCP support, ~$0 baseline (you pay your model provider directly). Most-installed open-source coding agent.

OSS · Multi-model, BYOK
Windsurf Cascade
Codeium
ide

Codeium's IDE agent. Cascade is the multi-step planning loop; backed by either frontier APIs or Codeium's own SWE-1 family.

Multi-model, in-house option
Amp
Sourcegraph
ide

Sourcegraph's VS Code and JetBrains agent. Anchored on Claude Sonnet 4.6 with a code-graph-aware retrieval layer over the repo. Strong SWE-bench numbers for an IDE agent.

Sonnet 4.6 default
Continue
Continue.dev
ide

Open-source IDE agent for VS Code and JetBrains. Configurable per-task model routing, custom slash commands, and a local-first model story. Primarily a code assistant with growing agentic features.

OSS · Multi-model, BYOK
Roo Code
Roo Veterinary Inc.
ide

Open-source VS Code agent forked from Cline. Multiple specialized 'modes' (Code, Architect, Ask, Debug), MCP support, and aggressive iteration on tool-use loops.

OSS · Multi-model, BYOK

For agents: the same data is served as JSON at /api/harnesses. Free, no auth, cached 5 minutes.
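A minimal client sketch for that endpoint, using only the Python standard library. The /api/harnesses path comes from the note above; the entry field names (`harness`, `benchmark`, `score`) are guesses at the schema, not documented fields:

```python
import json
from urllib.request import urlopen

def fetch_harnesses(base_url):
    """GET {base_url}/api/harnesses and parse the JSON body."""
    with urlopen(f"{base_url}/api/harnesses") as resp:
        return json.load(resp)

def best_by_benchmark(entries, benchmark):
    """Top-scoring entry for one benchmark, or None if untracked."""
    rows = [e for e in entries if e.get("benchmark") == benchmark]
    return max(rows, key=lambda e: e["score"], default=None)
```

With entries shaped like `{"harness": ..., "benchmark": ..., "score": ...}`, `best_by_benchmark` reproduces the #1 row of whichever leaderboard tab you pick.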