AI Coding Harnesses
The same model can score nearly 20 points apart on the same benchmark depending on which agent harness wraps it. This page tracks how the major coding harnesses (Claude Code, Cursor, Codex CLI, Aider, OpenHands, Devin, Cline, Windsurf, Amp, Continue, Roo Code) perform across SWE-bench Verified, Terminal-Bench, Aider Polyglot, and SWE-Lancer. Last updated 2026-04-30.
Most of the AI coding conversation in 2026 is about harnesses, not models. Claude Sonnet 4.6 in Claude Code scores ~71% on SWE-bench Verified. The same Sonnet 4.6 in Continue scores ~52%. The model is identical. The harness is doing the work: tool-use loop, retrieval, planning, the order it reads files in, when it decides to stop and run tests, how it backs off after a failed edit. The harness gap is real and it is the load-bearing thing in most production agent setups.
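The loop the harness runs is where those choices live. A minimal sketch, with the model call stubbed out (none of the names below come from any real harness; they illustrate the observe-act-stop cycle that all of them implement differently):

```python
# Minimal sketch of a harness agent loop. Real harnesses differ in exactly
# the knobs visible here: what context the model sees, which tools exist,
# and when the loop decides to stop.

def run_agent(task, model, tools, max_steps=10):
    """Drive the model through observe -> act cycles until it says done."""
    transcript = [("task", task)]
    for _ in range(max_steps):
        action = model(transcript)            # harness controls the context window
        if action["tool"] == "done":
            return action["result"]
        observation = tools[action["tool"]](**action["args"])
        transcript.append((action["tool"], observation))
    return None  # the give-up policy is itself a harness design choice

# Stub model: read one file, then declare success.
def stub_model(transcript):
    if len(transcript) == 1:
        return {"tool": "read_file", "args": {"path": "app.py"}}
    return {"tool": "done", "result": "patched"}

tools = {"read_file": lambda path: f"(contents of {path})"}
print(run_agent("fix the bug", stub_model, tools))  # -> patched
```

Everything that separates a 52% harness from a 71% harness happens inside that loop: retrieval before the first model call, test execution as a tool, backoff after failed edits.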
The matrix below collects the best vendor-published score for each harness × base-model combination across four benchmarks. Tabs above the table switch which benchmark drives the ranked leaderboard view. The full matrix is below that, and each harness name links to a detail page with the harness architecture, model story, and pricing model.
Snapshot of public agentic-coding leaderboard data. Each result is the harness vendor's self-reported best published score for the named base model on the named benchmark. We aggregate; we do not re-run. See sourceUrl on each entry for the upstream report. Refreshed weekly.
SWE-bench Verified: 500 human-validated GitHub issues across 12 Python repos. The harness must produce a patch that resolves the issue and passes the project's test suite.
Scoring unit: % resolved. Max: 100.
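The scoring arithmetic is simply resolved instances over the fixed 500-issue split (the resolved count below is illustrative, not a real run):

```python
# SWE-bench Verified: score = percent of the 500 issues resolved.
resolved = 373   # hypothetical count of issues whose patch passed the tests
total = 500      # fixed size of the Verified split
score = 100 * resolved / total
print(f"{score:.1f} / 100")  # 74.6 / 100
```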
SWE-bench Verified Leaderboard
| Rank | Harness | Base Model | Vendor | Type | Score |
|---|---|---|---|---|---|
| #1 | Claude Code | Claude Opus 4.7 | Anthropic | cli | 74.5 / 100 |
| #2 | Codex CLI (OSS) | GPT-5.5 | OpenAI | cli | 72.8 / 100 |
| #3 | Amp | Claude Sonnet 4.6 | Sourcegraph | ide | 70.8 / 100 |
| #4 | Claude Code | Claude Sonnet 4.6 | Anthropic | cli | 70.6 / 100 |
| #5 | Cursor Agent | GPT-5.5 | Anysphere (Cursor) | ide | 70.1 / 100 |
| #6 | Codex CLI (OSS) | OpenAI o3 | OpenAI | cli | 69.1 / 100 |
| #7 | Cursor Agent | Claude Sonnet 4.6 | Anysphere (Cursor) | ide | 68.4 / 100 |
| #8 | OpenHands (OSS) | Claude Sonnet 4.6 | All Hands AI | agent-platform | 65.8 / 100 |
| #9 | OpenHands (OSS) | GPT-5.5 | All Hands AI | agent-platform | 64.2 / 100 |
| #10 | Windsurf Cascade | GPT-5.5 | Codeium | ide | 64.1 / 100 |
| #11 | Cline (OSS) | Claude Sonnet 4.6 | Cline Bot | ide | 63.4 / 100 |
| #12 | Devin | Proprietary (Sonnet 4.6 + planner) | Cognition Labs | agent-platform | 61.7 / 100 |
| #13 | Windsurf Cascade | SWE-1 (Codeium) | Codeium | ide | 58.2 / 100 |
| #14 | Roo Code (OSS) | Claude Sonnet 4.6 | Roo Veterinary Inc. | ide | 57.3 / 100 |
| #15 | Continue (OSS) | Claude Sonnet 4.6 | Continue.dev | ide | 52.4 / 100 |
Full Matrix
Every harness × base-model combination across every tracked benchmark. Empty cells mean the vendor has not published a score on that benchmark for that model in that harness.
| Harness | Base Model | SWE-bench Verified | Terminal-Bench | Aider Polyglot | SWE-Lancer |
|---|---|---|---|---|---|
| Claude Code | Claude Opus 4.7 | 74.5 | 52.3 | 84.2 | 41.8 |
| Claude Code | Claude Sonnet 4.6 | 70.6 | 47.1 | 78.4 | 36.2 |
| Cursor Agent | Claude Sonnet 4.6 | 68.4 | 42.0 | — | — |
| Cursor Agent | GPT-5.5 | 70.1 | 41.5 | — | — |
| Codex CLI | GPT-5.5 | 72.8 | 48.2 | 82.1 | 39.6 |
| Codex CLI | OpenAI o3 | 69.1 | 40.4 | 76.9 | — |
| Aider | Claude Opus 4.7 | — | 31.2 | 84.2 | — |
| Aider | GPT-5.5 | — | 28.5 | 81.8 | — |
| Aider | DeepSeek V4 Pro | — | 19.7 | 73.4 | — |
| OpenHands | Claude Sonnet 4.6 | 65.8 | 30.1 | — | 28.4 |
| OpenHands | GPT-5.5 | 64.2 | 29.6 | — | — |
| Devin | Proprietary (Sonnet 4.6 + planner) | 61.7 | — | — | 32.5 |
| Cline | Claude Sonnet 4.6 | 63.4 | — | — | — |
| Windsurf Cascade | GPT-5.5 | 64.1 | 37.8 | — | — |
| Windsurf Cascade | SWE-1 (Codeium) | 58.2 | 30.4 | — | — |
| Amp | Claude Sonnet 4.6 | 70.8 | — | — | — |
| Continue | Claude Sonnet 4.6 | 52.4 | — | — | — |
| Roo Code | Claude Sonnet 4.6 | 57.3 | — | — | — |
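Each tabbed leaderboard is just a projection of this matrix: filter out the empty cells for the chosen benchmark, then sort descending. A sketch using a few of the rows above (the dict keys are illustrative, not the site's actual schema):

```python
# Deriving a single-benchmark leaderboard from the full matrix.
# None marks an unpublished (empty) cell, like the dashes in the table.
rows = [
    {"harness": "Claude Code", "model": "Claude Opus 4.7", "swe_bench": 74.5, "terminal_bench": 52.3},
    {"harness": "Codex CLI",   "model": "GPT-5.5",         "swe_bench": 72.8, "terminal_bench": 48.2},
    {"harness": "Aider",       "model": "Claude Opus 4.7", "swe_bench": None, "terminal_bench": 31.2},
]

def leaderboard(rows, benchmark):
    scored = [r for r in rows if r[benchmark] is not None]  # drop empty cells
    return sorted(scored, key=lambda r: r[benchmark], reverse=True)

benchmark = "terminal_bench"
for rank, r in enumerate(leaderboard(rows, benchmark), start=1):
    print(f"#{rank} {r['harness']} ({r['model']}): {r[benchmark]}")
```

Note the filtering step: a harness with no published score on a benchmark simply doesn't appear in that tab, which is why the tab rankings have different lengths.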
Harness Directory
Every harness in the matrix above, with a link to the detail page.
Claude Code: Anthropic's official terminal agent. Native MCP, hooks, slash commands, subagent orchestration, and CLAUDE.md project memory. The harness most aligned with Claude's agentic post-training.
Cursor: VS Code fork with a multi-file agent ('Composer') and a background agent. Largest paid install base of any AI IDE; routes to whichever frontier model the user picks.
Codex CLI: OpenAI's open-source terminal agent. Wraps GPT-5.5 and the o-series with code-execution sandboxing, tool-use loops, and an Apps SDK plugin model.
Aider: Pioneer of edit-by-diff coding agents. Maintains its own polyglot leaderboard, runs locally, and works with any Anthropic, OpenAI, Google, or open-source model that exposes an OpenAI-compatible API.
OpenHands: Formerly OpenDevin. Open-source autonomous software engineer with a sandboxed runtime, a browser tool, and a microservice-style agent architecture. The reference implementation behind several SWE-bench leaderboard entries.
Devin: Cognition's hosted autonomous SWE agent. Persistent VM workspaces, Slack and IDE integrations, plus a separate retrieval system (DeepWiki) over the indexed repo.
Cline: Open-source VS Code agent. Plan-and-act loop, MCP support, ~$0 baseline (you pay your model provider directly). The most-installed open-source coding agent.
Windsurf: Codeium's IDE agent. Cascade is its multi-step planning loop, backed by either frontier APIs or Codeium's own SWE-1 model family.
Amp: Sourcegraph's VS Code and JetBrains agent. Anchored on Claude Sonnet 4.6 with a code-graph-aware retrieval layer over the repo. Strong SWE-bench numbers for an IDE agent.
Continue: Open-source IDE agent for VS Code and JetBrains. Configurable per-task model routing, custom slash commands, and a local-first model story. Primarily a code assistant with growing agentic features.
Roo Code: Open-source VS Code agent forked from Cline. Multiple specialized modes (Code, Architect, Ask, Debug), MCP support, and aggressive iteration on tool-use loops.
For agents: the same data is served as JSON at /api/harnesses. Free, no auth required, cached for 5 minutes.
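A minimal consumer sketch. Only the path and the sourceUrl field are stated on this page; the base URL and the other fields in the sample entry are assumptions:

```python
import json
import urllib.request

def fetch_harnesses(base_url):
    """GET the leaderboard JSON; the server caches responses for 5 minutes."""
    with urllib.request.urlopen(f"{base_url}/api/harnesses") as resp:
        return json.load(resp)

# entries = fetch_harnesses("https://example.com")  # substitute the real host

# Offline illustration of an assumed entry shape (only sourceUrl is
# confirmed above; the other field names are guesses):
sample = json.loads(
    '[{"harness": "Claude Code", "baseModel": "Claude Opus 4.7",'
    ' "benchmark": "SWE-bench Verified", "score": 74.5,'
    ' "sourceUrl": "https://example.com/report"}]'
)
print(sample[0]["harness"], sample[0]["score"])
```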