AI Coding Harnesses
The same model can score as much as 18 points apart on the same benchmark depending on which agent harness wraps it. This page tracks how the major coding harnesses (Claude Code, Cursor, Codex CLI, Aider, OpenHands, Devin, Cline, Windsurf, Amp, Continue, Roo Code) perform across SWE-bench Verified, Terminal-Bench, Aider Polyglot, and SWE-Lancer. Last updated 2026-04-30.
Most of the AI coding conversation in 2026 is about harnesses, not models. Claude Sonnet 4.6 in Claude Code scores ~71% on SWE-bench Verified. The same Sonnet 4.6 in Continue scores ~52%. The model is identical. The harness is doing the work: its tool-use loop, retrieval, planning, the order in which it reads files, when it decides to stop and run tests, and how it backs off after a failed edit. The harness gap is real, and it is the load-bearing factor in most production agent setups.
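To make the harness gap concrete, the sketch below takes the Claude Sonnet 4.6 scores on SWE-bench Verified from this page and reports the spread between the best and worst harness. The function name `harness_gap` and the dictionary layout are illustrative choices, not part of any published API. Devin is left out because its base model is listed as proprietary rather than plain Sonnet 4.6.

```python
# Claude Sonnet 4.6 scores on SWE-bench Verified, copied from this page.
SONNET_46_SWE_BENCH = {
    "Amp": 70.8,
    "Claude Code": 70.6,
    "Cursor Agent": 68.4,
    "OpenHands": 65.8,
    "Cline": 63.4,
    "Roo Code": 57.3,
    "Continue": 52.4,
}


def harness_gap(scores: dict[str, float]) -> tuple[str, str, float]:
    """Return (best harness, worst harness, gap in points) for one base model."""
    best = max(scores, key=scores.get)
    worst = min(scores, key=scores.get)
    return best, worst, round(scores[best] - scores[worst], 1)


best, worst, gap = harness_gap(SONNET_46_SWE_BENCH)
print(f"{best} vs {worst}: {gap} points")  # Amp vs Continue: 18.4 points
```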
The matrix below collects the best vendor-published score for each harness × base-model combination across four benchmarks. Tabs above the table switch which benchmark drives the ranked leaderboard view. The full matrix is below that, and each harness name links to a detail page with the harness architecture, model story, and pricing model.
Snapshot of public agentic-coding leaderboard data. Each result is the harness vendor's self-reported best published score for the named base model on the named benchmark. We aggregate; we do not re-run. See sourceUrl on each entry for the upstream report. Refreshed weekly.
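The "best published score" rule can be sketched as a small reduction: for each harness × base-model × benchmark combination, keep the highest score seen and remember where it came from. This is only an illustration of the aggregation described above, not the site's actual pipeline. The `Report` type, the `best_published` function, and the sample records (including the example.com URLs and the placeholder harness and model names) are hypothetical.

```python
from typing import NamedTuple


class Report(NamedTuple):
    harness: str
    base_model: str
    benchmark: str
    score: float
    source_url: str  # stands in for the sourceUrl field mentioned above


def best_published(reports: list[Report]) -> dict[tuple[str, str, str], Report]:
    """Keep the highest-scoring report per (harness, base model, benchmark)."""
    best: dict[tuple[str, str, str], Report] = {}
    for report in reports:
        key = (report.harness, report.base_model, report.benchmark)
        if key not in best or report.score > best[key].score:
            best[key] = report
    return best


# Placeholder records for illustration only; these are not real results.
REPORTS = [
    Report("ExampleHarness", "ExampleModel", "SWE-bench Verified", 60.0, "https://example.com/a"),
    Report("ExampleHarness", "ExampleModel", "SWE-bench Verified", 62.5, "https://example.com/b"),
    Report("ExampleHarness", "ExampleModel", "Terminal-Bench", 40.0, "https://example.com/c"),
]

for key, report in best_published(REPORTS).items():
    print(key, report.score, report.source_url)
```

Because nothing is re-run, the result is only as reliable as the upstream reports, which is why each kept entry carries its source URL.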
SWE-bench Verified: 500 human-validated GitHub issues across 12 Python repos. The harness must produce a patch that resolves the issue and passes the project's test suite.
Scoring unit: % resolved. Max: 100.
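As a quick illustration of the scoring unit, the sketch below converts a count of resolved issues into a percentage of the 500-task set. The function name `percent_resolved` and the one-decimal rounding are assumptions made for this example.

```python
SWE_BENCH_VERIFIED_TASKS = 500  # size of the human-validated set described above


def percent_resolved(resolved: int, total: int = SWE_BENCH_VERIFIED_TASKS) -> float:
    """Convert a count of resolved issues into a score out of 100."""
    if not 0 <= resolved <= total:
        raise ValueError("resolved must be between 0 and total")
    return round(100 * resolved / total, 1)


print(percent_resolved(353))  # 70.6
```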
SWE-bench Verified Leaderboard
| Rank | Harness | Base Model | Vendor | Type | Score |
|---|---|---|---|---|---|
| #1 | Claude Code | Claude Opus 4.7 | Anthropic | cli | 74.5 / 100 |
| #2 | Codex CLI (OSS) | GPT-5.5 | OpenAI | cli | 72.8 / 100 |
| #3 | Amp | Claude Sonnet 4.6 | Sourcegraph | ide | 70.8 / 100 |
| #4 | Claude Code | Claude Sonnet 4.6 | Anthropic | cli | 70.6 / 100 |
| #5 | Cursor Agent | GPT-5.5 | Anysphere (Cursor) | ide | 70.1 / 100 |
| #6 | Codex CLI (OSS) | OpenAI o3 | OpenAI | cli | 69.1 / 100 |
| #7 | Cursor Agent | Claude Sonnet 4.6 | Anysphere (Cursor) | ide | 68.4 / 100 |
| #8 | OpenHands (OSS) | Claude Sonnet 4.6 | All Hands AI | agent-platform | 65.8 / 100 |
| #9 | OpenHands (OSS) | GPT-5.5 | All Hands AI | agent-platform | 64.2 / 100 |
| #10 | Windsurf Cascade | GPT-5.5 | Codeium | ide | 64.1 / 100 |
| #11 | Cline (OSS) | Claude Sonnet 4.6 | Cline Bot | ide | 63.4 / 100 |
| #12 | Devin | Proprietary (Sonnet 4.6 + planner) | Cognition Labs | agent-platform | 61.7 / 100 |
| #13 | Windsurf Cascade | SWE-1 (Codeium) | Codeium | ide | 58.2 / 100 |
| #14 | Roo Code (OSS) | Claude Sonnet 4.6 | Roo Veterinary Inc. | ide | 57.3 / 100 |
| #15 | Continue (OSS) | Claude Sonnet 4.6 | Continue.dev | ide | 52.4 / 100 |
Full Matrix
Every harness × base-model combination across every tracked benchmark. Empty cells mean the vendor has not published a score on that benchmark for that model in that harness.
| Harness | Base Model | SWE-bench Verified | Terminal-Bench | Aider Polyglot | SWE-Lancer |
|---|---|---|---|---|---|
| Claude Code | Claude Opus 4.7 | 74.5 | 52.3 | 84.2 | 41.8 |
| Claude Code | Claude Sonnet 4.6 | 70.6 | 47.1 | 78.4 | 36.2 |
| Cursor Agent | Claude Sonnet 4.6 | 68.4 | 42.0 | - | - |
| Cursor Agent | GPT-5.5 | 70.1 | 41.5 | - | - |
| Codex CLI | GPT-5.5 | 72.8 | 48.2 | 82.1 | 39.6 |
| Codex CLI | OpenAI o3 | 69.1 | 40.4 | 76.9 | - |
| Aider | Claude Opus 4.7 | - | 31.2 | 84.2 | - |
| Aider | GPT-5.5 | - | 28.5 | 81.8 | - |
| Aider | DeepSeek V4 Pro | - | 19.7 | 73.4 | - |
| OpenHands | Claude Sonnet 4.6 | 65.8 | 30.1 | - | 28.4 |
| OpenHands | GPT-5.5 | 64.2 | 29.6 | - | - |
| Devin | Proprietary (Sonnet 4.6 + planner) | 61.7 | - | - | 32.5 |
| Cline | Claude Sonnet 4.6 | 63.4 | - | - | - |
| Windsurf Cascade | GPT-5.5 | 64.1 | 37.8 | - | - |
| Windsurf Cascade | SWE-1 (Codeium) | 58.2 | 30.4 | - | - |
| Amp | Claude Sonnet 4.6 | 70.8 | - | - | - |
| Continue | Claude Sonnet 4.6 | 52.4 | - | - | - |
| Roo Code | Claude Sonnet 4.6 | 57.3 | - | - | - |
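A ranked view for any one benchmark can be derived from the matrix by dropping rows with an empty cell in that column and sorting the rest. The sketch below shows one way to do that on a handful of rows copied from the table above, with `None` standing in for the `-` cells. The names `MATRIX`, `BENCHMARKS`, and `leaderboard` are illustrative.

```python
BENCHMARKS = ("SWE-bench Verified", "Terminal-Bench", "Aider Polyglot", "SWE-Lancer")

# A subset of the matrix rows: (harness, base model, then one score per benchmark).
# None marks a combination with no vendor-published score.
MATRIX = [
    ("Claude Code", "Claude Opus 4.7", 74.5, 52.3, 84.2, 41.8),
    ("Codex CLI", "GPT-5.5", 72.8, 48.2, 82.1, 39.6),
    ("Aider", "GPT-5.5", None, 28.5, 81.8, None),
    ("OpenHands", "Claude Sonnet 4.6", 65.8, 30.1, None, 28.4),
    ("Devin", "Proprietary (Sonnet 4.6 + planner)", 61.7, None, None, 32.5),
]


def leaderboard(matrix, benchmark):
    """Rank (harness, base model, score) rows by one benchmark, skipping empty cells."""
    column = 2 + BENCHMARKS.index(benchmark)
    scored = [(row[0], row[1], row[column]) for row in matrix if row[column] is not None]
    return sorted(scored, key=lambda entry: entry[2], reverse=True)


for rank, (harness, model, score) in enumerate(leaderboard(MATRIX, "SWE-Lancer"), start=1):
    print(f"#{rank} {harness} ({model}): {score}")
```

Skipping empty cells instead of treating them as zero keeps an unpublished score from reading as a failing one.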
Harness Directory
Every harness in the matrix above, with a link to the detail page.
Claude Code: Anthropic's official terminal agent. Native MCP, hooks, slash commands, subagent orchestration, and CLAUDE.md project memory.
Cursor: VS Code fork with a multi-file agent (Composer) and a hosted background agent. Largest paid install base of any AI IDE.
Codex CLI: OpenAI's open-source terminal agent. Sandboxed code execution, OpenAI Apps SDK plug-ins, Apache-2.0 license.
Aider: Open-source CLI. Edit-by-diff over whole-file rewrites; runs on any OpenAI-compatible model. Maintains the Polyglot leaderboard.
OpenHands: Formerly OpenDevin. Open-source autonomous SWE agent with sandboxed runtime, browser tool, and microservice agent architecture.
Devin: Hosted autonomous SWE agent with persistent VM workspaces, Slack and IDE integrations, and DeepWiki repo retrieval.
Cline: Most-installed open-source VS Code agent. Plan-and-act loop with explicit human approval, MCP support, BYOK pricing.
Windsurf: Standalone IDE with Cascade multi-step agent loop. Backed by either frontier APIs or Codeium's own SWE-1 model family.
Amp: Sourcegraph's VS Code and JetBrains agent. Anchored on Sonnet 4.6, layered on a code-graph retrieval system that scales to monorepos.
Continue: Open-source VS Code and JetBrains agent. First-class local model support (Ollama, LM Studio), per-task model routing.
Roo Code: Open-source VS Code agent forked from Cline. Specialized modes (Code, Architect, Ask, Debug), MCP support.
For agents: the same data is served as JSON at /api/harnesses. Free, no auth, cached 5 minutes.
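A client that polls this endpoint gains little from requesting it more often than the 5-minute cache window. The sketch below fetches `/api/harnesses` with the standard library and reuses the last response for 300 seconds. The site's host is not stated here, so `base_url` is a caller-supplied assumption, and the response schema is not documented on this page, so the parsed payload is returned as-is. The `now` and `opener` parameters are hypothetical hooks added for testing.

```python
import json
import time
import urllib.request

CACHE_TTL_SECONDS = 300  # matches the 5-minute server-side cache noted above

_cache = {}  # url -> (fetched_at, parsed payload)


def fetch_harnesses(base_url, now=time.time, opener=urllib.request.urlopen):
    """Return the parsed JSON from <base_url>/api/harnesses, reusing a recent copy."""
    url = base_url.rstrip("/") + "/api/harnesses"
    hit = _cache.get(url)
    if hit is not None and now() - hit[0] < CACHE_TTL_SECONDS:
        return hit[1]
    with opener(url, timeout=10) as response:
        payload = json.load(response)
    _cache[url] = (now(), payload)
    return payload
```

A call would look like `fetch_harnesses("https://<site-host>")`, with the real host substituted for the placeholder.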