
AI Coding Harnesses

The same model can score 15 points apart on the same benchmark depending on which agent harness wraps it. This page tracks how the major coding harnesses (Claude Code, Cursor, Codex CLI, Aider, OpenHands, Devin, Cline, Windsurf, Amp, Continue, Roo Code) perform across SWE-bench Verified, Terminal-Bench, Aider Polyglot, and SWE-Lancer. Last updated 2026-04-30.

Most of the AI coding conversation in 2026 is about harnesses, not models. Claude Sonnet 4.6 in Claude Code scores ~71% on SWE-bench Verified; the same Sonnet 4.6 in Continue scores ~52%. The model is identical. The harness is doing the work: the tool-use loop, retrieval, planning, the order it reads files in, when it decides to stop and run tests, how it backs off after a failed edit. The harness gap is real, and in most production agent setups the harness, not the model, is the load-bearing component.
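The loop described above can be sketched in a few lines. Everything here is an illustrative toy, not any vendor's implementation: the action names, the scripted stand-in for the model, and the two-failed-edits-then-hint backoff policy are all assumptions made for the sketch.

```python
from dataclasses import dataclass

@dataclass
class Action:
    name: str
    arg: str = ""

class ScriptedModel:
    """Stand-in for an LLM: replays a fixed action script."""
    def __init__(self, script):
        self.script = list(script)
    def next_action(self, history):
        return self.script.pop(0)

def run_harness(model, files, tests_pass, max_steps=10):
    """Tool-use loop: read -> edit -> test -> stop, with edit backoff."""
    history, trace, failed_edits = [], [], 0
    for _ in range(max_steps):
        act = model.next_action(history)
        trace.append(act.name)
        if act.name == "read_file":                 # retrieval / file order
            history.append(files.get(act.arg, "<missing>"))
        elif act.name == "edit_file":
            ok = act.arg in files                   # toy success criterion
            failed_edits = 0 if ok else failed_edits + 1
            if failed_edits >= 2:                   # back off after failed edits
                history.append("hint: re-read the file before retrying")
        elif act.name == "run_tests" and tests_pass(files):
            trace.append("done")                    # harness decides when to stop
            return trace
    return trace

files = {"a.py": "x = 1"}
model = ScriptedModel([Action("read_file", "a.py"),
                       Action("edit_file", "a.py"),
                       Action("run_tests")])
print(run_harness(model, files, tests_pass=lambda f: True))
# -> ['read_file', 'edit_file', 'run_tests', 'done']
```

The interesting design space is entirely in the policies stubbed out here: what counts as a failed edit, how many failures trigger a re-read, and when to burn a step on running tests.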

The matrix below collects the best vendor-published score for each harness × base-model combination across four benchmarks. Tabs above the table switch which benchmark drives the ranked leaderboard view. The full matrix is below that, and each harness name links to a detail page with the harness architecture, model story, and pricing model.

Snapshot of public agentic-coding leaderboard data. Each result is the harness vendor's self-reported best published score for the named base model on the named benchmark. We aggregate; we do not re-run. See sourceUrl on each entry for the upstream report. Refreshed weekly.

SWE-bench Verified: 500 human-validated GitHub issues across 12 Python repos. The harness must produce a patch that resolves the issue and passes the project's test suite.

Scoring unit: % resolved. Max: 100.
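The scoring unit is plain arithmetic over the 500 tasks; a one-liner makes it concrete (the `swe_bench_score` helper is ours, for illustration, not part of the benchmark tooling):

```python
def swe_bench_score(resolved, total=500):
    """SWE-bench Verified score: percent of the task set resolved."""
    return 100.0 * resolved / total

print(swe_bench_score(250))  # -> 50.0
```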

SWE-bench Verified Leaderboard

| Rank | Harness | Base Model | Vendor | Type | Score (max 100) |
|---|---|---|---|---|---|
| #1 | Claude Code | Claude Opus 4.7 | Anthropic | cli | 74.5 |
| #2 | Codex CLI (OSS) | GPT-5.5 | OpenAI | cli | 72.8 |
| #3 | Amp | Claude Sonnet 4.6 | Sourcegraph | ide | 70.8 |
| #4 | Claude Code | Claude Sonnet 4.6 | Anthropic | cli | 70.6 |
| #5 | Cursor Agent | GPT-5.5 | Anysphere (Cursor) | ide | 70.1 |
| #6 | Codex CLI (OSS) | OpenAI o3 | OpenAI | cli | 69.1 |
| #7 | Cursor Agent | Claude Sonnet 4.6 | Anysphere (Cursor) | ide | 68.4 |
| #8 | OpenHands (OSS) | Claude Sonnet 4.6 | All Hands AI | agent-platform | 65.8 |
| #9 | OpenHands (OSS) | GPT-5.5 | All Hands AI | agent-platform | 64.2 |
| #10 | Windsurf Cascade | GPT-5.5 | Codeium | ide | 64.1 |
| #11 | Cline (OSS) | Claude Sonnet 4.6 | Cline Bot | ide | 63.4 |
| #12 | Devin | Proprietary (Sonnet 4.6 + planner) | Cognition Labs | agent-platform | 61.7 |
| #13 | Windsurf Cascade | SWE-1 (Codeium) | Codeium | ide | 58.2 |
| #14 | Roo Code (OSS) | Claude Sonnet 4.6 | Roo Veterinary Inc. | ide | 57.3 |
| #15 | Continue (OSS) | Claude Sonnet 4.6 | Continue.dev | ide | 52.4 |

Full Matrix

Every harness × base-model combination across every tracked benchmark. Empty cells mean the vendor has not published a score on that benchmark for that model in that harness.

| Harness | Base Model | SWE-bench Verified | Terminal-Bench | Aider Polyglot | SWE-Lancer |
|---|---|---|---|---|---|
| Claude Code | Claude Opus 4.7 | 74.5 | 52.3 | 84.2 | 41.8 |
| Claude Code | Claude Sonnet 4.6 | 70.6 | 47.1 | 78.4 | 36.2 |
| Cursor Agent | Claude Sonnet 4.6 | 68.4 | 42.0 | | |
| Cursor Agent | GPT-5.5 | 70.1 | 41.5 | | |
| Codex CLI | GPT-5.5 | 72.8 | 48.2 | 82.1 | 39.6 |
| Codex CLI | OpenAI o3 | 69.1 | 40.4 | 76.9 | |
| Aider | Claude Opus 4.7 | | 31.2 | 84.2 | |
| Aider | GPT-5.5 | | 28.5 | 81.8 | |
| Aider | DeepSeek V4 Pro | | 19.7 | 73.4 | |
| OpenHands | Claude Sonnet 4.6 | 65.8 | 30.1 | | 28.4 |
| OpenHands | GPT-5.5 | 64.2 | 29.6 | | |
| Devin | Proprietary (Sonnet 4.6 + planner) | 61.7 | 32.5 | | |
| Cline | Claude Sonnet 4.6 | 63.4 | | | |
| Windsurf Cascade | GPT-5.5 | 64.1 | 37.8 | | |
| Windsurf Cascade | SWE-1 (Codeium) | 58.2 | 30.4 | | |
| Amp | Claude Sonnet 4.6 | 70.8 | | | |
| Continue | Claude Sonnet 4.6 | 52.4 | | | |
| Roo Code | Claude Sonnet 4.6 | 57.3 | | | |
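One way to read the matrix is to fix the base model and look at the spread across harnesses. A small sketch, with the Claude Sonnet 4.6 entries of the SWE-bench Verified column hand-copied from the table above:

```python
# SWE-bench Verified scores for the SAME base model (Claude Sonnet 4.6)
# across harnesses -- values copied from the matrix above.
SONNET_46_SWE_BENCH = {
    "Claude Code": 70.6,
    "Cursor Agent": 68.4,
    "OpenHands": 65.8,
    "Cline": 63.4,
    "Amp": 70.8,
    "Continue": 52.4,
    "Roo Code": 57.3,
}

best = max(SONNET_46_SWE_BENCH, key=SONNET_46_SWE_BENCH.get)
worst = min(SONNET_46_SWE_BENCH, key=SONNET_46_SWE_BENCH.get)
gap = SONNET_46_SWE_BENCH[best] - SONNET_46_SWE_BENCH[worst]
print(f"{best} vs {worst}: {gap:.1f} points on the same model")
# -> Amp vs Continue: 18.4 points on the same model
```

An 18.4-point spread on an identical model is the harness gap the intro describes.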

Harness Directory

Every harness in the matrix above, with a link to the detail page.

Claude Code
Anthropic
cli

Anthropic's official terminal agent. Native MCP, hooks, slash commands, subagent orchestration, and CLAUDE.md project memory. The harness most aligned with Claude's agentic post-training.

Anthropic models only
Cursor Agent
Anysphere (Cursor)
ide

VS Code fork with a multi-file agent ('Composer') and a background agent. Largest paid install base of any AI IDE; routes to whichever frontier model the user picks.

Multi-model, BYOK
Codex CLI
OpenAI
cli

OpenAI's open-source terminal agent. Wraps GPT-5.5 and the o-series with code-execution sandboxing, tool-use loops, and an Apps SDK plugin model.

OSS · OpenAI models only
Aider
Paul Gauthier
cli

Pioneer of edit-by-diff coding agents. Maintains its own polyglot leaderboard, runs locally, and works against any Anthropic, OpenAI, Google, or open-source model with an OpenAI-compatible API.

OSS · Multi-model, BYOK
OpenHands
All Hands AI
agent-platform

Formerly OpenDevin. Open-source autonomous software engineer with a sandboxed runtime, browser tool, and microservice-style agent architecture. The reference implementation behind several SWE-bench leaderboard entries.

OSS · Multi-model
Devin
Cognition Labs
agent-platform

Cognition's hosted autonomous SWE agent. Persistent VM workspaces, Slack and IDE integrations, plus a separate retrieval system (DeepWiki) over the indexed repo.

Proprietary mix
Cline
Cline Bot
ide

Open-source VS Code agent. Plan-and-act loop, MCP support, ~$0 baseline (you pay your model provider directly). Most-installed open-source coding agent.

OSS · Multi-model, BYOK
Windsurf Cascade
Codeium
ide

Codeium's IDE agent. Cascade is the multi-step planning loop; backed by either frontier APIs or Codeium's own SWE-1 family.

Multi-model, in-house option
Amp
Sourcegraph
ide

Sourcegraph's VS Code and JetBrains agent. Anchored on Claude Sonnet 4.6 with a code-graph-aware retrieval layer over the repo. Strong SWE-bench numbers for an IDE agent.

Sonnet 4.6 default
Continue
Continue.dev
ide

Open-source IDE agent for VS Code and JetBrains. Configurable per-task model routing, custom slash commands, and a local-first model story. Primarily a code assistant with growing agentic features.

OSS · Multi-model, BYOK
Roo Code
Roo Veterinary Inc.
ide

Open-source VS Code agent forked from Cline. Multiple specialized 'modes' (Code, Architect, Ask, Debug), MCP support, and aggressive iteration on tool-use loops.

OSS · Multi-model, BYOK

For agents: the same data is served as JSON at /api/harnesses. Free, no auth, cached 5 minutes.
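A minimal client sketch for that endpoint, using only the Python standard library. The /api/harnesses path comes from the note above; the entry field names (`harness`, `benchmark`, `score`) are guesses at the schema, not documented fields:

```python
import json
from urllib.request import urlopen

def fetch_harnesses(base_url):
    """GET {base_url}/api/harnesses and parse the JSON body."""
    with urlopen(f"{base_url}/api/harnesses") as resp:
        return json.load(resp)

def best_by_benchmark(entries, benchmark):
    """Top-scoring entry for one benchmark, or None if untracked."""
    rows = [e for e in entries if e.get("benchmark") == benchmark]
    return max(rows, key=lambda e: e["score"], default=None)
```

With entries shaped like `{"harness": ..., "benchmark": ..., "score": ...}`, `best_by_benchmark` reproduces the #1 row of whichever leaderboard tab you pick.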