
OpenHands

All Hands AI

OpenHands started as the open-source OpenDevin project and now ships as the reference implementation behind several top SWE-bench Verified entries. Architecturally it is a sandboxed runtime plus a small set of agent processes (CodeAct, Browser, Planner) that share a workspace. Most agentic-coding research papers in 2025-2026 use OpenHands as their substrate.

Type: agent-platform
License: Open source
Model story: Multi-model
Vendor: All Hands AI

Leaderboard Placements

Benchmark            Best base model     Score   Rank
SWE-bench Verified   Claude Sonnet 4.6   65.8    #8 / 15
Terminal-Bench       Claude Sonnet 4.6   30.1    #10 / 13
Aider Polyglot
SWE-Lancer           Claude Sonnet 4.6   28.4    #5 / 5

Distribution

Open-source. Run as a Docker container locally or on a hosted runtime. MIT license.

Model Story

Multi-model. Most leaderboard entries run it with Claude Sonnet 4.6 or GPT-5.5; the harness itself has no preferred model.

Pricing

Free harness; you pay for the underlying API tokens and any compute you host.
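
Since the only variable cost is token usage, a rough per-run estimate is simple arithmetic. The sketch below is illustrative only: the function name, the token counts, and the per-million-token prices are placeholder assumptions, not figures for any particular model or for OpenHands itself.

```python
def estimate_cost_usd(input_tokens: int, output_tokens: int,
                      input_price_per_mtok: float,
                      output_price_per_mtok: float) -> float:
    """Estimate API spend for one agent run.

    Prices are quoted per million tokens (Mtok), so each token count
    is scaled by 1e6 before multiplying by its price.
    """
    input_cost = (input_tokens / 1_000_000) * input_price_per_mtok
    output_cost = (output_tokens / 1_000_000) * output_price_per_mtok
    return input_cost + output_cost


# Placeholder numbers: 2M input tokens, 100k output tokens,
# at a hypothetical $3 / $15 per Mtok.
cost = estimate_cost_usd(2_000_000, 100_000, 3.0, 15.0)
print(f"${cost:.2f}")
```

Self-hosted compute (the Docker host or a hosted runtime) is a separate line item and is not covered by this estimate.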

Who It's For

Researchers and teams building on top of an open agentic substrate, plus anyone who wants to use the same harness that public benchmarks are run on.

Notable Features

  • CodeAct: agent expresses actions as Python code
  • Built-in browser tool for web tasks
  • Sandboxed Docker runtime per session
  • Microservice-style agent architecture (swap planners freely)
  • Reference implementation for SWE-bench paper submissions
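
The CodeAct idea listed above, where the agent emits Python code as its action and reads the result back as an observation, can be illustrated with a small loop. This is a minimal sketch and not OpenHands code: `run_code_action` is a hypothetical helper, and it executes in the current process with no isolation, whereas the harness runs actions inside a sandboxed Docker runtime.

```python
import contextlib
import io


def run_code_action(source: str, namespace: dict) -> str:
    """Run one action given as Python source; return the observation text.

    Hypothetical helper for illustration. It captures stdout and, if the
    code raises, appends the exception so the agent can react to it.
    WARNING: exec() here is unsandboxed; never use it on untrusted code.
    """
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(source, namespace)
    except Exception as exc:
        return buffer.getvalue() + f"{type(exc).__name__}: {exc}"
    return buffer.getvalue()


# The namespace persists between actions, so later code can use earlier results.
state = {}
print(run_code_action("files = ['a.py', 'b.py']\nprint(len(files))", state))
print(run_code_action("print(files[0])", state))
print(run_code_action("print(1 / 0)", state))
```

Returning errors as ordinary observations, rather than raising them, is one plausible design choice: it lets the model see the failure and issue a corrected action on the next turn.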
