LIVE
OPUS 4.7$15 / $75per Mtok
SONNET 4.6$3 / $15per Mtok
GPT-5.5$10 / $30per Mtok
GEMINI 3.1$3.50 / $10.50per Mtok
SWE-BENCHleader Claude Opus 4.772.1%
MMLU-PROleader Opus 4.788.4
VALS FINANCEleader Opus 4.764.4%
AFTAv1.0 whitepaper live at /whitepaper
OPUS 4.7$15 / $75per Mtok
SONNET 4.6$3 / $15per Mtok
GPT-5.5$10 / $30per Mtok
GEMINI 3.1$3.50 / $10.50per Mtok
SWE-BENCHleader Claude Opus 4.772.1%
MMLU-PROleader Opus 4.788.4
VALS FINANCEleader Opus 4.764.4%
AFTAv1.0 whitepaper live at /whitepaper
All systems operational0 AI providers monitored, polled every 2 minutes
Live status
All TF Verdicts
TF Verdict·Models··Medium confidence

What is the best-value open-weight model for coding agents today (as of 29 May 2026), judged on coding-agent benchmark resolve rate against hosted inference price, with output tokens as the binding cost?

The verdict

As of 29 May 2026, DeepSeek V4-Pro is the best-value open-weight model for coding agents: at 80.6% on SWE-bench Verified and 67.9% on Terminal-Bench 2.0 it leads the open-weight field, and its now-permanent $0.435 in / $0.87 out per million tokens makes its cost per resolved task lower than DeepSeek V3.2 once you count failed and retried tasks (an inference from price plus resolve rate, assuming comparable token use per task), with GLM-5.1 the only close rival and DeepSeek V3.2 the pick when raw output price is the single constraint.

Our verdict, as of 29 May 2026: if you are wiring an open-weight model into a coding agent today, DeepSeek V4-Pro is the best value. This is a change from where we landed earlier this year, when V3.2 owned the cheapest output token and the field had nothing above it.

Value is resolve rate over cost, and for agents the cost that bites is output tokens, because agents stream long diffs, tool calls, and reasoning.

On capability V4-Pro leads the open field: 80.6% on SWE-bench Verified per its Hugging Face card, and 67.9% on Terminal-Bench 2.0. Standard DeepSeek V3.2 sits at a vendor-reported 73.1% on SWE-bench (DeepSeek's own paper, not independently audited) and 39.6% on Terminal-Bench 2.0 under the third-party Terminus 2 harness.

On price V4-Pro is $0.435 in and $0.87 out per million on OpenRouter, a rate DeepSeek made permanent on 22 May after the 75% cut. That output is about 2.3 times V3.2's $0.378. But the SWE-bench gain is roughly 7.5 points and the terminal-agent gain is large, so on a cost-per-resolved-task basis, which is an inference from price plus resolve rate, not a measured figure, V4-Pro comes out ahead once you count retries.

Two carve-outs. GLM-5.1 is the real rival at a self-reported 77.8% SWE-bench (Z.ai's own figure, no published independent audit), but its $3.08 to $4.40 output runs 2x to 4x V4-Pro. And if raw output price is your only constraint, V3.2 stays the budget floor at $0.378 with a clean 131K context.

Bottom line: V4-Pro for value, GLM-5.1 if you must audit a rival, V3.2 when the meter is everything.

The evidence

The data points behind this verdict. Each is cited so you can check the call against its source.

DeepSeek V4-Pro SWE-bench Verified score, leading the open-weight field

80.6%

DeepSeek-V4-Pro model card, Hugging Face

DeepSeek V4-Pro Terminal-Bench 2.0 accuracy (per the model card's agentic results)

67.9%

DeepSeek-V4-Pro model card, Hugging Face

DeepSeek V4-Pro hosted inference price on OpenRouter (now-permanent rate after the 75% cut announced 22 May 2026)

$0.435 input / $0.87 output per million tokens, 1M context

OpenRouter DeepSeek V4 Pro model page

DeepSeek V4-Pro release date and license: permissive, commercial-use (open-weight, not fully open-source)

released 24 April 2026, permissive weights on Hugging Face

Codersera DeepSeek V4-Pro review

Standard DeepSeek V3.2 SWE-bench Verified primary score (vendor-reported, robust across Claude Code and RooCode harnesses to 72 to 74%)

73.1%

DeepSeek-V3.2 technical report (arXiv 2512.02556)

Standard DeepSeek V3.2 Terminal-Bench 2.0 accuracy under the third-party Terminus 2 agent (DeepSeek's own paper reports a higher 46.4% under Claude Code)

39.6% (plus or minus 2.8)

Terminal-Bench 2.0 leaderboard

Standard DeepSeek V3.2 hosted inference price on OpenRouter, cheapest raw output of the open contenders

$0.252 input / $0.378 output per million tokens, 131K context

OpenRouter DeepSeek V3.2 model page

DeepSeek V3.2-Exp Aider Polyglot score, the top open-source entry on the leaderboard (distinct experimental checkpoint, not the priced standard V3.2)

74.5%

llm-stats Aider Polyglot leaderboard

GLM-5.1 SWE-bench Verified score, the closest open-weight rival to V4-Pro on capability (self-reported by Z.ai, no published independent audit)

about 77.8% (self-reported)

Z.AI GLM-5.1 coverage, MarkTechPost

GLM-5.1 hosted inference price, 2x to 4x V4-Pro on output

$0.98 to $1.40 input / $3.08 to $4.40 output per million tokens (provider-dependent)

llm-stats GLM-5.1 model page

Caveats

Output-token pricing is the load-bearing assumption; if your workload is read-heavy (huge input context, little generation), V3.2 at $0.252 input or even cheaper providers narrow the gap. Cost per resolved task is an inference from output price times benchmark resolve rate, not a measured per-task cost, and it assumes comparable token consumption per task across models; V4-Pro's larger reasoning traces could erode part of its edge. SWE-bench figures for both DeepSeek models are vendor-reported (the model card and the arXiv paper), not independently audited; the standard V3.2 primary is 73.1% with a 72 to 74% robustness range across Claude Code and RooCode harnesses. The GLM-5.1 capability figure is likewise self-reported by Z.ai with no published independent audit, so the rival comparison rests on two unaudited numbers. Terminal-Bench numbers are harness-dependent: V3.2's 39.6% is third-party Terminus 2 while DeepSeek's own paper claims 46.4% under Claude Code, so cross-harness comparisons are noisy. Note the earlier draft conflated three distinct DeepSeek checkpoints; the 74.5% Aider figure belongs to V3.2-Exp (a different model at $0.27/$0.41), not the priced standard V3.2. V4-Pro weights are permissive and commercial-use but training code and data are unreleased, so it is open-weight, not fully open-source. OpenRouter and Z.ai prices float and route across providers; spot-check before committing. Closed models (Claude Opus 4.7 at about 87.6%) score higher but are out of scope here.

A TF Verdict is TensorFeed's own analysis over cited public data, not a republished dataset. We take a clear position, show the evidence and the sources, and date-stamp the call because the answer can change. Disagree with a data point? Follow the source link and check it yourself.