TF Verdict · Inference · May 29, 2026 · Medium confidence

Is the proprietary frontier still worth its premium over open models for most agent tasks?

The verdict

For most agent tasks the answer is no: route default traffic to open weights at the inference floor and reserve the frontier premium for long-horizon agentic coding and high-stakes reasoning, where a roughly 7 to 8 point benchmark gap actually compounds across a trajectory.

My ruling, as of 29 May 2026: for most agent tasks the frontier premium is no longer worth it. Default to open weights at the inference floor and spend the premium only where it compounds.

Look at the spread. DeepSeek V4-Flash, open and MIT-licensed, runs $0.14 in and $0.28 out per million tokens. Claude Opus 4.7 runs $5 and $25. That is roughly 90x on output. GPT-5.4 and Gemini 3.1 Pro sit in the middle near $2 to $2.50 in and $12 to $15 out, still roughly 43x to 54x the open floor on output. GPT-5.5 actually climbed to $5 / $30, dearer than Opus on output, which only makes my point louder.
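If you want to check the multiples yourself, here is a minimal sketch of the arithmetic. It uses only the output prices quoted in this piece; the dictionary and function names are mine, chosen for illustration.

```python
# Output prices in USD per million tokens, as quoted in this article.
OUTPUT_PRICE_USD = {
    "DeepSeek V4-Flash": 0.28,
    "Gemini 3.1 Pro": 12.00,
    "GPT-5.4": 15.00,
    "Claude Opus 4.7": 25.00,
    "GPT-5.5": 30.00,
}


def output_multiple(model: str, floor: str = "DeepSeek V4-Flash") -> float:
    """Return how many times dearer `model` is than `floor` on output tokens."""
    return OUTPUT_PRICE_USD[model] / OUTPUT_PRICE_USD[floor]


# Print each model's output price as a multiple of the open floor.
for name in OUTPUT_PRICE_USD:
    print(f"{name}: {output_multiple(name):.1f}x the open floor on output")
```

Swap in your own input prices, or a blended input/output mix that matches your traffic, before drawing conclusions; output-only multiples overstate the gap for prompt-heavy workloads.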

Now the capability gap on the work agents actually do. On the AA-GPQA Diamond aggregate the top frontier models land at 92 to 94 percent (Gemini 3.1 Pro 94.1, GPT-5.5 93.5, GPT-5.4 92.0) and the best open weight, Kimi K2.6, at 91.1. That is a gap of about 1 to 3 points. No one should pay a 40x to 90x bill for retrieval, summarization, extraction, classification, or routine tool calls when the open floor lands that close.

Here is where the frontier still earns it: long-horizon agentic coding. SWE-bench Verified shows GPT-5.5 at 88.7 and Opus 4.7 at 87.6 versus 80.6 for the best open weight, DeepSeek V4 Pro Max. Roughly 8 points against GPT-5.5, about 7 against Opus. In a multi-step loop those per-step gaps tend to compound, though that is directional, not arithmetic; steps correlate and some loops self-correct.
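To see why a small gap can matter over a long loop, here is a rough sketch of the compounding under the simplest possible assumption: every step succeeds independently with the same probability. The per-step rates below are hypothetical numbers I picked for illustration. They are not the SWE-bench figures, which are task-level resolve rates, and real loops will deviate because errors correlate and agents can retry.

```python
def trajectory_success(per_step: float, steps: int) -> float:
    """Probability that all `steps` succeed, ASSUMING independent steps."""
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be a probability in [0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step ** steps


# Hypothetical per-step success rates (illustrative only, not benchmark scores).
STRONGER, WEAKER = 0.98, 0.96

for steps in (1, 5, 10, 20):
    strong = trajectory_success(STRONGER, steps)
    weak = trajectory_success(WEAKER, steps)
    print(f"{steps:>2} steps: {strong:.3f} vs {weak:.3f} (gap {strong - weak:.3f})")
```

Under these assumptions the gap between the two models widens as the trajectory gets longer, which is the directional point; treat the exact values as an upper-bound cartoon rather than a forecast.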

Bottom line: open by default, frontier for the hard agentic loop.
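In practice that policy is a one-line routing rule. The sketch below is one way to express it; the task labels and the `route` function are hypothetical names of mine, and a real router would likely also weigh stakes, trajectory length, and observed failure rates.

```python
from enum import Enum


class Tier(Enum):
    OPEN = "open-weight"
    FRONTIER = "frontier"


# Hypothetical task labels. Only the categories where this article argues
# the premium compounds are sent to the frontier tier.
FRONTIER_TASKS = frozenset({"agentic_coding", "high_stakes_reasoning"})


def route(task_type: str) -> Tier:
    """Open weights by default; frontier only for the listed task types."""
    return Tier.FRONTIER if task_type in FRONTIER_TASKS else Tier.OPEN


for task in ("summarization", "extraction", "tool_call", "agentic_coding"):
    print(f"{task} -> {route(task).value}")
```

Defaulting unknown task types to the open tier mirrors the verdict: the premium has to be justified, not the floor.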

The evidence

The data points behind this verdict. Each is cited so you can check the call against its source.

DeepSeek V4-Flash open-weight API price (the practical open inference floor), MIT-licensed and self-hostable

$0.14 input / $0.28 output per million tokens

DeepSeek API Docs (official pricing)

Claude Opus 4.7 frontier API price, the premium end of the market

$5.00 input / $25.00 output per million tokens

Anthropic / Claude API pricing

GPT-5.5 and GPT-5.4 frontier pricing (the figures used in the argument)

GPT-5.5 $5.00 in / $30 out; GPT-5.4 $2.50 in / $15 out per million tokens

DevTk OpenAI API Pricing Guide 2026

Gemini 3.1 Pro mid-frontier pricing (context windows up to 200K tokens)

$2.00 input / $12.00 output per million tokens

DevTk Gemini 3.1 Pro model pricing (May 2026)

GPQA Diamond reasoning gap between top frontier and best open weight is single digits

Gemini 3.1 Pro 94.1%, GPT-5.5 93.5%, GPT-5.4 92.0% vs best open weight Kimi K2.6 91.1%

BenchLM AA-GPQA Diamond aggregate (updated May 28, 2026)

On agentic coding the frontier lead is real: best open weight trails frontier by roughly 7 to 8 points on SWE-bench Verified

Frontier ~88% (GPT-5.5 88.7%, Opus 4.7 87.6%) vs best open weight DeepSeek V4 Pro Max 80.6%

SWE-bench Verified leaderboard (May 2026), marc0.dev

H100 cloud rental for self-hosting sits in the low single-digit dollars per GPU-hour at specialized providers, so self-hosting open weights can beat hosted API economics past a sustained-volume threshold

H100 cloud rental roughly $1.38 (specialized providers) up to ~$11 per GPU-hour (hyperscalers), median near $3.50 (May 2026)

Thunder Compute NVIDIA H100 Pricing (May 2026)
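To make the self-host threshold concrete, here is a minimal sketch of the per-token cost of a rented GPU. The hourly rates are the ones quoted above; the throughput figure and the utilization levels are assumptions of mine, not measurements from any cited source. Real throughput depends heavily on the model, the number of GPUs it needs, batching, and context length.

```python
def self_host_cost_per_million(
    gpu_hour_usd: float,
    tokens_per_second: float,
    utilization: float = 1.0,
) -> float:
    """USD per million generated tokens on one rented GPU.

    `utilization` is the fraction of each paid hour spent serving tokens.
    """
    if tokens_per_second <= 0 or not 0.0 < utilization <= 1.0:
        raise ValueError("need positive throughput and utilization in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_hour_usd / tokens_per_hour * 1_000_000


# ASSUMPTION: hypothetical aggregate throughput for one GPU; measure your own.
ASSUMED_TOKENS_PER_SECOND = 2_000.0

# Hourly rates quoted in this article (specialized, median, hyperscaler).
for label, rate in (("specialized", 1.38), ("median", 3.50), ("hyperscaler", 11.00)):
    for utilization in (0.25, 1.0):
        cost = self_host_cost_per_million(rate, ASSUMED_TOKENS_PER_SECOND, utilization)
        print(f"{label} at {utilization:.0%} utilization: ${cost:.2f} per million tokens")
```

Compare the printed figures against the hosted open-weight price of $0.28 per million output tokens. The comparison is sensitive to utilization: idle hours are still billed, which is why the threshold is about sustained volume rather than peak volume.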

Caveats

Benchmark leaderboards move week to week and several figures come from aggregator sites (BenchLM, marc0.dev) rather than first-party model cards, so re-verify the exact percentages against your own task before committing budget; hence medium confidence.

GPQA scores in particular are highly source-dependent: Opus 4.7 is reported anywhere from 88.5% (BenchLM base) to 94.2% (some blogs), so treat per-model points as a band, not a fixed number.

On the cited sources, DeepSeek and Kimi open-weight models trade the open-tier lead depending on the benchmark (Kimi K2.6 tops the BenchLM GPQA aggregate at 91.1%, DeepSeek V4 Pro Max tops the marc0.dev SWE-bench table at 80.6%), so the open baseline is a moving target.

The trajectory-level coding advantage assumes roughly independent per-step success; in practice errors correlate and some agent loops self-correct, so the compounding is directional, not a guaranteed multiplier.

The self-host break-even is a rule of thumb: H100 rental ranges from the low single-digit dollars per GPU-hour at specialized providers up to roughly $11 at hyperscalers (median near $3.50 in May 2026), so self-hosting open weights only wins past sustained, high-utilization volume; below that, hosted open APIs are simpler.

"Open weight" here means downloadable and self-hostable, not necessarily open data or training; license terms still govern commercial use.

A TF Verdict is TensorFeed's own analysis over cited public data, not a republished dataset. We take a clear position, show the evidence and the sources, and date-stamp the call because the answer can change. Disagree with a data point? Follow the source link and check it yourself.
