TF Verdicts

Signed, opinionated rulings on specific AI-ecosystem questions, reasoned from cited data. We show the evidence so you can check the call.

Models·May 29, 2026·Medium confidence

What is the best-value open-weight model for coding agents today (as of 29 May 2026), judged on coding-agent benchmark resolve rate against hosted inference price, with output tokens as the binding cost?

As of 29 May 2026, DeepSeek V4-Pro is the best-value open-weight model for coding agents. It leads the open-weight field at 80.6% on SWE-bench Verified and 67.9% on Terminal-Bench 2.0, and its now-permanent price of $0.435 in / $0.87 out per million tokens puts its cost per resolved task below DeepSeek V3.2's once failed and retried tasks are counted. That cost comparison is an inference from price plus resolve rate, assuming comparable token use per task. GLM-5.1 is the only close rival, and DeepSeek V3.2 remains the pick when raw output price is the single constraint.

Read the verdict
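
The cost-per-resolved-task reasoning in the verdict above can be sketched as simple arithmetic. This is a minimal illustration, not the method behind the verdict: the function name and the per-task token counts are hypothetical, and dividing by the resolve rate assumes every attempt is billed and failed tasks are retried independently until one succeeds. Only the prices and the 80.6% SWE-bench Verified rate come from the verdict.

```python
def cost_per_resolved_task(price_in, price_out, tokens_in, tokens_out, resolve_rate):
    """Expected USD spent per successfully resolved task.

    price_in / price_out are USD per million tokens.
    Every attempt is paid for whether or not it resolves the task, so the
    per-attempt cost is divided by the resolve rate (expected attempts = 1 / rate,
    assuming independent retries).
    """
    if not 0 < resolve_rate <= 1:
        raise ValueError("resolve_rate must be in (0, 1]")
    cost_per_attempt = (tokens_in * price_in + tokens_out * price_out) / 1_000_000
    return cost_per_attempt / resolve_rate


# Hypothetical workload: token counts per task are assumptions, not measured values.
TOKENS_IN = 200_000
TOKENS_OUT = 50_000

# Prices and resolve rate as quoted in the verdict for DeepSeek V4-Pro.
v4_pro = cost_per_resolved_task(0.435, 0.87, TOKENS_IN, TOKENS_OUT, 0.806)
print(f"DeepSeek V4-Pro: ${v4_pro:.4f} per resolved task")
```

To compare a second model, call the same function with its own prices and resolve rate while holding the token counts fixed. That fixed workload is exactly the "comparable token use per task" assumption the verdict states.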

Benchmarks·May 29, 2026·High confidence

Which AI benchmarks should you stop trusting for model selection?

As of 29 May 2026, stop ranking frontier models on MMLU, the original GSM8K, HumanEval, and increasingly MMLU-Pro: they are saturated or contaminated and no longer discriminate between models. Select on contamination-resistant and held-out evals instead (SWE-bench Pro and Humanity's Last Exam for real spread, LiveCodeBench for its post-cutoff design, GPQA Diamond as a tiebreaker only).

Read the verdict

Compute·May 29, 2026·Medium confidence

Has frontier AI training-compute growth actually slowed?

No, not at the ceiling. As of late May 2026, frontier training compute is still climbing at roughly 4 to 5x per year and the biggest run on record keeps getting bigger. Below that ceiling the curve is bending: total training compute per flagship model is flattening, and the slowdown likely sits in pretraining as labs reroute spend into reinforcement learning.

Read the verdict

Security·May 29, 2026·High confidence

Should AI-discovered CVEs be trusted like human-found ones?

No, not by default. Trust the pipeline that ships a working reproduction and a human gate; treat any unreviewed bulk AI finding as an unconfirmed lead, not a CVE, until someone reproduces it.

Read the verdict

Inference·May 29, 2026·Medium confidence

Is the proprietary frontier still worth its premium over open models for most agent tasks?

For most agent tasks the answer is no: route default traffic to open weights at the inference floor and reserve the frontier premium for long-horizon agentic coding and high-stakes reasoning, where a roughly 7 to 8 point benchmark gap actually compounds across a trajectory.

Read the verdict
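
The claim above that a benchmark gap "compounds across a trajectory" can be illustrated with a toy model. This is a sketch under strong assumptions: each step of an agent trajectory succeeds independently with a fixed probability, and the two per-step rates below are hypothetical values chosen to sit 8 points apart. They are not benchmark scores, and `trajectory_success` is an illustrative name.

```python
def trajectory_success(step_success: float, n_steps: int) -> float:
    """Probability that all n_steps succeed, assuming independent steps."""
    if not 0.0 <= step_success <= 1.0:
        raise ValueError("step_success must be in [0, 1]")
    if n_steps < 0:
        raise ValueError("n_steps must be non-negative")
    return step_success ** n_steps


# Hypothetical per-step success rates, 8 points apart (assumed, not measured).
FRONTIER_STEP = 0.95
OPEN_STEP = 0.87

for n in (1, 5, 20):
    frontier = trajectory_success(FRONTIER_STEP, n)
    open_weights = trajectory_success(OPEN_STEP, n)
    print(f"{n:>2} steps: frontier {frontier:.3f}, "
          f"open {open_weights:.3f}, ratio {frontier / open_weights:.2f}x")
```

Under these assumptions the advantage ratio grows with every added step, which is why a gap that barely matters on a short task can dominate a long-horizon one.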