TF Verdict · Benchmarks · May 29, 2026 · High confidence

Which AI benchmarks should you stop trusting for model selection?

The verdict

Stop ranking frontier models on MMLU, the original GSM8K, HumanEval, and increasingly MMLU-Pro: they are saturated or contaminated and no longer discriminate. Select on contamination-resistant and held-out evals instead (SWE-bench Pro and Humanity's Last Exam for real spread, LiveCodeBench for its post-cutoff design, GPQA Diamond as a tiebreaker only). As of 29 May 2026.

The TF Verdict, as of 29 May 2026: stop ranking frontier models on MMLU, the original GSM8K, and HumanEval. MMLU-Pro is next on the chopping block. Select on contamination-resistant and held-out evals instead.

Here is the math. MMLU is saturated, with top models near 92.4 percent (GPT-5.5) against a human-expert baseline of about 89.8 percent. When the field clusters past the humans, the gaps you see are noise, not signal. MMLU-Pro was the patch, and its top score already reads around 90 percent (Gemini 3 Pro at roughly 90.1). The patch is wearing out.

HumanEval belongs on the same pile. Frontier models clear pass@1 above 95 percent on it, so it stopped separating anyone years ago. Treat a high HumanEval score as table stakes, not a ranking.

Contamination is the uglier problem. Scale's GSM1k study rebuilt grade-school math from scratch and watched Phi and Mistral families drop up to 13 points. That is memorization wearing a reasoning costume. Frontier labs barely flinched there, which is exactly why GSM1k still has signal and the original GSM8K does not.
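The GSM1k comparison comes down to one subtraction: accuracy on the public set minus accuracy on a fresh, style-matched rebuild. A minimal sketch, where the function name and the sample accuracies are illustrative assumptions and not figures from Scale's paper:

```python
def overfit_gap(public_acc: float, fresh_acc: float) -> float:
    """Percentage-point drop from the public benchmark to a fresh rebuild."""
    return round(public_acc - fresh_acc, 1)


# Hypothetical accuracies (percent), chosen only to show the two patterns.
memorizer = overfit_gap(public_acc=85.0, fresh_acc=72.0)    # large drop
generalizer = overfit_gap(public_acc=95.0, fresh_acc=94.5)  # barely moves
print(memorizer, generalizer)
```

A large gap suggests the public score was inflated by exposure to the test set; a gap near zero suggests the skill transfers to unseen problems.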

GPQA Diamond is the trap people still fall for. Its top four span about half a point, one question on a 198-item test. That is a tiebreaker, not a leaderboard.
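The "one question" arithmetic can be checked directly, along with a rough sense of sampling noise on a test this small. The sketch below assumes a top accuracy of 90 percent purely for illustration (the article does not state the GPQA Diamond leaders' scores) and uses a plain binomial standard error, which treats questions as independent and is only an approximation:

```python
import math

N_QUESTIONS = 198  # GPQA Diamond size, as stated in the text


def points_per_question(n: int) -> float:
    """How many percentage points a single question is worth."""
    return 100.0 / n


def standard_error_pp(accuracy_pct: float, n: int) -> float:
    """Binomial standard error of an accuracy score, in percentage points."""
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


ASSUMED_TOP_ACCURACY = 90.0  # hypothetical, not taken from any leaderboard

print(round(points_per_question(N_QUESTIONS), 2))
print(round(standard_error_pp(ASSUMED_TOP_ACCURACY, N_QUESTIONS), 2))
```

Under that assumption the standard error comes out several times larger than a half-point spread, which is the sense in which the top four cannot be told apart.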

The clearest tell that SWE-bench Verified no longer separates the top tier: Claude Opus 4.5 scores 80.9 percent on Verified and 45.9 percent on SWE-bench Pro. That is a 35-point gap, which is why labs now cite Pro right next to Verified.

What we trust: SWE-bench Pro for agentic coding, and Humanity's Last Exam, where the top still spreads across real gaps (45.7, 44.7, 44.3). LiveCodeBench earns its spot for a different reason. Its leader is near 92 percent too, so low scores are not the argument; the argument is that it only ever scores post-cutoff problems, so contamination cannot pile up over time.
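The post-cutoff idea is easy to picture as a date filter: only problems released after a model's training cutoff count toward its score. This is a sketch of the principle, not LiveCodeBench's actual harness; the `Problem` type, the function name, and all dates are hypothetical:

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    problem_id: str
    released: date


def post_cutoff(problems: list[Problem], training_cutoff: date) -> list[Problem]:
    """Keep only problems released strictly after the model's training cutoff."""
    return [p for p in problems if p.released > training_cutoff]


# Hypothetical problem pool and cutoff date.
problems = [
    Problem("a", date(2025, 1, 10)),
    Problem("b", date(2025, 9, 2)),
    Problem("c", date(2026, 2, 14)),
]
eligible = post_cutoff(problems, training_cutoff=date(2025, 6, 30))
print([p.problem_id for p in eligible])  # ['b', 'c']
```

Because the eligible set moves forward with each model's cutoff, a problem that has leaked into training data drops out of scoring for the models that could have seen it.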

Caveat: saturated does not mean useless. A model that flunks MMLU is still disqualified. Use the dead ones to screen out, never to crown a winner. Run your own task-specific eval before you commit budget. The leaderboard is marketing; your workload is the truth.
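The screen-then-rank rule can be sketched as a two-step selection: saturated benchmarks act only as pass/fail floors, and the ordering comes from an eval that still discriminates. Every model name, score, and threshold below is made up for illustration:

```python
def select_model(
    candidates: dict[str, dict[str, float]],
    floors: dict[str, float],
    rank_on: str,
) -> list[str]:
    """Drop models below any floor, then order survivors by the ranking eval."""
    survivors = [
        name
        for name, scores in candidates.items()
        if all(scores.get(bench, 0.0) >= minimum for bench, minimum in floors.items())
    ]
    return sorted(survivors, key=lambda name: candidates[name][rank_on], reverse=True)


# Hypothetical scores; "own_eval" stands in for your task-specific eval.
candidates = {
    "model_a": {"mmlu": 91.0, "own_eval": 62.0},
    "model_b": {"mmlu": 92.0, "own_eval": 55.0},
    "model_c": {"mmlu": 70.0, "own_eval": 80.0},
}
ranked = select_model(candidates, floors={"mmlu": 85.0}, rank_on="own_eval")
print(ranked)  # ['model_a', 'model_b']
```

Note that model_b's higher MMLU score does not move it ahead of model_a: the saturated benchmark only decides who stays in the pool, and model_c is screened out despite the best task score.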

The evidence

The data points behind this verdict. Each is cited so you can check the call against its source.

MMLU is saturated for frontier models, which cluster at 90 to 92 percent and no longer discriminate; the current MMLU leader GPT-5.5 reports 92.4 percent.

saturated, top ~92.4% (GPT-5.5)

TokenMix, GPT-5.5 Review (2026)

MMLU-Pro, built to escape MMLU saturation, is itself nearing 90 percent at the top, with Gemini 3 Pro around 90.1 percent.

Gemini 3 Pro ~90.1% on MMLU-Pro

IntuitionLabs, MMLU-Pro Explained

HumanEval is saturated: frontier models clear pass@1 above 95 percent, so it no longer separates the top tier and both leading labs note its ceiling.

frontier pass@1 above 95%, saturated

OpenAI / Anthropic model documentation on HumanEval saturation

Scale's GSM1k study found accuracy drops of up to 13 percentage points on fresh, style-matched problems for the Phi and Mistral families, evidence of overfitting to the original GSM8K; frontier model families (GPT, Claude) showed minimal overfit.

up to 13pp drop, Phi/Mistral overfit most, frontier minimal

Scale Labs, A Careful Examination of LLM Performance on Grade School Arithmetic

GPQA Diamond has reached near-ceiling clustering: the top four models span about 0.5 percentage points, roughly one question on a 198-question test, making them statistically indistinguishable.

top 4 within ~0.5pp, ~1 question on 198

Artificial Analysis, GPQA Diamond Leaderboard

SWE-bench Verified no longer separates the top tier: Claude Opus 4.5 scores 80.9 percent on Verified but only 45.9 percent on SWE-bench Pro, a 35-point gap, which is why labs now cite Pro alongside Verified.

80.9% Verified vs 45.9% Pro (~35pp gap)

CodeAnt, SWE-bench Leaderboard 2026

Humanity's Last Exam still produces real spread among frontier models: the leader scores about 45.7 percent (Claude Opus 4.8), with Gemini 3.1 Pro Preview at 44.7 and GPT-5.5 at 44.3, on a held-out frontier set.

HLE leader ~45.7%, Gemini 3.1 Pro ~44.7%, GPT-5.5 ~44.3%

Artificial Analysis, Humanity's Last Exam Leaderboard

LiveCodeBench resists contamination by design rather than by low scores: it only scores post-cutoff problems, so memorization cannot accumulate over time, though its leader is itself near-saturated (about 91.7 percent for Gemini 3 Pro Preview).

post-cutoff design; leader ~91.7% (Gemini 3 Pro Preview)

LiveCodeBench: Holistic and Contamination Free Evaluation (arXiv 2403.07974); leader figure via PricePerToken

Caveats

All leaderboard figures are point-in-time (29 May 2026) and vary by harness, prompt, and reasoning setting. Several come from secondary aggregators (TokenMix, IntuitionLabs, CodeAnt, PricePerToken) rather than the labs' own model cards, and the HLE numbers are from Artificial Analysis's text-only subset.

The GPQA Diamond ~0.5pp top-four cluster is sourced to the Artificial Analysis leaderboard, which shows the tight spread; some other writeups (such as IntuitionLabs' own table) show a wider 2 to 3 point spread depending on which models and harness they include.

The LiveCodeBench post-cutoff, contamination-free design is documented in the benchmark's own paper and site, with PricePerToken cited only for the current leader figure. Its leader is itself near-saturated, so the benchmark is recommended for its contamination-resistant design, not for low headline scores.

The GSM1k 13-point figure is from Scale's 2024 paper and applies mainly to smaller open-weight families; frontier model families (GPT, Claude) showed minimal overfit, which is why GSM1k retains signal.

The SWE-bench Verified-versus-Pro gap reflects harder task design as well as possible contamination, so the 35-point gap is not pure leakage. OpenAI still reports SWE-bench Verified (GPT-5.5 reports 88.7 percent), so the issue is that Verified no longer separates the top tier, not that labs abandoned it.

None of this means saturated benchmarks are worthless as a floor or disqualifier; the ruling is specifically against using them to rank frontier models against each other. Always finish with your own task-specific eval.

A TF Verdict is TensorFeed's own analysis over cited public data, not a republished dataset. We take a clear position, show the evidence and the sources, and date-stamp the call because the answer can change. Disagree with a data point? Follow the source link and check it yourself.
