
Benchmark Series

1 credit
GET /api/premium/history/benchmarks/series

The benchmark series endpoint returns the daily score evolution for a single benchmark on a single model. Use it to track whether a model is improving (the provider shipped a fine-tune), regressing (the provider downgraded the API endpoint to a smaller model), or holding steady.

When to use this endpoint

Use it when a research agent needs to track a benchmark's trajectory over time. For a snapshot leaderboard at a single date, use /benchmarks/[name] instead.

Parameters

Name         In      Type     Description
model *      query   string   Model id or display name
benchmark *  query   string   Benchmark key (swe_bench, mmlu_pro, gpqa_diamond, math, human_eval)
from         query   string   Start date, YYYY-MM-DD (UTC)
to           query   string   End date, YYYY-MM-DD (UTC)

* required
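
If you are not using an SDK, the endpoint can be called directly over HTTPS. Below is a minimal Python sketch; the base URL is a placeholder and the Bearer-token header is an assumption, so substitute the host and auth scheme your account actually uses.

# Raw call to the series endpoint with requests.
# BASE_URL is a placeholder; the Authorization scheme is assumed.
import requests

BASE_URL = "https://api.tensorfeed.example"
TOKEN = "tf_live_..."

resp = requests.get(
    f"{BASE_URL}/api/premium/history/benchmarks/series",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={
        "model": "Claude Opus 4.7",
        "benchmark": "swe_bench",
        "from": "2026-04-01",  # optional start date, YYYY-MM-DD (UTC)
        "to": "2026-04-27",    # optional end date, YYYY-MM-DD (UTC)
    },
    timeout=30,
)
resp.raise_for_status()
series = resp.json()
print(series["summary"]["delta_pp"])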

Example response

{
  "ok": true,
  "model": "Claude Opus 4.7",
  "benchmark": "swe_bench",
  "points": [
    { "date": "2026-04-01", "score": 70.0 },
    { "date": "2026-04-27", "score": 73.4 }
  ],
  "summary": { "first": { "score": 70.0 }, "latest": { "score": 73.4 }, "delta_pp": 3.4 }
}
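
To turn a series into the improving / regressing / holding-steady signal described above, a small post-processing step is enough. A minimal Python sketch, assuming the response shape shown here (the 1 pp threshold is an illustrative cutoff, not part of the API):

def classify_trend(series, threshold_pp=1.0):
    # series is the parsed JSON response from the series endpoint;
    # threshold_pp is an arbitrary illustration of "meaningful movement".
    delta = series["summary"]["delta_pp"]
    if delta >= threshold_pp:
        return "improving"
    if delta <= -threshold_pp:
        return "regressing"
    return "steady"

With the example response above, classify_trend returns "improving" (delta_pp is 3.4).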

Code samples

Python SDK

from tensorfeed import TensorFeed

tf = TensorFeed(token="tf_live_...")
s = tf.benchmark_series(model="Claude Opus 4.7", benchmark="swe_bench")
print(f"SWE-bench moved {s['summary']['delta_pp']} pp")

TypeScript SDK

import { TensorFeed } from 'tensorfeed';

const tf = new TensorFeed({ token: 'tf_live_...' });
const s = await tf.benchmarkSeries({ model: 'Claude Opus 4.7', benchmark: 'swe_bench' });
console.log(`SWE-bench moved ${s.summary.delta_pp} pp`);

MCP tool

Available via the TensorFeed MCP server as benchmark_series. Add npx -y @tensorfeed/mcp-server to your Claude Desktop or Claude Code MCP config.
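
For reference, a Claude Desktop entry for the server could look like the snippet below. The tensorfeed key name and the TENSORFEED_TOKEN environment variable are illustrative assumptions; check the @tensorfeed/mcp-server documentation for the exact settings.

{
  "mcpServers": {
    "tensorfeed": {
      "command": "npx",
      "args": ["-y", "@tensorfeed/mcp-server"],
      "env": { "TENSORFEED_TOKEN": "tf_live_..." }
    }
  }
}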

FAQ

Why would a benchmark score change over time on the same model?

Three common reasons: the provider released a fine-tune or a new system prompt and the score was updated, the test methodology changed (e.g. the SWE-bench Verified subset gained new tasks), or the score was recalculated against a different harness. Tracking the trajectory surfaces these changes.

Related endpoints