Benchmarks

Free
GET /api/benchmarks

The /api/benchmarks endpoint returns benchmark scores for major AI models across SWE-bench (real software engineering tasks), MMLU-Pro (general reasoning), HumanEval (code generation), GPQA Diamond (graduate-level science), and MATH (competition math). Scores are updated weekly as new results are published.
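
For agents that call the API directly rather than through an SDK, this is a plain GET request. The sketch below uses Python's requests library; the base URL is a placeholder assumption, not a documented host, and the ok, lastUpdated, benchmarks, and models fields follow the example response further down.

import requests

# Placeholder host -- substitute your actual TensorFeed base URL.
BASE_URL = "https://api.tensorfeed.example"

resp = requests.get(f"{BASE_URL}/api/benchmarks", timeout=10)
resp.raise_for_status()
data = resp.json()

if data["ok"]:
    print(f"Benchmark data as of {data['lastUpdated']}")
    print(f"{len(data['benchmarks'])} benchmarks, {len(data['models'])} models tracked")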

When to use this endpoint

Use this endpoint when your agent needs to compare model capability on a specific dimension. For per-benchmark leaderboard views, see /benchmarks/[name]; for a time series of one model on one benchmark, use /api/premium/history/benchmarks/series.
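
As a concrete example, the sketch below uses the Python SDK shown under Code samples to pick the strongest model on a single dimension, HumanEval code generation, via the human_eval key from the example response. This is an illustrative sketch, not part of the SDK itself.

from tensorfeed import TensorFeed

tf = TensorFeed()
data = tf.benchmarks()

# Compare models on one dimension -- here HumanEval (code generation).
# Models that have not reported a human_eval score are skipped.
best = max(
    (m for m in data["models"] if "human_eval" in m["scores"]),
    key=lambda m: m["scores"]["human_eval"],
)
print(f"{best['model']} ({best['provider']}): {best['scores']['human_eval']}")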

Example response

{
  "ok": true,
  "lastUpdated": "2026-04-24",
  "benchmarks": [
    { "id": "swe_bench", "name": "SWE-bench", "description": "Real GitHub issue resolution", "maxScore": 100 }
  ],
  "models": [
    {
      "model": "Claude Opus 4.7",
      "provider": "Anthropic",
      "scores": { "swe_bench": 65.4, "mmlu_pro": 93.8, "human_eval": 96.2 }
    }
  ]
}
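
The benchmarks array carries the metadata (id, display name, description, maxScore) and each entry in models carries raw scores keyed by benchmark id. The sketch below, which assumes data holds a parsed response (from the SDK or the raw GET above), joins the two and normalizes each score against its benchmark's maxScore; it is illustrative only.

# Join benchmark metadata with per-model scores, normalizing to 0-1
# using each benchmark's maxScore.
benchmark_meta = {b["id"]: b for b in data["benchmarks"]}

table = {}
for entry in data["models"]:
    table[entry["model"]] = {
        benchmark_meta[bid]["name"]: score / benchmark_meta[bid]["maxScore"]
        for bid, score in entry["scores"].items()
        if bid in benchmark_meta
    }

# e.g. table["Claude Opus 4.7"]["SWE-bench"] -> 0.654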

Code samples

Python SDK

from tensorfeed import TensorFeed

tf = TensorFeed()
b = tf.benchmarks()
# Sort models by SWE-bench desc
ranked = sorted(b["models"], key=lambda m: m["scores"].get("swe_bench", 0), reverse=True)

TypeScript SDK

import { TensorFeed } from 'tensorfeed';

const tf = new TensorFeed();
const { models } = await tf.benchmarks();
// Rank models by SWE-bench score, descending; skip models without one.
const top = models
  .filter(m => m.scores.swe_bench != null)
  .sort((a, b) => b.scores.swe_bench - a.scores.swe_bench);

FAQ

Where do the benchmark scores come from?

Scores come from each benchmark's official leaderboard, supplemented where applicable by vendor-published numbers verified against the benchmark's test methodology. We do not run independent benchmark evaluations.

How current are the benchmark scores?

Benchmark data is refreshed by the daily catalog cron, with score changes typically landing weekly as new results are published. Newly launched models usually appear within a few days of their public score publication.

Related endpoints