One Day, Eight New Free APIs: The Free-Data-First Sprint
Today started with an audit that killed two paid endpoints. It ended with eight new free ones live. Somewhere in the middle the strategy crystallized: TensorFeed is going free-data-first. The premium tier is going to be the reasoning we add on top of clean public data, not a gate around the data itself. This is the post-mortem of that pivot, eighteen commits of it, and the rubric that made it possible.
The morning: kill what we cannot defend
The day opened with a redistribution-rights audit of every premium endpoint we sell. Sixteen endpoints: eight green, six yellow, two red. The two reds were a GPU-pricing series sourced from Vast.ai (their ToS prohibits redistribution outright) and an LLM benchmarks series merging in the Hugging Face Open LLM Leaderboard (HF retains rights to the compiled leaderboard, even though the benchmark scores themselves are facts).
Both got cut by lunch. Vast.ai was removed entirely, the GPU pricing series moved from premium to free (factual price data carries less legal exposure on the free tier anyway), and the benchmarks ingest was rebuilt on hand-curated vendor evals (Anthropic model cards, OpenAI eval tables, Google AI blog, Meta Llama benchmarks). The full post-mortem for that part of the day is here.
The audit produced something more durable than the cleanup itself: a three-bucket grader. Every upstream gets graded green (license explicitly permits paid redistribution, or first-party / public-domain factual), yellow (commercial use OK but redistribution unclear or limited, RSS-style fair-use territory), or red (prohibits redistribution outright or requires a paid license we don't have). That grader was the load-bearing piece for the rest of the day.
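The grader is small enough to write down. A hypothetical TypeScript rendering, not our actual schema; the type and field names are illustrative:

```typescript
// Hypothetical sketch of the three-bucket grader plus a ship gate.
type LicenseGrade = "green" | "yellow" | "red";

interface UpstreamGrade {
  source: string;         // e.g. "nflverse-data"
  grade: LicenseGrade;
  clause: string;         // the ToS or license clause the grade rests on, quoted
  mitigations?: string[]; // yellow-only: snippet caps, mandatory link, source field
}

// Green ships, yellow ships only with mitigations attached, red never ships.
function canShip(g: UpstreamGrade): boolean {
  if (g.grade === "green") return true;
  if (g.grade === "yellow") return (g.mitigations?.length ?? 0) > 0;
  return false;
}
```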
The afternoon: stack clean sources
With the rubric proven and the pattern repeatable, the rest of the day was about velocity on legally-clean sources. Eight new free endpoints landed:
- /api/sports/nfl + /api/sports/mlb: 32 NFL teams + 30 MLB teams (factual, public-domain), aggregated news from ESPN/NFL.com/MLB.com/CBS/Yahoo, and for NFL: players + schedule from nflverse-data (CC-BY-4.0).
- /api/gpu/pricing: Lambda Labs added as a second source after the Vast.ai removal. Their public pricing page has a permissive ToS, a monthly update cadence, and marketing-stable prices.
- /api/packages/npm/ai-trending + /api/packages/pypi/ai-trending: ~78 curated AI/ML packages across the npm and PyPI ecosystems, ranked by recent downloads (a sketch of the ranking step follows this list). Sources: the documented public npm downloads API and pypistats.org (Linehaul / PyPI BigQuery public dataset).
- /api/research/institutions/ai: Top 100 institutions worldwide ranked by AI-tagged publications in the last 365 days. Source: OpenAlex (CC0 public domain).
- /api/economy/bls/indicators + /api/economy/fred/indicators: Curated 20-series macro matrix. BLS owns labor + prices + jobs (CPI, unemployment, payrolls, JOLTS). FRED owns rates + money + commodities + dollar (fed funds, treasuries, GDP, M2, mortgage rate, USD index, oil). Both public-domain US government data.
- /api/policy/ai/registry: Editorial registry of significant AI policy actions across six jurisdictions (US Federal, US State, EU, UK, China, International). Sixteen entries from Biden EO 14110 through the EU AI Act phased rollout to the Bletchley and Seoul declarations.
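The ranking step the trending item mentions is just "fetch recent downloads per curated package, sort descending." A hedged sketch against the documented public npm downloads API (the pypistats.org side would look the same); the curated slice and error handling here are illustrative, not our production config:

```typescript
// Illustrative curated slice; the real list is ~78 hand-picked packages.
const CURATED_NPM = ["openai", "@anthropic-ai/sdk", "langchain"];

async function npmWeeklyDownloads(pkg: string): Promise<number> {
  // Scoped names keep their literal slash in this API's URL path.
  const res = await fetch(`https://api.npmjs.org/downloads/point/last-week/${pkg}`);
  if (!res.ok) return 0; // one missing package should not sink the whole ranking
  const body = (await res.json()) as { downloads?: number };
  return body.downloads ?? 0;
}

async function rankByDownloads(pkgs: string[]) {
  const rows = await Promise.all(
    pkgs.map(async (name) => ({ name, downloads: await npmWeeklyDownloads(name) }))
  );
  return rows.sort((a, b) => b.downloads - a.downloads);
}
```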
Every endpoint follows the same shape: a structured attribution block in the response payload that names the source, the license, the policy, and the upstream URL. Agents read the legal posture from the wire format, not from our docs.
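Concretely, the block might look like this. A sketch only: the post names the fields, but the exact schema and values below are illustrative, with nflverse-data as the example upstream:

```typescript
// Illustrative attribution block; field names follow the post
// (source, license, license URL, policy, upstream URL), schema is a sketch.
interface Attribution {
  source: string;       // human-readable upstream name
  license: string;      // SPDX-style identifier or short label
  license_url: string;  // where the license or ToS lives
  policy: string;       // our redistribution posture in one line
  upstream_url: string;
}

const nflverseExample: Attribution = {
  source: "nflverse-data",
  license: "CC-BY-4.0",
  license_url: "https://creativecommons.org/licenses/by/4.0/",
  policy: "redistributable-with-attribution",
  upstream_url: "https://github.com/nflverse/nflverse-data",
};
```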
The rubric that made eight in a day possible
Eight clean sources is a lot of upstream surface to verify. The rubric that made it tractable is six steps, in order:
- Verify the upstream ToS. Read the actual terms of service. Quote the relevant clauses. Do not assume “public API” means “free to redistribute commercially”; many public-API lists conflate the two, and Sleeper's ToS is a perfect example of how wrong that assumption can be.
- Three-bucket grade. Green / yellow / red. Green ships. Yellow ships with mitigation (RSS-style snippet caps, mandatory link, source field). Red does not ship.
- Curated seed where applicable. Don't try to be a full mirror of the upstream. For npm and PyPI we hand-curated the AI/ML slice. For BLS and FRED we picked the high-signal series. Curation is editorial; the underlying data is factual.
- Fetch and KV-write. Daily cron in most cases (research, packages, economic indicators), hourly for news, none for editorial registries. Each source picks the cadence that matches both the upstream update cadence and our KV-ops budget. A sketch of this step follows the list.
- Structured attribution in the response shape. Every endpoint ships an attribution block. Agents see source, license, license URL, and policy in the payload itself.
- Tests + meta + llms.txt. Pure-logic unit tests on the parser and read paths. Add the new endpoint to /api/meta so the discovery surface stays honest. Add it to /llms.txt so the agents reading our llms.txt see it.
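For the fetch-and-KV-write step, a minimal sketch as a Cloudflare Worker scheduled handler. The binding name, KV key, and the specific upstream call are assumptions for illustration, not our production wiring:

```typescript
// Hedged sketch of a daily fetch-and-KV-write cron.
export interface Env {
  DATA: KVNamespace; // hypothetical KV binding
}

export default {
  async scheduled(_controller: ScheduledController, env: Env): Promise<void> {
    // Daily cadence example: one upstream snapshot per cron fire.
    const res = await fetch("https://api.openalex.org/institutions?per-page=100");
    if (!res.ok) return; // keep the last good snapshot on upstream failure

    const snapshot = await res.json();
    await env.DATA.put(
      "research:institutions:ai", // hypothetical KV key
      JSON.stringify({ fetched_at: new Date().toISOString(), data: snapshot })
    );
  },
};
```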
That rubric ran nine times today (sports V1, sports V2, npm, OpenAlex, BLS, MLB, policy, PyPI, FRED, plus a follow-up admin-trigger commit). Each run took 1 to 1.5 hours of careful work. Running them in parallel sessions would have been faster, but going slow and right meant zero rework: 742 worker tests passing, zero compile errors, zero failed deploys.
Why free-data-first is the actual strategy
The version of TensorFeed that gates raw data behind paywalls is the version that loses. Open data is everywhere; if we charge for what someone can get from fred.stlouisfed.org or openalex.org or api.npmjs.org with one extra click, we are friction, not value.
The version that wins is the one that takes those eight upstreams and makes them agent-shaped. Same JSON envelope, same attribution block, same filter syntax across economy, research, packages, sports, policy. Same predictable rate-limit posture, same predictable refresh cadence. An agent that learns how one TensorFeed endpoint works has effectively learned how all of them work.
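Written down as a type, the uniformity claim is roughly this. A sketch, assuming TypeScript and a hypothetical envelope name, not our published schema:

```typescript
// Illustrative: one envelope across every domain, so an agent that parses
// /api/economy/fred/indicators already knows how to parse /api/sports/nfl.
interface TensorFeedEnvelope<T> {
  endpoint: string;   // which endpoint produced this payload
  fetched_at: string; // ISO timestamp of the snapshot
  attribution: {      // same attribution block on every endpoint
    source: string;
    license: string;
    license_url: string;
    policy: string;
    upstream_url: string;
  };
  data: T; // the only part that varies by domain
}
```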
That uniformity is the moat. The premium tier becomes the place we add reasoning on top: cross-source joins, weekly trend metrics, capability heatmaps, cost-optimization recommendations, watches, alerts. The compute is the value, not the gate.
And paying for clean data is the wrong frame anyway. Most of what agents actually need is public-domain or permissive-license data. The hard part is not paying for access; the hard part is the eight upstream ToS reads and the three-bucket grader and the structured attribution in every response. We have done that work. Eight times today, in fact.
What ships next
The crons we wired today fire overnight. By tomorrow morning the npm, PyPI, OpenAlex, BLS, FRED, and nflverse endpoints will all have their first populated snapshots. Lambda Labs is already live. Sports news is already polling hourly. Eight new endpoints, eight new clean upstreams, all attributed in the response shape.
The next session is the first premium derived endpoint. Something like /api/premium/macro/digest that joins BLS and FRED into a single agent-shaped morning brief: rates trend, inflation trend, employment trend, week-over-week movement, in one paid call. That validates the “premium = compute we add” thesis on the foundation we built today.
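If the thesis holds, that digest is mostly a join over snapshots the free tier already maintains. A minimal sketch, assuming hypothetical KV keys and snapshot shapes, using real FRED/BLS series IDs (DGS10, CUUR0000SA0, LNS14000000) purely as illustrations:

```typescript
// Hedged sketch of the derived digest: join the free BLS and FRED snapshots
// already in KV into one paid response.
interface SeriesPoint { date: string; value: number }
type Snapshot = Record<string, SeriesPoint[]>; // series id -> observations

function weekOverWeek(points: SeriesPoint[]): number | null {
  if (points.length < 2) return null;
  return points[points.length - 1].value - points[points.length - 2].value;
}

async function buildMacroDigest(kv: KVNamespace) {
  const fred = (await kv.get("economy:fred:indicators", "json")) as Snapshot | null;
  const bls = (await kv.get("economy:bls:indicators", "json")) as Snapshot | null;
  if (!fred || !bls) return null; // first snapshots land overnight

  return {
    rates_trend: weekOverWeek(fred["DGS10"] ?? []),           // 10-year treasury yield
    inflation_trend: weekOverWeek(bls["CUUR0000SA0"] ?? []),  // CPI-U, all items
    employment_trend: weekOverWeek(bls["LNS14000000"] ?? []), // unemployment rate
  };
}
```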
After that, more sources: USPTO patents, Wikidata, more sports leagues, more macro series. The rubric ran nine times today and broke zero times. Tomorrow it runs again.
The eight new endpoints are live now under /sports, /economy, /research/institutions, /packages, and /policy. The full audit history (today's eighteen commits) is on the public repo. Every paid endpoint we still sell carries a structured attribution block telling you where the data came from. That is the brand now: free data, clean licenses, agent-shaped, transparent.