AI Crawler Access Map
Which AI bots a curated universe of agent-relevant domains allow or block in their robots.txt, plus who publishes llms.txt and ai.txt. The open web is quietly deciding which crawlers it lets in. This is the running tally.
Machine-readable JSON: /api/ai-crawler-access/summary.json
Every day we read the public robots.txt of part of the universe, assign a per-bot verdict, and check for llms.txt and ai.txt at the root. The aggregate below is live: it polls the same free endpoint agents use, so the human view and the machine view never disagree. Pair it with /agent-traffic (which bots actually hit us) to see both sides of the crawler relationship.
Methodology and honest limits
Stated policy, not enforcement
robots.txt is a declaration, not a barrier. A blocked verdict means the site asks that crawler not to fetch anything from the root down; it does not prove the crawler obeys, and it does not mean the content is technically unreachable. We document the policy publishers control. Compliance is voluntary and outside what this feed claims to measure.
Unknown is never allowed
When robots.txt cannot be read (timeout, network error, or a non-2xx response), every bot for that domain is recorded as unknown, not allowed. Blocked and allowed percentages are computed only over known verdicts, so a brief outage on a few sites does not quietly inflate the allowed share.
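As a minimal sketch of the denominator rule above, the function below computes shares over known verdicts only. The function name, the output keys, and the choice to count partial verdicts in the denominator are assumptions for illustration, not the feed's actual implementation.

```python
from collections import Counter

# Verdicts that count as "known". Including "partial" in the denominator
# is an assumption; the page only says unknown verdicts are excluded.
KNOWN = ("allowed", "blocked", "partial")


def verdict_shares(verdicts):
    """Return per-verdict percentages computed over known verdicts only."""
    counts = Counter(verdicts)
    known = sum(counts[v] for v in KNOWN)
    shares = {"known": known, "unknown": counts["unknown"]}
    for v in KNOWN:
        # With no known verdicts there is no meaningful percentage.
        shares[f"{v}_pct"] = round(100 * counts[v] / known, 1) if known else None
    return shares


sample = ["blocked", "allowed", "allowed", "partial", "unknown", "unknown"]
print(verdict_shares(sample))
```

For this sample the allowed share is 50.0 (2 of 4 known verdicts). Folding the two unknowns into allowed would instead report 4 of 6, about 66.7, which is the inflation the rule is meant to prevent.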
Rolling daily crawl
Each daily run refreshes about one seventh of the universe, so every domain is re-checked roughly weekly and the first snapshot fills in over about a week. The summary reports domains with data versus domains tracked so coverage is never overstated while the map seeds.
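One way a one-seventh daily rotation could work is sketched below. The page does not say how domains are assigned to days, so the hash-based bucketing and the names crawl_bucket and due_today are purely hypothetical.

```python
import hashlib


def crawl_bucket(domain: str, buckets: int = 7) -> int:
    """Map a domain to a stable bucket in [0, buckets) via a SHA-256 hash."""
    digest = hashlib.sha256(domain.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def due_today(domains, day_index: int, buckets: int = 7):
    """Return the domains whose bucket matches today's slot in the rotation."""
    slot = day_index % buckets
    return [d for d in domains if crawl_bucket(d, buckets) == slot]


# Hypothetical universe; over any 7 consecutive days each domain is due once.
universe = ["a.example", "b.example", "c.example", "d.example"]
for day in range(7):
    print(day, due_today(universe, day))
```

A stable hash keeps each domain on the same weekday slot between runs, so every domain is revisited once per seven-day cycle regardless of how the list is ordered.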
Freshness SLA
The premium full dataset and change log carry an eight-day freshness SLA. If the snapshot is older than that window, the request is not charged. The captured-at timestamp on every response reflects the real data-capture time, never the wall-clock moment you called.
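The charging rule above amounts to comparing the snapshot's capture time against an eight-day window. The sketch below assumes a timezone-aware capture timestamp and treats a snapshot exactly eight days old as still within the SLA; the function name and that boundary choice are assumptions.

```python
from datetime import datetime, timedelta, timezone

FRESHNESS_SLA = timedelta(days=8)


def is_chargeable(captured_at: datetime, now: datetime) -> bool:
    """A request is charged only if the snapshot is within the SLA window."""
    return now - captured_at <= FRESHNESS_SLA


now = datetime(2025, 1, 10, tzinfo=timezone.utc)
print(is_chargeable(datetime(2025, 1, 3, tzinfo=timezone.utc), now))  # 7 days old
print(is_chargeable(datetime(2025, 1, 1, tzinfo=timezone.utc), now))  # 9 days old
```

Passing the current time in explicitly keeps the check testable and mirrors the point that freshness is judged against the capture time, not the moment of the call.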
Free agent endpoints
/api/ai-crawler-access/summary.json
Aggregate: per-bot blocked and allowed percentages, llms.txt and ai.txt adoption, and a per-sector rollup. No parameters. Same payload this page renders.
/api/ai-crawler-access/site?domain=
One domain: the per-bot robots.txt verdict plus llms.txt and ai.txt presence. Required param: domain.
The full dataset and the historical change log are premium at one credit each. See /developers/agent-payments.
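A minimal client for the two free endpoints might look like the sketch below. The paths and the domain parameter come from this page; BASE_URL is a placeholder you must replace with the real host, and the helper names are hypothetical. The response schema is not documented here, so the code returns the parsed JSON without assuming any field names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Placeholder host (assumption): replace with the site's real origin.
BASE_URL = "https://example.invalid"


def summary_url(base: str = BASE_URL) -> str:
    """URL of the aggregate endpoint, which takes no parameters."""
    return f"{base}/api/ai-crawler-access/summary.json"


def site_url(domain: str, base: str = BASE_URL) -> str:
    """URL of the per-domain endpoint with the required domain parameter."""
    return f"{base}/api/ai-crawler-access/site?{urlencode({'domain': domain})}"


def fetch_json(url: str, timeout: float = 10.0):
    """GET a URL and parse the body as JSON. No auth headers are sent."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Only prints the URLs; call fetch_json(...) once BASE_URL is real.
    print(summary_url())
    print(site_url("example.com"))
```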
Frequently asked questions
- What does this page measure?
- For a curated universe of roughly 300 agent-relevant domains, we read each site's public robots.txt and assign a per-bot verdict (allowed, blocked, partial, or unknown) for 14 named AI crawlers. We also check whether the site publishes an llms.txt or ai.txt file. The numbers on this page are the aggregate of those per-site verdicts: what share of sites block GPTBot at the root, how many publish llms.txt, and so on.
- Does a blocked verdict mean the bot actually cannot crawl the site?
- No. We report stated policy, not enforcement. robots.txt is a request, not a wall. A site can list Disallow for ClaudeBot and that bot can still ignore it; a site can stay silent and a well-behaved bot will still crawl. We are documenting what each site declares in its robots.txt, which is the signal publishers control and the one most agents are supposed to honor. Whether a given crawler complies is a separate question we do not claim to answer.
- How do you turn robots.txt into a verdict?
- Deterministically. We pick the most specific matching user-agent group (an exact token match beats the wildcard group), then look at root access. A Disallow of the root with no equal-or-longer Allow override is blocked. A non-root Disallow with no root Disallow is partial. An empty Disallow, an Allow that overrides, or no matching group at all is allowed (absence of a rule is permission, per the standard). If we cannot read robots.txt at all (timeout, network error, non-2xx), every bot for that domain is recorded as unknown, never as allowed.
- Why does unknown matter, and how is it counted?
- Honesty in the denominator. Blocked and allowed percentages are computed only over known verdicts. Domains where we could not read robots.txt are excluded from the math entirely rather than silently folded into allowed. That keeps the headline percentages from drifting just because a few sites were briefly unreachable.
- How fresh is the data, and why does coverage fill in gradually?
- The crawl is rolling: each daily run refreshes about one seventh of the universe, so the full set re-checks roughly weekly and the first snapshot fills in over about a week. The premium endpoints carry an eight-day freshness SLA, and if the snapshot is stale past that window the request is not charged. The summary on this page reports domains with data versus domains tracked so you can see the coverage build in real time.
- What are llms.txt and ai.txt, and why track them here?
- They are emerging, opt-in conventions at a site root for telling AI systems what they may use and how. llms.txt points models at the content a site wants surfaced; ai.txt is used by some sites to declare AI usage terms. Tracking their adoption alongside robots.txt verdicts gives a fuller picture of how the open web is choosing to engage with AI crawlers: not just who is blocked, but who is actively inviting.
- Which bots do you track?
- Fourteen named AI crawlers: GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, Claude-SearchBot, Claude-User, PerplexityBot, CCBot, Google-Extended, Bytespider, Amazonbot, Applebot-Extended, Meta-ExternalAgent, cohere-ai. The list covers the major training, search, and user-triggered agents from OpenAI, Anthropic, Perplexity, Common Crawl, Google, ByteDance, Amazon, Apple, Meta, and Cohere.
- How do I pull this programmatically?
- Two free endpoints, no auth. GET /api/ai-crawler-access/summary.json for the aggregate, and GET /api/ai-crawler-access/site?domain=example.com for one site. The full dataset and the historical change log (when a site flips a bot from allowed to blocked, or publishes llms.txt) are premium at one credit each.
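The verdict rules described in the FAQ above can be sketched as a small classifier. This is an illustrative reading of those rules, not the feed's actual code: the function names are hypothetical, passing None stands in for an unreadable robots.txt, and a root Disallow that is overridden by an Allow is treated as allowed even if other non-root Disallow lines exist, which is one interpretation of the wording.

```python
import re


def parse_groups(robots_txt):
    """Parse robots.txt into {lowercased user-agent: [(field, pattern), ...]}."""
    groups, current, seen_rule = {}, [], False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if seen_rule:  # a user-agent line after rules starts a new group
                current, seen_rule = [], False
            current.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            seen_rule = True
            for agent in current:
                groups[agent].append((field, value))
    return groups


def matches_root(pattern):
    """True if a robots.txt path pattern ('*' wildcard, '$' anchor) matches '/'."""
    body = "".join(".*" if ch == "*" else re.escape(ch) for ch in pattern.rstrip("$"))
    if pattern.endswith("$"):
        body += "$"
    return re.match(body, "/") is not None


def classify(robots_txt, bot):
    """Return 'allowed', 'blocked', 'partial', or 'unknown' for one bot."""
    if robots_txt is None:  # robots.txt could not be read
        return "unknown"
    groups = parse_groups(robots_txt)
    token = bot.lower()
    # An exact token match beats the wildcard group; no group means no rules.
    rules = groups[token] if token in groups else groups.get("*", [])
    disallows = [p for f, p in rules if f == "disallow" and p]  # empty = no rule
    allows = [p for f, p in rules if f == "allow" and p]
    root_dis = max((len(p) for p in disallows if matches_root(p)), default=None)
    root_allow = max((len(p) for p in allows if matches_root(p)), default=None)
    if root_dis is not None:
        overridden = root_allow is not None and root_allow >= root_dis
        return "allowed" if overridden else "blocked"
    return "partial" if disallows else "allowed"


robots = "User-agent: *\nDisallow: /\n\nUser-agent: GPTBot\nDisallow: /private/\n"
for name in ("GPTBot", "ClaudeBot"):
    print(name, classify(robots, name))
```

In the example, GPTBot has its own group with only a non-root Disallow, so it is classified as partial, while ClaudeBot falls back to the wildcard group and is classified as blocked.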