Question 1

What does this page measure?

Accepted Answer

For a curated universe of roughly 300 agent-relevant domains, we read each site's public robots.txt and assign a per-bot verdict (allowed, blocked, partial, or unknown) for 14 named AI crawlers. We also check whether the site publishes an llms.txt or ai.txt file. The numbers on this page are the aggregate of those per-site verdicts: what share of sites block GPTBot at the root, how many publish llms.txt, and so on.

Question 2

Does a blocked verdict mean the bot actually cannot crawl the site?

Accepted Answer

No. We report stated policy, not enforcement. robots.txt is a request, not a wall. A site can list Disallow for ClaudeBot and that bot can still ignore it; a site can stay silent and a well-behaved bot will still crawl. We are documenting what each site declares in its robots.txt, which is the signal publishers control and the one most agents are supposed to honor. Whether a given crawler complies is a separate question we do not claim to answer.

Question 3

How do you turn robots.txt into a verdict?

Accepted Answer

Deterministically. We pick the most specific matching user-agent group (an exact token match beats the wildcard group), then look at root access. A Disallow of the root with no equal-or-longer Allow override is blocked. A non-root Disallow with no root disallow is partial. An empty Disallow, an Allow that overrides, or no matching group at all is allowed (absence of a rule is permission, per the standard). If we cannot read robots.txt at all (timeout, network error, non-2xx), every bot for that domain is recorded as unknown, never as allowed.
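As a rough illustration, the rules above could be sketched in Python as follows. This is a minimal sketch, not the production classifier: the function names are hypothetical, wildcard patterns beyond a trailing `*` or `$` are ignored, and it assumes that only an Allow rule matching the root path itself counts as an override.

```python
from typing import Optional


def parse_groups(robots_txt: str) -> dict[str, list[tuple[str, str]]]:
    """Map each lowercased user-agent token to its (directive, path) rules."""
    groups: dict[str, list[tuple[str, str]]] = {}
    current_agents: list[str] = []
    in_rules = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a user-agent line after rules starts a new group
                current_agents = []
                in_rules = False
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            in_rules = True
            for agent in current_agents:
                groups[agent].append((field, value))
    return groups


def verdict(robots_txt: Optional[str], bot: str) -> str:
    """Return allowed, blocked, partial, or unknown for one bot."""
    if robots_txt is None:  # fetch failed: timeout, network error, non-2xx
        return "unknown"
    groups = parse_groups(robots_txt)
    rules = groups.get(bot.lower())  # an exact token match beats the wildcard
    if rules is None:
        rules = groups.get("*")
    if rules is None:  # no matching group: absence of a rule is permission
        return "allowed"
    disallows = [path for kind, path in rules if kind == "disallow" and path]
    allows = [path for kind, path in rules if kind == "allow" and path]
    if "/" in disallows:
        # Assumption: only an Allow that matches the root itself overrides.
        overridden = any(path.rstrip("*$") == "/" for path in allows)
        return "allowed" if overridden else "blocked"
    if disallows:  # some non-root path is disallowed
        return "partial"
    return "allowed"  # empty Disallow or no Disallow at all
```

For example, a file that blocks everything for `*` but only `/private/` for GPTBot yields partial for GPTBot and blocked for any bot that falls through to the wildcard group.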

Question 4

Why does unknown matter, and how is it counted?

Accepted Answer

Honesty in the denominator. Blocked and allowed percentages are computed only over known verdicts. Domains where we could not read robots.txt are excluded from the math entirely rather than silently folded into allowed. That keeps the headline percentages from drifting just because a few sites were briefly unreachable.
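The arithmetic can be sketched in a few lines of Python. The function name is hypothetical, and the sketch assumes that partial counts as a known verdict alongside allowed and blocked.

```python
from collections import Counter


def share_by_verdict(verdicts: list[str]) -> dict[str, float]:
    """Share of each known verdict, with unknown removed from the denominator."""
    counts = Counter(verdicts)
    known = sum(n for v, n in counts.items() if v != "unknown")
    if known == 0:  # nothing readable yet, so no percentages to report
        return {}
    return {v: n / known for v, n in counts.items() if v != "unknown"}
```

With one blocked, two allowed, and one unknown verdict, blocked comes out as one third rather than one quarter, because the unreachable domain never enters the denominator.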

Question 5

How fresh is the data, and why does coverage fill in gradually?

Accepted Answer

The crawl is rolling: each daily run refreshes about one seventh of the universe, so the full set re-checks roughly weekly and the first snapshot fills in over about a week. The premium endpoints carry an eight-day freshness SLA, and if the snapshot is stale past that window the request is not charged. The summary on this page reports domains with data versus domains tracked so you can see the coverage build in real time.
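One way such a schedule could work is sketched below. The page does not say how the daily seventh is chosen, so the hash-based sharding here is purely an assumption for illustration, and all names are hypothetical. The freshness check simply compares snapshot age against the eight-day window.

```python
import hashlib
from datetime import datetime, timedelta

SHARDS = 7                          # one seventh of the universe per daily run
FRESHNESS_SLA = timedelta(days=8)   # window stated for the premium endpoints


def shard_for(domain: str) -> int:
    """Assign a domain to a stable shard in the range 0..6 (assumed scheme)."""
    digest = hashlib.sha256(domain.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % SHARDS


def due_today(domains: list[str], day_index: int) -> list[str]:
    """Domains whose shard comes up on this day of the rolling crawl."""
    return [d for d in domains if shard_for(d) == day_index % SHARDS]


def is_chargeable(snapshot_at: datetime, now: datetime) -> bool:
    """A request is charged only while the snapshot is within the SLA window."""
    return now - snapshot_at <= FRESHNESS_SLA
```

Under this scheme every domain is refreshed exactly once in any seven consecutive runs, which matches the weekly re-check described above.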

Question 6

What are llms.txt and ai.txt, and why track them here?

Accepted Answer

They are emerging, opt-in conventions at a site root for telling AI systems what they may use and how. llms.txt points models at the content a site wants surfaced; ai.txt is used by some sites to declare AI usage terms. Tracking their adoption alongside robots.txt verdicts gives a fuller picture of how the open web is choosing to engage with AI crawlers: not just who is blocked, but who is actively inviting.

Question 7

Which bots do you track?

Accepted Answer

Fourteen named AI crawlers: GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, Claude-SearchBot, Claude-User, PerplexityBot, CCBot, Google-Extended, Bytespider, Amazonbot, Applebot-Extended, Meta-ExternalAgent, cohere-ai. The list covers the major training, search, and user-triggered agents from OpenAI, Anthropic, Perplexity, Common Crawl, Google, ByteDance, Amazon, Apple, Meta, and Cohere.

Question 8

How do I pull this programmatically?

Accepted Answer

Two free endpoints, no auth. GET /api/ai-crawler-access/summary.json for the aggregate, and GET /api/ai-crawler-access/site?domain=example.com for one site. The full dataset and the historical change log (when a site flips a bot from allowed to blocked, or publishes llms.txt) are premium at one credit each.
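A minimal Python sketch for calling the two free endpoints follows. The base URL is a placeholder to replace with this site's origin, the helper names are hypothetical, and the shape of the returned JSON is not documented here, so the sketch returns it as parsed without assuming any fields.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://example.com"  # placeholder: substitute the real origin


def summary_url(base: str = BASE_URL) -> str:
    """URL of the aggregate summary endpoint."""
    return f"{base}/api/ai-crawler-access/summary.json"


def site_url(domain: str, base: str = BASE_URL) -> str:
    """URL of the per-site endpoint, with the domain safely encoded."""
    query = urlencode({"domain": domain})
    return f"{base}/api/ai-crawler-access/site?{query}"


def fetch_json(url: str, timeout: float = 10.0):
    """GET a URL with no auth headers and parse the body as JSON."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_json(site_url("example.com"))` would then return the record for one site, assuming the endpoint responds with JSON.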

AI Crawler Access Map

Methodology and honest limits

Stated policy, not enforcement

Unknown is never allowed

Rolling daily crawl

Freshness SLA

Every tracked domain, by sector

Publishing: 75

Developer docs: 70

SaaS: 70

AI companies: 66

AI media: 60

E-commerce: 55

Reference: 55

Government: 50

Free agent endpoints
