A granular robots.txt strategy is the practice of selectively allowing or disallowing specific AI crawlers based on their downstream utility (training vs. retrieval) rather than applying a blanket "block AI" directive. The approach distinguishes between foundation crawlers (CCBot), which build the open datasets used by nearly all LLMs, and proprietary crawlers (GPTBot), which feed data to specific commercial models. It lets site owners balance data sovereignty with answer-engine visibility.
The Problem: The Nuclear Option Backfire
When website owners panic about their content being "stolen by AI," they often copy-paste a massive block list into their robots.txt file.
User-agent: *
Disallow: /
Or they block the biggest name they know: CCBot.
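In robots.txt terms, that single-bot block is usually just two lines. A minimal sketch, using `CCBot`, the user-agent token Common Crawl documents for its crawler:

```text
# Anti-pattern: opts the whole site out of future Common Crawl snapshots
User-agent: CCBot
Disallow: /
```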
This is a strategic error. To understand why, understand the data supply chain.
CCBot (Common Crawl). A crawler run by the non-profit Common Crawl Foundation. It takes periodic snapshots of a large share of the public web and publishes them as an open dataset (WARC files).
Who uses it? Nearly everyone. Common Crawl-derived corpora (such as C4) appear in the documented training data of models from OpenAI, Google, and Meta, and academic researchers rely on them heavily for pre-training base models.
The risk: block CCBot and you drop out of every future Common Crawl snapshot. You're not just blocking ChatGPT. You're blocking the base layer of knowledge for models that haven't been invented yet.
GPTBot (OpenAI). A proprietary crawler. It collects content that may be used to train OpenAI's models.
The risk: block GPTBot and you opt out of OpenAI's training only. Anthropic (Claude) and Google (Gemini) run their own crawlers and are unaffected.
The consequence of blanket blocking: block CCBot and you gradually disappear from the foundational training data of the next generation of AI. When a user asks a future model "Who is the leader in [Your Industry]?", the model won't necessarily hallucinate about you. It may simply have few or no tokens associated with your brand. You become a digital ghost, the long-term version of the invisibility we map in the ChatGPT visibility checklist.
The Solution: The Surgical Block
The optimal strategy for most commercial brands is surgical permissiveness. You want to be in the foundation (so models know you exist), but you may want to opt out of proprietary training (if you sell content) or live retrieval (if you want users to click through). For answer engine optimization (AEO), we generally recommend allowing retrieval bots while strictly managing training bots if you're protecting IP.
The 3-Tier Bot Taxonomy
Categorize bots in your robots.txt:
Tier 1, foundation crawlers: CCBot. Blocking this removes you from the open dataset most models start from and erodes your long-term Entity Home authority across all of them.
Tier 2, proprietary training crawlers: GPTBot, ClaudeBot, FacebookBot. These scrape content to train models and build commercial products. If you sell data, block them. If you sell services, allow them for visibility.
Tier 3, retrieval bots: OAI-SearchBot, ChatGPT-User, PerplexityBot. These bots act on behalf of users. They index or fetch your page, often in real time, so an answer engine can cite it. Blocking them is close to blocking a user from visiting your site.
Technical Implementation: The Granular File
Don't rely on default settings. Explicitly define your stance.
Maximum visibility. You want every model to know who you are, and you want every RAG agent to cite you.
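A minimal sketch of that fully open stance, assuming the site has no paths that need to stay private. The user-agent tokens are the ones named in this article; check each against the vendor's current documentation before deploying.

```text
# Maximum visibility: every crawler may fetch everything.
# An empty Disallow value means "nothing is disallowed".
User-agent: *
Disallow:

# Optional but explicit: name the bots you care about, so a later
# edit to the wildcard group cannot silently block them.
User-agent: CCBot
User-agent: GPTBot
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: PerplexityBot
Allow: /
```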
IP protection. You want to be in the foundation (so AI knows you exist) but refuse to let OpenAI train on your latest articles for free. Crucially, you still allow the search bot so users can find you.
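A hedged sketch of that split, limited to Common Crawl and OpenAI's documented user-agent tokens. Extending it to other vendors' training crawlers is an assumption you would need to verify per vendor.

```text
# Foundation: stay in the open Common Crawl dataset.
User-agent: CCBot
Allow: /

# Training: opt out of OpenAI model training.
User-agent: GPTBot
Disallow: /

# Retrieval: keep ChatGPT search and user-triggered fetches working.
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
Allow: /
```

A compliant crawler obeys the group that names it, so the GPTBot block has no effect on OAI-SearchBot or ChatGPT-User.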
OpenAI has split its bot definitions. GPTBot is for training. OAI-SearchBot is for ChatGPT search (launched as SearchGPT). ChatGPT-User covers fetches triggered by a user's request. This separation gives you the granularity you need.
CCBot vs. GPTBot
| Feature | CCBot (Common Crawl) | GPTBot (OpenAI) |
| --- | --- | --- |
| Owner | Non-profit organization | OpenAI (commercial) |
| Purpose | Archiving the web (open data) | Training proprietary models |
| Downstream Usage | Available to any AI developer or researcher | Used only by OpenAI |
| Blocking Impact | Removes you from the global base layer | Removes you from GPT training |
| Traffic Referral | Zero (it's an archive) | Low (it's a training scraper) |
| AEO Risk | Extreme (long-term invisibility across models) | Moderate (invisible to GPT only) |
Code Example: The AEO-Safe robots.txt
A modern robots.txt template that protects against empty shell issues (by allowing inspection) while managing crawler access.
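A hedged sketch of what such a template can look like, grouped by the three tiers from the taxonomy. The `example.com` URLs and the `/admin/` and `/cart/` paths are placeholders, not recommendations. robots.txt has no llms.txt directive, so the pointer appears only as a comment.

```text
# robots.txt for https://example.com (placeholder domain)
# LLM-readable summary: https://example.com/llms.txt (comment only, not a directive)

# Default for all other crawlers: allow the site, keep private paths out.
User-agent: *
Disallow: /admin/
Disallow: /cart/

# Tier 1, foundation: stay in the open dataset.
User-agent: CCBot
Disallow: /admin/
Disallow: /cart/

# Tier 2, training: block if you sell content; delete this group if you sell services.
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: FacebookBot
Disallow: /

# Tier 3, retrieval: let answer engines fetch pages on behalf of users.
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: PerplexityBot
Disallow: /admin/
Disallow: /cart/

Sitemap: https://example.com/sitemap.xml
```

A crawler that matches a named group ignores the `*` group entirely, which is why the private paths are repeated in each named group rather than stated once.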
We included the llms.txt reference here, covered in our guide on the /llms.txt standard.
Key Takeaways
- Common Crawl is the root. CCBot isn't just another crawler. It's the library of record for the AI age. Blocking it opts you out of the open dataset that many future models will be pre-trained on.
- Granularity is power. OpenAI split GPTBot (training) and OAI-SearchBot (search) for a reason. Use this distinction to protect your IP while keeping your traffic.
- Robots.txt is the rulebook. Compliance is voluntary, but reputable AI companies document that their crawlers honor robots.txt directives, which makes it the most reliable place to state your policy.
- Audit your WAF. Sometimes your robots.txt is perfect, but your Cloudflare/WAF is blocking unknown bots by default. Ensure CCBot is whitelisted in your firewall.
- Monitor with logs. Use server logs to see if ChatGPT-User is visiting your high-value pages. If not, check your Share of Model metrics.

