🤖 robots.txt Builder
A robots.txt written in 2010 only had to account for Googlebot, Bingbot, and a few others. In 2026 you need to decide, separately for each AI crawler, whether it can read your site and whether your content may be used for training, citation, or both. This builder makes those rules explicit.
1. Standard search engines
2. AI crawlers: pick a policy
Each AI vendor uses one or more named bots. The general pattern has three options: allow = your content can be used for both training and citation; a middle option = your content is kept out of the vendor's training data but can still be crawled for citation; block = the bot may not crawl at all.
| Bot | Vendor / Purpose | Policy |
|---|---|---|
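The three policies could translate into robots.txt groups roughly as follows. This is a minimal sketch, assuming a vendor's bots split into training crawlers and citation (search or user-fetch) crawlers as in the cheat-sheet. The function names and the policy label `no-train` are illustrative, not the builder's actual identifiers.

```python
def render_group(bot: str, allowed: bool) -> str:
    """Return one robots.txt group that allows or blocks the whole site for `bot`."""
    rule = "Allow: /" if allowed else "Disallow: /"
    return f"User-agent: {bot}\n{rule}\n"


def render_vendor(training_bots, citation_bots, policy: str) -> str:
    """Render the groups for one vendor.

    policy is one of these hypothetical labels:
      "allow"    - training and citation bots may crawl
      "no-train" - training bots are blocked, citation bots may crawl
      "block"    - no bot may crawl
    """
    if policy not in {"allow", "no-train", "block"}:
        raise ValueError(f"unknown policy: {policy!r}")
    groups = [render_group(b, allowed=(policy == "allow")) for b in training_bots]
    groups += [render_group(b, allowed=(policy != "block")) for b in citation_bots]
    return "\n".join(groups)


# Bot names come from the cheat-sheet; the training/citation split is an assumption.
text = render_vendor(["GPTBot"], ["ChatGPT-User", "OAI-SearchBot"], "no-train")
print(text)
```

With `no-train`, the GPTBot group gets `Disallow: /` while the ChatGPT-User and OAI-SearchBot groups get `Allow: /`.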
3. Per-path rules
One path per line. The default rules apply to all user-agents unless a bot-specific override is given. Examples: /admin/, /private/, /api/
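The override behaviour can be checked with Python's standard `urllib.robotparser`. This is a sketch, assuming the three example paths as defaults and a hypothetical override group for GPTBot. A bot that has its own group follows only that group, so any default paths it should still respect have to be repeated there.

```python
from urllib.robotparser import RobotFileParser

default_paths = ["/admin/", "/private/", "/api/"]  # one path per line in the form

lines = ["User-agent: *"]
lines += [f"Disallow: {p}" for p in default_paths]
lines += [
    "",
    "User-agent: GPTBot",    # hypothetical bot-specific override
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(lines)

# A bot without its own group falls back to the default (*) group.
default_blocked = not parser.can_fetch("SomeOtherBot", "/admin/users")
# GPTBot uses only its own group, so /admin/ is no longer blocked for it.
gptbot_admin = parser.can_fetch("GPTBot", "/admin/users")
gptbot_private = parser.can_fetch("GPTBot", "/private/notes")

print(default_blocked, gptbot_admin, gptbot_private)
```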
4. Sitemap URL
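A sketch of how the sitemap line might be appended and read back, assuming the placeholder URL `https://example.com/sitemap.xml`. `RobotFileParser.site_maps()` requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

sitemap_url = "https://example.com/sitemap.xml"  # placeholder, not a real site

lines = [
    "User-agent: *",
    "Disallow: /admin/",
    "",
    f"Sitemap: {sitemap_url}",  # applies to the whole file, not to one group
]

parser = RobotFileParser()
parser.parse(lines)
found = parser.site_maps()  # list of sitemap URLs, or None if there are none
print(found)
```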
Output
Cheat-sheet: what each bot does
- GPTBot: OpenAI's training crawler. Blocking it stops your content from being used to train future GPT models.
- ChatGPT-User: fetches when a ChatGPT user explicitly clicks "browse with web". Allow if you want ChatGPT users to cite you.
- OAI-SearchBot: OpenAI's search index for ChatGPT search.
- ClaudeBot: Anthropic's training crawler. Same trade-off as GPTBot.
- Claude-Web: fetches when a Claude user clicks a link or asks Claude to read a URL.
- anthropic-ai: older Anthropic user agent, still seen.
- PerplexityBot: Perplexity's crawler. Generally good to allow since they're a citation-heavy product.
- Google-Extended: separate token to opt out of Gemini training while keeping classic Googlebot active for search indexing.
- CCBot: Common Crawl, whose public dataset is widely used to train LLMs.
- Bytespider: ByteDance / TikTok / Doubao. Aggressive and often a problem; many sites block.
- Amazonbot: Amazon (Alexa, Rufus). Mixed reputation.
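The split between training tokens and search or user-fetch tokens can be sanity-checked before publishing. This is a minimal sketch using `urllib.robotparser`, assuming a policy that opts out of Gemini and GPT training while keeping search indexing and user-triggered fetches. The sample rules and the test path are illustrative.

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: Google-Extended
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Tokens with no group of their own (here Googlebot) are allowed by default.
results = {
    bot: parser.can_fetch(bot, "/articles/post-1")
    for bot in ["Googlebot", "Google-Extended", "GPTBot", "ChatGPT-User"]
}
print(results)
```

This only confirms how the rules match each token. Whether a given crawler honours them is up to its operator.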