Engineering Originally on dev.to

The Complete robots.txt for 2026: Every AI Crawler You Should Know About

Most websites accidentally block AI crawlers. Here's the definitive list of AI bot user agents and how to configure robots.txt.

Pablo Inigo · Founder & Engineer

Most websites are accidentally blocking AI crawlers. ChatGPT can’t cite you. Perplexity can’t find you. Claude can’t read you. Here’s the definitive list of AI bot user agents and how to configure robots.txt for maximum visibility.

The AI Crawlers You Need to Know

OpenAI (ChatGPT)

User-agent: GPTBot          # Collects training data for OpenAI models
User-agent: ChatGPT-User    # Fetches pages when a ChatGPT user asks for them
User-agent: OAI-SearchBot   # ChatGPT Search (citations!)

OAI-SearchBot matters most for traffic. When someone asks ChatGPT “best job boards in Europe,” the pages this bot has crawled are the ones ChatGPT Search can surface and cite.

Google (Gemini)

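User-agent: Google-Extended  # Opt-out token for Gemini training and grounding
User-agent: GoogleOther      # R&D crawling

Google-Extended is a control token, not a separate crawler. Blocking it keeps your content out of Gemini model training and grounding, and it does not affect your inclusion or ranking in Google Search. AI Overviews are part of Search and follow your Googlebot rules. Usually you want to allow both.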

Anthropic (Claude)

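User-agent: ClaudeBot         # Collects training data for Claude models
User-agent: Claude-User       # Fetches pages when a Claude user asks for them
User-agent: Claude-SearchBot  # Indexes pages for Claude's search results
User-agent: anthropic-ai      # Older identifier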

Perplexity

User-agent: PerplexityBot   # Perplexity AI search

Perplexity is growing fast as an AI search engine. Being cited here drives real traffic.

Apple (Siri / Apple Intelligence)

User-agent: Applebot           # Siri, Spotlight
User-agent: Applebot-Extended  # Opt-out token for Apple's generative AI training (does not crawl)

Meta

User-agent: meta-externalagent  # Meta AI training
User-agent: FacebookBot         # Meta language-model training for speech recognition (link previews use facebookexternalhit)

Others

User-agent: CopilotBot     # Microsoft Copilot (Copilot answers also draw on the Bing index, crawled by bingbot)
User-agent: YouBot         # You.com AI search
User-agent: cohere-ai      # Cohere RAG
User-agent: CCBot          # Common Crawl (feeds many AI systems)
User-agent: Bytespider     # ByteDance/TikTok
User-agent: Amazonbot      # Alexa answers
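
To check whether an existing robots.txt already blocks any of these tokens, here is a minimal sketch using Python's standard `urllib.robotparser`. The names `AI_AGENTS`, `blocked_agents`, and `SAMPLE` are illustrative and not part of any tool, and the sample file is made up. One caveat: Python's parser matches tokens by substring, so its verdict may differ slightly from how each real crawler reads your file.

```python
from urllib.robotparser import RobotFileParser

# Illustrative subset of the tokens listed above.
AI_AGENTS = (
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "Google-Extended",
    "ClaudeBot", "PerplexityBot", "Applebot-Extended",
    "meta-externalagent", "CCBot", "Bytespider", "Amazonbot",
)


def blocked_agents(robots_txt, agents=AI_AGENTS, path="/"):
    """Return the agents that may NOT fetch `path` under this robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, path)]


# Made-up example file: GPTBot is blocked, everyone else only loses /admin/.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
"""

print(blocked_agents(SAMPLE))
```

Paste the contents of your own robots.txt in place of `SAMPLE` to see which of the listed bots it turns away from the home page.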

The Simple Approach

If you want maximum AI visibility (recommended for most sites):

User-agent: *
Allow: /

Sitemap: https://yoursite.com/sitemap.xml

That’s it. Allow everything. Let every AI system read, cite, and reference your content.
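
As a quick sanity check, the sketch below compares the allow-all file with a blanket `Disallow: /`, which is one way a site can end up blocking AI crawlers without meaning to. It assumes Python's standard `urllib.robotparser`. The helper `allows` and the `BLOCK_ALL` file are hypothetical, written only for this comparison.

```python
from urllib.robotparser import RobotFileParser


def allows(robots_txt, agent, url="https://yoursite.com/"):
    """Return True if `agent` may fetch `url` under this robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# The allow-all file from this section.
ALLOW_ALL = """\
User-agent: *
Allow: /

Sitemap: https://yoursite.com/sitemap.xml
"""

# Hypothetical leftover, e.g. copied over from a staging site.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

for agent in ("OAI-SearchBot", "PerplexityBot", "ClaudeBot"):
    print(agent, allows(ALLOW_ALL, agent), allows(BLOCK_ALL, agent))
```

Every bot passes the first file and fails the second, because a crawler with no group of its own falls back to the `*` rules.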

The Selective Approach

If you want search engines and AI search, but not AI training:

# Allow search + AI search
User-agent: Googlebot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

# Block training crawlers
# GPTBot collects training data, so it belongs here rather than in the allow list
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /
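
A selective policy is easy to get subtly wrong, so it helps to test it before deploying. The sketch below, assuming Python's standard `urllib.robotparser`, runs a trimmed version of a selective file against a few agents. `SELECTIVE` and `access_report` are illustrative names, and ClaudeBot is included only as an example of a crawler the file does not mention.

```python
from urllib.robotparser import RobotFileParser

# Trimmed selective policy: two AI search bots allowed, two crawlers blocked.
SELECTIVE = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /
"""


def access_report(robots_txt, agents):
    """Map each agent to True (allowed) or False (blocked) for the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, "/") for agent in agents}


report = access_report(
    SELECTIVE,
    ["OAI-SearchBot", "PerplexityBot", "CCBot", "Bytespider", "ClaudeBot"],
)
for agent, allowed in report.items():
    print(f"{agent:15} {'allowed' if allowed else 'blocked'}")
```

ClaudeBot comes back as allowed even though the file never names it. With no `User-agent: *` group, any crawler you do not list is allowed by default, so every crawler you want to keep out needs its own `Disallow` group.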

Why This Matters

Ahrefs shows an “AI Citations” metric now. Sites that block AI crawlers cannot be fetched by them and so tend to show no citations. Sites that allow them can be referenced in ChatGPT, Perplexity, and Gemini responses, which is increasingly where people find information.

At MisuJob, we allow ALL AI crawlers. Our job listings appear in AI search results, driving traffic from ChatGPT Search and Perplexity.


What’s your robots.txt policy for AI crawlers? Block all, allow all, or selective?

