SEO Basics

How Websites Tell Google and ChatGPT What to Look At

Robots.txt is a polite request, not a wall. Google obeys it. Not every AI scraper does. Here's what it actually does and how to set it up.

What you'll learn

Every website has an invisible doorman. Before a well-behaved search engine or AI crawler looks at a single page on your site, it checks a plain text file at /robots.txt. This file contains a set of rules written in a simple format: User-agent identifies which crawler a group of rules applies to, Allow tells it which paths it can access, and Disallow tells it which paths to skip. It is the first thing a compliant bot reads when it arrives at your domain.
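To make the format concrete, here is a minimal sketch of how a compliant crawler might interpret those three directives, using Python's standard-library urllib.robotparser. The rules, the example.com domain, and the paths are assumptions made up for illustration, and real crawlers such as Googlebot use their own parsers, which can differ in details like rule precedence.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the paths are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Allow: /admin/help/
Disallow: /admin/
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# The "*" group applies to any crawler that has no group of its own.
print(parser.can_fetch("Googlebot", "https://example.com/admin/settings"))  # False
print(parser.can_fetch("Googlebot", "https://example.com/admin/help/faq"))  # True
print(parser.can_fetch("Googlebot", "https://example.com/blog/post"))       # True
```

The Allow line is listed before the broader Disallow line on purpose: Python's parser applies the first matching rule, so this ordering gives the same answer under both first-match and longest-match interpretations.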

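The robots.txt protocol was created in 1994 as a voluntary standard. The key word is voluntary. Search engines like Google, Bing, and Yahoo have honored it consistently for decades. When Googlebot sees Disallow: /admin/, it will not crawl any page under that path. This makes robots.txt useful for keeping crawlers away from staging environments, internal dashboards, duplicate filtered views, and other non-public pages. One caveat: blocking crawling is not the same as blocking indexing. A disallowed URL can still show up in search results if other pages link to it, so anything that must stay out of the index needs a noindex directive or a login in front of it.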

The AI crawler landscape is different. Companies like OpenAI (GPTBot), Anthropic (ClaudeBot), Meta, Apple, and others have deployed web scrapers that vacuum up content for training data and real-time retrieval. Most of these bots respect robots.txt. Some do not. Independent audits show that roughly 13 percent of AI bots ignore robots.txt directives entirely. Even among those that technically comply, the compliance is often selective. A bot might respect Disallow rules for one User-agent string while scraping freely under a different or undeclared User-agent.
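The User-agent loophole is easy to see in a small sketch. The example below again uses Python's urllib.robotparser as a stand-in for a compliant crawler's rule check; the rule set, the URL, and the name UndeclaredBot are hypothetical. It only shows what a rule-following parser concludes. A scraper that ignores robots.txt never runs a check like this at all.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule set that blocks only one named crawler.
RULES = """\
User-agent: GPTBot
Disallow: /
"""

rules = RobotFileParser()
rules.parse(RULES.splitlines())

page = "https://example.com/articles/pricing"

# The rule is keyed to the name the bot announces about itself.
print(rules.can_fetch("GPTBot", page))         # False
# The same software under a different, unlisted name is not covered.
print(rules.can_fetch("UndeclaredBot", page))  # True
```

Because the match is on a self-declared string, a rule only binds the bots that choose to identify themselves honestly.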

The right mental model follows from this: robots.txt is a polite request, not a security mechanism. It tells well-behaved crawlers what to skip. It does nothing to prevent bad actors from accessing your content. If a page is truly sensitive, it needs authentication, not a robots.txt entry.

The practical setup is simple. Block paths you do not want crawled: /admin/, /staging/, /api/, and any duplicate or parameterized views. Include a Sitemap directive pointing to your sitemap.xml so crawlers can find it without guessing its location. For AI bots you want to block, add User-agent rules for GPTBot, CCBot, ClaudeBot, anthropic-ai, and others, but understand that enforcement depends entirely on the bot's willingness to comply. Review your robots.txt every time you add a new section to your site to make sure you are not accidentally blocking pages you want Google to find.
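As a rough sketch of that setup, the snippet below generates a robots.txt file from a list of blocked paths, a few of the AI user-agent tokens named in this article, and a sitemap URL, then sanity-checks the result with Python's urllib.robotparser. The build_robots_txt helper and the example.com URLs are hypothetical, and the check only reflects how Python's parser reads the file, not how any particular crawler will behave.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(blocked_paths, blocked_bots, sitemap_url):
    """Return robots.txt text: shared rules, per-bot blocks, then the sitemap."""
    lines = ["User-agent: *"]
    lines += [f"Disallow: {path}" for path in blocked_paths]
    for bot in blocked_bots:
        # Each named bot gets its own group that disallows the whole site.
        lines += ["", f"User-agent: {bot}", "Disallow: /"]
    lines += ["", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"


robots_txt = build_robots_txt(
    blocked_paths=["/admin/", "/staging/", "/api/"],
    blocked_bots=["GPTBot", "CCBot", "anthropic-ai"],
    sitemap_url="https://example.com/sitemap.xml",  # assumed location
)
print(robots_txt)

# Review step: confirm the file does what was intended before deploying it.
checker = RobotFileParser()
checker.parse(robots_txt.splitlines())
print(checker.can_fetch("Googlebot", "https://example.com/blog/"))   # True
print(checker.can_fetch("Googlebot", "https://example.com/admin/"))  # False
print(checker.can_fetch("GPTBot", "https://example.com/blog/"))      # False
```

Running a check like this after every change is a cheap way to catch a rule that accidentally blocks pages you want Google to find.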
