The Ultimate Robots.txt Guide for Modern SEO
A robots.txt file is essentially the "bouncer" of your website. When a well-behaved search engine bot (such as Googlebot) arrives at your domain, the first file it requests is yourdomain.com/robots.txt. This plain text file tells the bot which areas of the site it may crawl and which private areas (like your admin panel) it should stay out of. Keep in mind that robots.txt is a request, not an enforcement mechanism: compliant crawlers honor it, but it does not technically block access.
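A minimal robots.txt putting this into practice might look like the sketch below (the /admin/ and /private/ paths are illustrative placeholders, not paths your site necessarily has):

```txt
# Rules for all crawlers
User-agent: *
# Block the back-end areas; everything else stays crawlable
Disallow: /admin/
Disallow: /private/
```

`User-agent: *` applies the group of rules to every bot, and each `Disallow` line blocks one path prefix. An empty `Disallow:` (or omitting the line) would allow the whole site.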
The AI Scraping Threat
In recent years, large AI crawlers like OpenAI's GPTBot and Common Crawl's CCBot have been relentlessly scraping websites to build training data for language models—without giving any credit or traffic to the content creators. Adding specific "Disallow" directives for these bots signals that they may not use your content, though, as with all robots.txt rules, this relies on the bot choosing to comply.
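To opt out of these crawlers site-wide, you can add a dedicated rule group for each of them. The user-agent tokens below are the ones these vendors publish for their crawlers; `Disallow: /` blocks the entire site for that bot only, leaving normal search engine crawling unaffected:

```txt
# Block OpenAI's training crawler
User-agent: GPTBot
Disallow: /

# Block Common Crawl's crawler
User-agent: CCBot
Disallow: /
```

Each `User-agent` line starts a new rule group, so rules for GPTBot and CCBot do not affect Googlebot or Bingbot.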
The Sitemap Directive
Always include your XML Sitemap URL in your robots.txt file—by convention at the end, although the Sitemap directive is actually valid anywhere in the file and must be an absolute URL. This acts as a direct roadmap for Google and Bing, helping them discover your new articles and products much faster than standard link crawling alone.
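The directive is a single line, shown here with a placeholder domain and the common default sitemap filename (your actual sitemap URL may differ):

```txt
Sitemap: https://yourdomain.com/sitemap.xml
```

You can list the directive multiple times if your site has more than one sitemap or a sitemap index file.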
Does "Disallow" mean the page won't be indexed?
This is one of the most common misconceptions in the SEO world. Disallow stops the bot from crawling the page, but if another site links to that page, Google might still index the URL (usually showing a "No information is available for this page" notice in search results). If you want to completely hide a page from Google, you must add a noindex meta tag to the page's HTML itself—and crucially, the page must not be disallowed in robots.txt, because if Google is blocked from crawling it, it will never see the noindex tag.
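The tag itself is a one-liner placed inside the page's `<head>` section:

```html
<head>
  <!-- Tells compliant search engines not to index this page -->
  <meta name="robots" content="noindex">
</head>
```

An equivalent alternative, useful for non-HTML resources like PDFs, is sending an `X-Robots-Tag: noindex` HTTP response header.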