TeckBlaze
robots.txt: complete SEO guide

February 26, 2026 · 9 min read

The robots.txt file is the first file search engines consult when visiting your site. It controls which parts of your site may be crawled, and by which robots. A misconfigured robots.txt can keep your important pages out of search results or, conversely, let crawlers waste time on sections that should stay private. This guide covers the complete syntax, the essential directives, AI bot management, and the most frequent mistakes.

What is robots.txt?

robots.txt is a plain-text file placed at the root of your website (example.com/robots.txt) that gives directives to search engine crawlers. It implements the Robots Exclusion Protocol, a de facto web standard since 1994 (formalized as RFC 9309 in 2022). Major search engines consult this file before starting to explore your site.

robots.txt is not a security mechanism: it doesn't block file access, it politely asks robots not to explore certain paths. A malicious robot can ignore these directives. To truly block access, use HTTP authentication, passwords, or server rules (.htaccess).

The absence of robots.txt is a problem detected by TeckBlaze during site-wide audits. Without this file, search engines explore your entire site without restrictions, which can waste your crawl budget on unimportant pages (admin panel, test pages, URL parameters).

TeckBlaze analyzes your robots.txt content and extracts user-agents, Allow/Disallow rules, Sitemap references, and AI bot-specific directives. The report presents this information clearly and identifies potential issues.

robots.txt syntax

A robots.txt file uses a simple syntax based on User-agent / directive pairs. Each block starts with a User-agent line identifying the target robot, followed by one or more Allow or Disallow directives specifying permitted or forbidden paths. The wildcard * matches all robots.

Here's a basic robots.txt:

User-agent: *
Disallow: /admin/
Disallow: /api/
Allow: /
Sitemap: https://example.com/sitemap.xml

User-agent: * targets all robots, the two Disallow lines forbid the admin folder and the API endpoints, Allow: / permits the rest of the site, and the Sitemap line references the sitemap.

When several rules match a URL, Google applies the most specific one (the longest matching path), so a specific Allow can override a broader Disallow. For example, Disallow: /blog/ blocks the entire blog folder, but Allow: /blog/important/ still permits crawling that specific subfolder.
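That precedence behavior can be expressed as a minimal block (paths are illustrative; note that inline # comments are permitted by the standard):

```txt
User-agent: *
Disallow: /blog/        # blocks /blog/ and everything under it...
Allow: /blog/important/ # ...except this subfolder (the longer match wins)
```

Not every crawler implements longest-match precedence, so when in doubt, test a few representative URLs against your rules.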

Comments are preceded by the # character. Use them to document your rules and explain why certain sections are blocked. This facilitates long-term maintenance, especially in teams where multiple people modify the file.

Allow and Disallow directives

The Disallow directive is the most used. It tells the robot not to explore URLs starting with the specified path. Disallow: /private/ blocks all URLs starting with /private/. An empty Disallow (Disallow: ) means everything is allowed — it's equivalent to not having a robots.txt.

The Allow directive creates exceptions to Disallow rules. It's particularly useful when you want to block an entire directory except specific files. For example: Disallow: /assets/ followed by Allow: /assets/images/ blocks the entire assets folder except images.

Common paths worth keeping out of the crawl: administration pages (/admin/, /wp-admin/), login pages (/login/, /signin/), internal search results (/search?), sort and filter pages (/products?sort=), and thank-you or confirmation pages (/thank-you).
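As a sketch, those exclusions could look like this (adapt the paths to your own site):

```txt
User-agent: *
Disallow: /admin/
Disallow: /wp-admin/
Disallow: /login/
Disallow: /signin/
Disallow: /thank-you
Disallow: /*?sort=
```

The last line uses the * wildcard to match sort-parameter URLs anywhere on the site; Google and Bing support * and $ in paths, even though they are not part of the original 1994 standard.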

Never block your CSS and JavaScript files with robots.txt. Google needs to load them to render your pages. Blocking these resources prevents Google from seeing your page as users see it, which can negatively affect your rankings.

The Sitemap directive

The Sitemap directive in robots.txt tells search engines the location of your XML sitemap file. This directive is independent of User-agent rules and can appear anywhere in the file. It helps robots discover all your important pages in a structured way.

The syntax is simple: Sitemap: https://example.com/sitemap.xml. You can declare multiple sitemaps if your site uses a sitemap index or separate sitemaps for different sections. Each sitemap URL must be absolute (starting with https://).
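If your site splits its sitemaps by section, several Sitemap lines are valid (URLs are illustrative):

```txt
Sitemap: https://example.com/sitemap-pages.xml
Sitemap: https://example.com/sitemap-blog.xml
Sitemap: https://example.com/sitemap-products.xml
```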

TeckBlaze checks for the Sitemap directive in your robots.txt as part of the site-level GEO score. The absence of this directive reduces your score because AI bots use the sitemap to discover your content exhaustively. It's a simple signal to add but often forgotten.

AI bot management

With the emergence of generative search engines, managing AI bots in your robots.txt has become a strategic issue. The main AI bots are: GPTBot (OpenAI/ChatGPT), ChatGPT-User (OpenAI), ClaudeBot (Anthropic/Claude), anthropic-ai (Anthropic), PerplexityBot (Perplexity), Google-Extended (Google Gemini), Bingbot (Microsoft Copilot), and CCBot (Common Crawl).

To allow all AI bots, ensure your robots.txt doesn't contain specific Disallow directives for these user-agents. The User-agent: * rule with Allow: / generally suffices. To block a specific bot, add a dedicated block: User-agent: GPTBot followed by Disallow: /.
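For example, a file that blocks GPTBot entirely while leaving every other crawler unrestricted (a sketch; GPTBot is the user-agent token OpenAI documents for its crawler):

```txt
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
```

A crawler uses the most specific User-agent block that matches it, so GPTBot follows its dedicated block and ignores the * block.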

TeckBlaze analyzes your robots.txt and identifies each AI bot's status individually: allowed, blocked, or not mentioned. The site-level GEO report awards +25 points if no AI bots are blocked, representing 25% of the site-level score.

The decision to block or allow AI bots depends on your business strategy. Most sites benefit from AI visibility. Sites whose business model relies on premium content (newspapers, paid research) may choose to block certain bots to protect their intellectual property.

Common mistakes

Accidentally blocking the entire site with Disallow: / under User-agent: * is the most dangerous error. It prevents all compliant search engines from crawling your site, and your pages gradually drop out of search results. Always verify your robots.txt after every modification.

Using robots.txt to hide pages instead of meta noindex is a conceptual error. Disallow prevents crawling but not indexing: if other pages link to a blocked URL, Google can still index it (without content). To truly prevent indexing, use the <meta name="robots" content="noindex"> tag.

Forgetting the trailing slash in folder paths can cause problems. Disallow: /admin blocks /admin, /administrator, /admin-tools, etc. Disallow: /admin/ only blocks the /admin/ folder and its contents. Be precise with your paths.
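The difference, side by side (illustrative paths):

```txt
Disallow: /admin    # matches /admin, /administrator, /admin-tools, ...
Disallow: /admin/   # matches only /admin/ and the paths below it
```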

Not testing robots.txt after a modification is risky. Use the robots.txt report in Google Search Console (which replaced the standalone robots.txt Tester) to verify your rules work as expected. Test with specific URLs to ensure important pages aren't accidentally blocked.
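You can also test rules locally with Python's standard-library urllib.robotparser. One caveat: Python's parser applies rules in file order (first match wins), whereas Google uses longest-match precedence, so list Allow exceptions before the broader Disallow:

```python
from urllib.robotparser import RobotFileParser

# Rules under test; the Allow exception is listed before the broader
# Disallow so that Python's first-match evaluation honors it.
rules = """
User-agent: *
Allow: /blog/important/
Disallow: /blog/
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("*", "/blog/important/page.html"))  # allowed
print(parser.can_fetch("*", "/blog/2024/post.html"))       # blocked
print(parser.can_fetch("*", "/admin/settings"))            # blocked
print(parser.can_fetch("*", "/about"))                     # allowed
```

A quick script like this in CI catches an accidental site-wide Disallow before it ships.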

Having an overly restrictive robots.txt wastes your site's crawl potential. Only block what's necessary (admin, API, private files) and leave the rest accessible. The more open your site is to crawlers, the better it will be indexed.

FAQ

Where should the robots.txt file be placed?

The robots.txt file must be placed exactly at the root of your domain, accessible at the URL https://yourdomain.com/robots.txt. It cannot be in a subfolder or on a different subdomain. Each subdomain (blog.yourdomain.com) must have its own robots.txt. The file must be plain text (text/plain) and UTF-8 encoded. For Next.js sites, you can place the file in the /public/ folder or generate it dynamically via a route handler.

How do I block a specific bot?

To block a specific bot, add a dedicated block in your robots.txt with the bot's User-agent. For example, to block GPTBot (ChatGPT's crawler):

User-agent: GPTBot
Disallow: /

This blocks only GPTBot without affecting other robots. You can also block partially: a Disallow: /private/ line under User-agent: GPTBot blocks only the /private/ folder for GPTBot. Common AI bot user-agent names are: GPTBot, ChatGPT-User, ClaudeBot, anthropic-ai, PerplexityBot, Google-Extended, CCBot.

Does robots.txt prevent a page from being indexed?

No, robots.txt prevents crawling (exploration) but not indexing. If other pages contain links to a URL blocked by robots.txt, Google can still index that URL in its results, but without content (since it couldn't crawl it). To prevent indexing, use the noindex meta tag (<meta name="robots" content="noindex">) or the HTTP header X-Robots-Tag: noindex. TeckBlaze detects noindex tags and flags them as critical because they completely prevent indexing.

Analyze your robots.txt with TeckBlaze