Mixed Directives: Why Robots.txt Files Are Handled by Subdomain and Protocol

The critical configuration mistake that catches countless website owners off guard--and how to avoid it

Every website owner who has migrated from HTTP to HTTPS, removed the www prefix, or launched a subdomain for a blog or shop has faced a critical but often overlooked configuration challenge: robots.txt directives. What many don't realize is that search engines treat robots.txt files as completely separate entities for each unique combination of subdomain and protocol. A robots.txt file on https://www.example.com has no influence over https://example.com, and vice versa. This fundamental behavior of the Robots Exclusion Protocol has caught countless website owners off guard, resulting in unintended indexing issues, duplicate content problems, and wasted crawl budget.

Understanding this behavior is essential for any web development project, as proper configuration directly impacts how search engines discover and index your content across all site variations.

The Fundamentals: How Robots.txt Scope Works

The robots.txt file operates on a simple but crucial principle: it only controls crawling for the exact host, protocol, and port number where it resides. According to Google's official documentation on robots.txt, the rules listed in a robots.txt file apply only to the specific combination of host, protocol, and port number where that robots.txt file is hosted.

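This means that each variant of a site, such as http://example.com, https://example.com, https://www.example.com, and https://blog.example.com, requires its own properly configured robots.txt file.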

The Root Directory Requirement

The robots.txt file must be placed in the top-level directory of a site to be valid. Crawlers don't look for robots.txt files in subdirectories, and they don't apply one host's file to its subdomains. This placement requirement ensures that the file can be accessed at the exact URL of domain.com/robots.txt. If the file is missing from the root directory, crawlers will assume there are no crawl restrictions for that host/protocol combination.

Port Number Considerations

Standard port numbers (80 for HTTP, 443 for HTTPS, 21 for FTP) are equivalent to their default hostnames, so https://example.com:443/robots.txt and https://example.com/robots.txt are treated as the same file. However, robots.txt files on non-standard port numbers are only valid for content made available through those specific port numbers. A robots.txt at https://example.com:8080/robots.txt would only apply to URLs accessed through port 8080, which is one reason production deployments usually stick to standard ports.
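The scope rule above can be sketched in a few lines of Python. This is an illustrative helper, not part of any crawler: the function names robots_scope and robots_txt_url are made up for this example, and the only behavior assumed is the one described above (scheme, host, and port define the scope, with default ports treated as equivalent to no port).

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit

# Default ports per scheme; a URL that uses one of these explicitly
# falls in the same scope as the URL with no port at all.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


def robots_scope(url: str) -> Tuple[str, str, Optional[int]]:
    """Return the (scheme, host, port) triple a robots.txt file applies to."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(scheme)
    return scheme, host, port


def robots_txt_url(url: str) -> str:
    """Build the root-level robots.txt URL that governs `url`."""
    scheme, host, port = robots_scope(url)
    # Leave the port out when it is the default for the scheme.
    netloc = host if port == DEFAULT_PORTS.get(scheme) else f"{host}:{port}"
    return f"{scheme}://{netloc}/robots.txt"
```

Two URLs are governed by the same robots.txt only when robots_scope returns the same triple for both, which is why a page on port 8080 and a page on the default port never share rules.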

As noted in Moz's comprehensive guide to robots.txt, understanding these scope rules is foundational to proper search engine optimization and crawl management.

Each subdomain and protocol requires its own robots.txt

# Main domain - https://example.com/robots.txt
User-agent: *
Allow: /

# www version - https://www.example.com/robots.txt
User-agent: *
Allow: /

# Blog subdomain - https://blog.example.com/robots.txt
User-agent: *
Allow: /

# Shop subdomain - https://shop.example.com/robots.txt
User-agent: *
Allow: /
Disallow: /checkout/
Disallow: /account/

The www vs Non-www Dilemma

One of the most common configuration oversights involves the www prefix. Many website owners configure their site to redirect from www to non-www (or vice versa) but forget to ensure consistent robots.txt directives across both versions. According to the Search Engine Land case study on mixed directives, this oversight can lead to situations where one version of the site gets properly crawled while the other version accidentally blocks critical content.

The critical insight here is that www.example.com and example.com are treated as completely different hosts by search engine crawlers, so each is governed by its own robots.txt. If you redirect www to non-www at the server level, make sure a request for the robots.txt at the www address still resolves to valid rules, whether the file is served directly or reached through the redirect, even if those rules simply allow all crawling.

Configuring Both Versions Correctly

For www.example.com:

User-agent: *
Allow: /

For example.com:

User-agent: *
Allow: /

This ensures that whichever version users and crawlers access, the appropriate crawling permissions are in place. When implementing technical SEO best practices, consistent robots.txt configuration across all domain variants is a fundamental requirement that many practitioners overlook.
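To make sure no variant is forgotten, it can help to enumerate them mechanically. The sketch below lists the robots.txt URL for every protocol and host-prefix combination of a domain; the function name robots_variants and the default prefixes are assumptions for illustration only.

```python
from itertools import product


def robots_variants(apex: str, subdomains=("", "www")) -> list:
    """List the robots.txt URL for every protocol/host variant of a domain.

    An empty string in `subdomains` stands for the bare (non-www) domain.
    """
    urls = []
    for scheme, sub in product(("http", "https"), subdomains):
        host = f"{sub}.{apex}" if sub else apex
        urls.append(f"{scheme}://{host}/robots.txt")
    return urls
```

Calling robots_variants("example.com") yields four URLs; passing ("", "www", "blog", "shop") as the second argument yields eight, one per host and protocol pair.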

HTTP vs HTTPS Protocol Considerations

The transition from HTTP to HTTPS is another area where robots.txt configuration often falls short. When migrating to HTTPS, website owners must create a new robots.txt file for the HTTPS version of their site. The robots.txt file at http://example.com/robots.txt has absolutely no effect on how Google's crawlers behave when accessing https://example.com, as documented in Google's robots.txt specification.

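This protocol separation becomes especially important during migration periods when both HTTP and HTTPS versions of a site are accessible. During this transitional phase, you may want different crawling behavior for each protocol version. For example, you might want to allow full crawling of the HTTPS version while slowly restricting the HTTP version as you complete the migration. Keep in mind that a crawler blocked from an HTTP URL never requests it and so never sees its redirect to HTTPS, which means restrictions on the HTTP version are best applied only after the redirects have been picked up.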

HTTPS Migration Best Practices

Create a robots.txt file for the HTTPS version that carries over the directives from the HTTP version
Ensure your HTTP to HTTPS redirects are working correctly before changing robots.txt
Update the sitemap URL in robots.txt to point to the HTTPS sitemap
Monitor Search Console for both HTTP and HTTPS properties during the transition
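The sitemap item in the checklist above is easy to verify automatically. The following is a minimal sketch that scans a robots.txt body for Sitemap lines that do not point at an https:// URL; the function name non_https_sitemaps is hypothetical, and the parsing is deliberately simple (it treats everything after a # as a comment).

```python
def non_https_sitemaps(robots_text: str) -> list:
    """Return Sitemap URLs in a robots.txt body that do not use https://."""
    flagged = []
    for raw in robots_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            url = value.strip()
            if not url.lower().startswith("https://"):
                flagged.append(url)
    return flagged
```

An empty result means every sitemap reference already uses HTTPS; anything returned is a leftover from the HTTP configuration.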

Proper HTTPS implementation is a critical component of modern web development, and robots.txt configuration should be part of your migration checklist to ensure seamless indexing across protocols.

Subdomain-Specific Robots.txt Files

Each subdomain requires its own robots.txt file. This is particularly important for websites that use subdomains for different sections of their site, such as blogs (blog.example.com), e-commerce shops (shop.example.com), or documentation (docs.example.com). The robots.txt file on the main domain does not control crawling of subdomains, as confirmed in Google's robots.txt documentation.

Common Subdomain Scenarios

Blog Subdomains: Many websites use a blog.example.com structure for their content marketing. Each blog post represents unique content that search engines need to discover. Your blog subdomain needs its own robots.txt with appropriate directives, potentially including a reference to the blog-specific sitemap.

E-commerce Subdomains: For sites with shop.example.com, you may want different crawling rules than the main site. Perhaps you want to disallow crawling of checkout and account pages while allowing product and category pages to be fully crawled.

Staging and Development: Staging environments often use subdomains like staging.example.com or dev.example.com. These should typically have restrictive robots.txt files to prevent indexing of pre-release content.
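A deployment check can guard against the classic mix-up where the restrictive staging file ends up in production, or the permissive production file ends up on staging. The sketch below uses Python's standard urllib.robotparser; the function name, the ExampleBot agent, and the probe paths are arbitrary choices made for this example.

```python
from urllib.robotparser import RobotFileParser

# Arbitrary probe paths; a real check might sample URLs from a sitemap.
PROBE_PATHS = ("/", "/index.html", "/any/deep/path")


def looks_fully_blocked(robots_text: str, agent: str = "ExampleBot") -> bool:
    """True when none of the probe paths may be fetched by `agent`."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return not any(parser.can_fetch(agent, path) for path in PROBE_PATHS)
```

A staging pipeline could require looks_fully_blocked to be True for the file it deploys, while a production pipeline requires it to be False.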

The Shop.example.com Scenario

A common scenario involves an e-commerce site with the main website at example.com and the shop at shop.example.com. If the main site's robots.txt disallows crawling of admin directories, this has no effect on the shop subdomain. Each must be configured independently to ensure proper crawling behavior. This independence is powerful--it allows you to fine-tune crawling rules for each section of your online presence while maintaining overall site health.
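The independence described above can be demonstrated with Python's standard urllib.robotparser. In this sketch the rule sets are hypothetical stand-ins for the two files, and the lookup by origin is an assumption about how one might model crawler behavior, not an actual crawler implementation. Note that Python's parser applies the first matching rule in file order, which can differ from Google's most-specific-match behavior, so the example uses only Disallow lines.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical rule sets, keyed by the origin (scheme + host) that serves them.
RULES_BY_ORIGIN = {
    "https://example.com": "User-agent: *\nDisallow: /admin/\n",
    "https://shop.example.com": (
        "User-agent: *\nDisallow: /checkout/\nDisallow: /account/\n"
    ),
}


def is_allowed(url: str, agent: str = "ExampleBot") -> bool:
    """Check `url` against the robots.txt of its own origin only."""
    parts = urlsplit(url)
    origin = f"{parts.scheme}://{parts.netloc}"
    rules = RULES_BY_ORIGIN.get(origin)
    if rules is None:
        # No robots.txt for this origin: treated as no crawl restrictions.
        return True
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)
```

With these rules, https://example.com/admin/users is blocked, but the same path on shop.example.com is allowed, because the main site's Disallow never reaches the subdomain.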

For e-commerce web development projects, proper subdomain configuration becomes even more critical as these sites typically have larger site architectures requiring careful crawl budget management.

Common Mistakes and How to Avoid Them

Blocking Critical Resources by Accident

One of the most damaging mistakes is accidentally blocking CSS and JavaScript files that search engines need to properly render pages. Modern web development relies heavily on these resources for page rendering, and blocking them can significantly impact how search engines understand and index your content. As highlighted in Moz's best practices guide, always ensure your Allow directives include static assets.
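One way to catch this before it ships is to run a list of known asset paths through a robots.txt parser. The sketch below uses Python's standard urllib.robotparser; the function name blocked_assets and the sample paths in the usage note are illustrative assumptions, and the standard-library parser does not understand wildcard patterns, so rules that rely on * or $ need a different tool.

```python
from urllib.robotparser import RobotFileParser


def blocked_assets(robots_text: str, asset_paths, agent: str = "Googlebot") -> list:
    """Return the asset paths that `agent` would not be allowed to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return [path for path in asset_paths if not parser.can_fetch(agent, path)]
```

For a file that disallows /assets/, passing ["/assets/app.css", "/static/app.js"] returns only the stylesheet, flagging it as a resource the crawler could not load when rendering the page.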

Mixed Protocol Configuration

During website migrations, failing to update robots.txt for the new protocol can result in inconsistent crawling behavior. Some crawlers may continue accessing the old HTTP version while others follow redirects to HTTPS, creating potential indexing inconsistencies. This is why maintaining parallel robots.txt files during migration is essential.

Orphaned Subdomain Configuration

Subdomains that were once used but are no longer maintained may still have outdated or overly restrictive robots.txt files, causing crawling issues for sections of the site that may still be accessible to users. Regular audits of all subdomain configurations prevent these orphaned configurations from causing unexpected SEO problems.

Inconsistent User-Agent Groups

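Different user-agent groups in robots.txt can behave unexpectedly when combined with subdomain and protocol variations, because a crawler follows only the group that matches it most specifically in the file served for that particular host and protocol. It's essential to test configuration changes with a robots.txt testing tool, such as the robots.txt report in Google Search Console, before deploying to production.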

When implementing search engine optimization, proper robots.txt configuration across all site variants is a foundational element that supports all other optimization efforts.

Best Practices for Multi-Variant Sites

Document Your Configuration

Create a tracking document for all robots.txt files across your domain variants to ensure consistency and easier troubleshooting.

Use Consistent Patterns

Maintain the same basic structure across all robots.txt files, only varying specific directives where necessary.

Centralized Management

For large sites with multiple subdomains, use configuration management to ensure all robots.txt files remain synchronized.
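One possible shape for such centralized management is a single source of truth from which every host's file is rendered. The sketch below is only an illustration: the baseline paths, the per-host overrides, the sitemap location, and the render_robots function are all assumptions, not a prescribed layout.

```python
# Hypothetical shared baseline plus per-host additions.
BASE_DISALLOW = ["/admin/", "/tmp/"]
HOST_DISALLOW = {
    "shop.example.com": ["/checkout/", "/account/"],
    "staging.example.com": ["/"],
}


def render_robots(host: str, scheme: str = "https") -> str:
    """Render the robots.txt body for one host from the shared config."""
    paths = BASE_DISALLOW + HOST_DISALLOW.get(host, [])
    lines = ["User-agent: *"]
    lines += [f"Disallow: {path}" for path in paths]
    # Assumed sitemap location; adjust to wherever each host publishes its sitemap.
    lines += ["", f"Sitemap: {scheme}://{host}/sitemap.xml"]
    return "\n".join(lines) + "\n"
```

Because every file comes from the same two structures, a change to the baseline reaches all hosts at once, while host-specific rules stay visible in one place.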

Regular Audits

Periodically audit all robots.txt files across your domain variants to ensure they remain appropriate for your current site structure.

Code Examples for Common Scenarios

Basic Allow All

User-agent: *
Allow: /

Block Specific Directory

User-agent: *
Disallow: /private/
Disallow: /tmp/

Allow Specific Crawler

User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /

Block AI Bots

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
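Rules with more than one user-agent group are worth checking before deployment. As a minimal sketch, the snippet below feeds the "allow one crawler, block the rest" pattern to Python's standard urllib.robotparser and asks what two different agents may fetch. ExampleBot and the /products/widget path are placeholders invented for the example.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Googlebot matches its own group; every other agent falls back to the * group.
googlebot_ok = parser.can_fetch("Googlebot", "/products/widget")
other_ok = parser.can_fetch("ExampleBot", "/products/widget")
print(googlebot_ok, other_ok)  # True False
```

The same check works for the AI-bot pattern: swap in the relevant rules and agent names to confirm that only the intended crawlers are turned away.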

Performance Impact

100%: each protocol/subdomain combination needs its own robots.txt (one file per host/protocol combination).

24h: Google caches robots.txt for up to this long.

Performance Implications

Proper robots.txt configuration directly impacts crawl budget efficiency. When crawlers encounter misconfigured robots.txt files, they may spend time attempting to access resources they shouldn't, wasting your crawl budget on non-essential content. This is especially important for large-scale web applications with thousands of pages.

Conversely, well-configured robots.txt files help search engines focus their crawling efforts on your most important pages, ensuring fresh content gets indexed quickly while preventing duplicate or low-value pages from consuming crawl resources. As noted in Moz's crawl budget optimization guidance, this efficiency directly affects how quickly new content appears in search results.

For large websites with thousands of pages, proper robots.txt configuration becomes critical for maintaining efficient indexing. By directing crawlers away from non-essential sections like admin interfaces, duplicate content, and thin pages, you ensure that crawl budget is spent on content that matters for SEO.

The performance benefits extend beyond just indexing speed--proper configuration reduces server load from redundant crawler requests, improves the accuracy of your search analytics, and ensures that your most valuable content receives the crawling priority it deserves.

Testing and Validation

Before deploying any robots.txt changes, validate your configuration using these methods:

Google Search Console robots.txt report and URL Inspection - Check how Google fetched each robots.txt file and whether specific URLs are blocked by it
Search Console Coverage Report - Monitor indexing status and any robots.txt-related issues across all properties
Server Log Analysis - Review actual crawler behavior after deployment to verify expected patterns
Regular Audits - Schedule quarterly reviews of all robots.txt configurations to ensure they remain appropriate
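The audit step in the list above lends itself to a small script. The sketch below classifies each robots.txt URL by what a fetch returns. To keep it testable without network access, the HTTP call is passed in as a function; audit_robots, the Fetcher type, and the status labels are all names made up for this example, and the blanket-disallow check is a crude line match that ignores which user-agent group the line belongs to.

```python
from typing import Callable, Dict, Iterable, Tuple

# A fetcher takes a URL and returns (HTTP status, response body).
Fetcher = Callable[[str], Tuple[int, str]]


def audit_robots(urls: Iterable[str], fetch: Fetcher) -> Dict[str, str]:
    """Label each robots.txt URL as missing, error, has-blanket-disallow, or ok."""
    report = {}
    for url in urls:
        status, body = fetch(url)
        if status == 404:
            report[url] = "missing"
        elif status != 200:
            report[url] = "error"
        elif any(
            line.strip().replace(" ", "").lower() == "disallow:/"
            for line in body.splitlines()
        ):
            # Crude heuristic: some group in the file blocks the whole site.
            report[url] = "has-blanket-disallow"
        else:
            report[url] = "ok"
    return report
```

In practice the fetch argument would wrap an HTTP client of your choice, and the URL list would cover every host and protocol variant the site exposes.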

For comprehensive search engine optimization, testing should be an ongoing process, not a one-time event. Each time you add a new subdomain or transition to a new protocol, robots.txt configuration should be part of your implementation checklist.

Additionally, consider using monitoring tools that alert you to changes in how search engines are crawling your site. Sudden changes in crawl patterns could indicate a robots.txt misconfiguration that needs immediate attention.

Conclusion

The scope of robots.txt is more nuanced than many website owners realize. Each combination of subdomain, protocol, and port number represents a separate crawling context that requires its own properly configured robots.txt file. By understanding and properly implementing robots.txt across all variants of your website, you can prevent indexing issues, optimize crawl budget, and ensure that search engines access exactly the content you want them to.

The key takeaway is simple but crucial: never assume that a robots.txt file on one version of your site will affect crawling of another version. Take the time to audit and configure robots.txt for every subdomain and protocol combination your site uses.

For professional assistance with web development and technical SEO configuration, including proper robots.txt implementation across all site variants, contact our team for a consultation. We can help ensure your website configuration supports optimal search engine performance.
