A robots.txt file is a straightforward text document in your website’s root directory. Following the robot exclusion standard, it instructs search engine crawlers on which pages to avoid crawling. These instructions are provided using the User-Agent and Disallow directives.
The User-Agent directive specifies the crawler, while the Disallow directive indicates the URLs not to be crawled. For instance, a robots.txt file with “User-Agent: * Disallow: /” prevents all crawlers from accessing any part of the site.
Example:
User-Agent: *
Disallow: /
This setup blocks all crawlers from accessing the entire site.
When crawlers visit your site, they look for the robots.txt file in the root directory. If it’s absent, they proceed to crawl the entire site. If the file exists, they follow the specified directives.
The primary purpose of the robots.txt file is to manage crawler requests, ensuring they don’t overload your server with excessive requests. However, it’s important to note that robots.txt does not prevent pages from being indexed by Google if other signals, such as external links, point to them.
Common Misconceptions About Robots.txt
A widespread myth is that robots.txt can prevent pages from appearing in Google search results. However, Google can still index pages if other signals are present. Misconfiguring your robots.txt file can have serious consequences, especially for large websites, by unintentionally blocking essential pages from being crawled.
Additionally, not all crawlers respect the robots.txt directives. While legitimate crawlers comply, malicious bots might ignore these instructions, so robots.txt should not be relied upon to protect sensitive information.
How to Use Robots.txt
Search engine crawlers check your robots.txt file before crawling your website. Use the Disallow directive to prevent crawlers from accessing specific pages or sections unsuitable for search engine results.
The crawl budget, which refers to the time and resources search engine crawlers allocate to your site, is optimized by correctly configuring your robots.txt. By configuring your robots.txt properly, you ensure that crawlers focus on valuable pages, enhancing the efficiency of your crawl budget.
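For example, here is a minimal sketch that keeps crawlers out of low-value areas so crawl budget goes to important pages (the internal search and cart paths are hypothetical, not from this article):
User-Agent: *
Disallow: /search/
Disallow: /cart/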
Robots.txt Doesn’t Block Indexing
Robots.txt is not a reliable method to prevent indexing. Pages can still be indexed even if they are blocked from being crawled. Instead, use the Noindex directive via meta tags or HTTP response headers to prevent indexing effectively.
Using a Meta Tag: Place a noindex meta tag in your page’s header to prevent indexing:
<meta name="robots" content="noindex">
Using an HTTP Response Header: For a more advanced method, use the X-Robots-Tag in your server’s .htaccess file to prevent indexing:
Header set X-Robots-Tag "noindex"
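For instance, here is a sketch of an .htaccess rule, assuming Apache with mod_headers enabled, that applies noindex to every PDF on the site (the PDF pattern is just an illustration):
<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex"
</FilesMatch>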
Structure of a Robots.txt File
Not all websites include a robots.txt file by default. To check if your site has a robots.txt file, navigate to your website’s root directory and append “/robots.txt” to the URL. Including a robots.txt file is vital for managing search engine behavior on your website.
Example of a Simple Robots.txt File:
User-Agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://www.website.com/sitemap.xml
This configuration blocks crawlers from the /wp-admin/ section while allowing access to the required admin-ajax.php file and incorporating a link to the Sitemap.
Non-Standard Directives
Robots.txt directives can be refined using non-standard directives like Allow, Crawl-delay, and Sitemap, in addition to User-Agent and Disallow. However, not all robots.txt directives are respected by every crawler.
Example to Allow a Specific File in a Disallowed Directory:
User-Agent: *
Allow: /folder/examplefile.html
Disallow: /folder/
Setting Crawl-Delay for Bingbot:
User-Agent: Bingbot
Crawl-delay: 20
Testing and Validating Robots.txt
Test your robots.txt file before pushing changes live. Google’s robots.txt Tester, available in the old version of Google Search Console, is a handy tool for this.
Purpose of a Robots.txt File
The robots.txt file informs web crawlers which parts of your website to ignore. Using the robots.txt file helps prevent web crawlers from wasting time on low-quality pages or getting stuck in endless URL loops, such as those created by a daily calendar.
Google’s robots.txt specification calls for a plain text file encoded in UTF-8, with records separated by CR, CR/LF, or LF. It’s also worth monitoring the size of your robots.txt file, as search engines like Google cap it at 500KB.
Where Should Robots.txt Be Located?
Robots.txt files need to be located at the domain’s root to ensure proper functionality.
Because a robots.txt file is specific to the protocol and full domain it is served from, a file hosted on https://www.website.com governs only that origin; it has no effect on the crawling of http://www.website.com or https://subdomain.website.com.
Therefore, each domain or subdomain should maintain its own robots.txt file for correct web crawling management.
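For example, with the hypothetical domains above, each origin is governed only by its own file:
https://www.website.com/robots.txt applies only to https://www.website.com
https://subdomain.website.com/robots.txt applies only to https://subdomain.website.com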
When to Use Robots.txt Rules
The goal is to manage crawling while keeping robots.txt interventions to a minimum.
To achieve this, keep your website’s architecture clean and accessible.
If immediate fixes aren’t possible, use robots.txt to block crawlers from less critical sections.
Google recommends using robots.txt primarily when server issues arise or to address crawl efficiency problems, such as when Googlebot spends excessive time crawling a non-indexable section of a site.
Used this way, robots.txt helps manage server load and keeps crawling focused on the pages that matter for indexing.
If you want to stop search engines from crawling specific pages with URL parameters, like /page?sort=price, add the rel=nofollow attribute to the links pointing to them.
Managing URL parameters and SEO this way helps keep unwanted pages out of reach for crawlers.
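For example, a link marked up this way (the path and anchor text are illustrative):
<a href="/page?sort=price" rel="nofollow">Sort by price</a>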
Blocking Backlinked URLs
Disallowing a URL that has backlinks prevents search engines from crawling it, so the link equity those backlinks carry cannot flow through to the rest of the site.
Because the target URL stays disallowed, the website never gains the authority those links could provide, which can hurt its rankings.
Remove Indexed Pages from Search Engines
Disallowing a page does not deindex it: the Disallow directive controls crawling, not indexing.
Search engines can index pages they have never crawled if other signals, such as external links, point to them.
Because crawling and indexing are largely separate processes, your disallowed pages might still show up in search results, typically without a description, since the content was never fetched.
Blocking Social Network Crawlers
Robots.txt rules are essential for managing search engine crawling, but they should not restrict social networks from accessing pages to create snippets.
It’s important to remember that platforms like Facebook fetch any page that gets shared in order to generate a preview snippet.
Therefore, when setting robots.txt rules, ensure that social network access is considered.
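One way to do that is to give the social crawler its own group; this sketch assumes Facebook’s documented crawler token, facebookexternalhit, and a hypothetical /checkout/ path blocked for everyone else:
User-agent: facebookexternalhit
Allow: /

User-agent: *
Disallow: /checkout/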
Blocking Access from Test or Development Sites
Blocking an entire staging site with robots.txt isn’t best practice. Google recommends letting such pages be crawled so that a noindex directive on them can actually be seen, rather than simply disallowing them.
However, it’s generally better to make the staging site inaccessible from the outside world, rather than relying solely on robots.txt for staging site protection.
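A minimal sketch of that approach, assuming the staging site runs on Apache and is protected with HTTP basic authentication (the .htpasswd path is a placeholder):
AuthType Basic
AuthName "Staging"
AuthUserFile /path/to/.htpasswd
Require valid-user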
No Blocking Needed?
Websites with a very clean architecture might not need to block crawlers using a robots.txt file.
In this case, returning a 404 status for a requested robots.txt file is perfectly acceptable and does not pose any issues.
Robots.txt Syntax and Formatting Rules
Understanding what a robots.txt file is and its appropriate usage is crucial.
Now, let’s delve into the standardized syntax and formatting rules that must be followed when writing a robots.txt file.
Comment Lines
A robots.txt file can contain comments, which search engines ignore entirely.
Comments start with a # and let you note why each line exists. In general, it is good practice to document the purpose of every line in your robots.txt file.
That way, a rule can be removed once it is no longer needed and is not modified while it is still essential. Documenting your robots.txt keeps it clear, relevant, and effective.
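For example, reusing the configuration shown earlier:
# Block the WordPress admin area for all crawlers
User-agent: *
Disallow: /wp-admin/
# The AJAX endpoint must stay crawlable for front-end features to work
Allow: /wp-admin/admin-ajax.php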
Selecting User-Agents to Allow
The “User-agent” directive helps specify rules for specific user agents, enhancing SEO by targeting particular search engines.
For example, you can apply different rules to Google, Bing, and Yandex while leaving Facebook and ad-network crawlers untouched by targeting their user-agent tokens.
Each crawler identifies itself with a user-agent token and obeys the most specific group of rules that matches it. Googlebot News, for instance, looks for rules addressed to 'Googlebot-News', then 'Googlebot', and finally falls back to the generic rules defined by '*'.
Common user-agent tokens include Baiduspider for Baidu, Bingbot for Bing, Googlebot for Google, Slurp for Yahoo!, and Yandex for Yandex.
Strategic implementation of the user-agent directive rules for SEO can significantly impact your website’s visibility and indexing accuracy.
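A sketch of what that targeting might look like (the paths are hypothetical):
User-agent: Googlebot
Disallow: /not-for-google/

User-agent: Bingbot
Disallow: /not-for-bing/

User-agent: *
Disallow: /private/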
Directive Guidelines
Directive rules are matched against the URL path only; the protocol and hostname are ignored. A rule path that begins with a slash matches from the start of the URL path.
For instance, "Disallow: /starts" would match www.example.com/starts.
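To illustrate with a few hypothetical URLs, the rule "Disallow: /starts" behaves like this:
Blocked: www.example.com/starts, www.example.com/starts/page, www.example.com/starts-here
Not blocked: www.example.com/page/starts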
Link to Sitemap in robots.txt
Robots.txt and sitemaps improve search engine discovery and crawling of website URLs. Always use absolute URLs (e.g., https://www.example.com/sitemap.xml) for sitemaps in your robots.txt, avoiding relative URLs like /sitemap.xml.
Robots.txt accommodates sitemaps hosted on different root domains or external domains.
Search engines will crawl sitemaps listed in robots.txt, but manual submission is needed for Google Search Console and Bing Webmaster Tools visibility.
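For instance, a robots.txt on www.example.com could reference a sitemap hosted on another domain (the CDN URL is hypothetical):
Sitemap: https://cdn.example.com/sitemap.xml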
Disallow Rules in Robots.txt
The disallow rule in robots.txt can be applied in several ways for different user agents. In this section, we’ll explore various formatting combinations for disallow rules.
However, it’s crucial to remember that directives in robots.txt are merely guidelines.
Malicious crawlers will often ignore your robots.txt file and access any public part of your site regardless, so the disallow rule should not replace robust security measures for your website.
Handling Multiple User-Agent Blocks
To apply the same disallow rules to both Googlebot and Bingbot, list both user agents ahead of the shared set of rules in a single block:
User-agent: Googlebot
User-agent: Bingbot
Disallow: /a
Spacing Between Instruction Blocks
Search engines ignore blank lines between directives within a block. In this example, the second Disallow rule is still applied even though a blank line separates it from the rest of the block.
User-agent: *
Disallow: /disallowed/

Disallow: /test1/
Merged Blocks
When several blocks name the same user agent, search engines merge them. In the example below, the two Googlebot blocks are combined, so Googlebot is blocked from crawling both /b and /a.
User-agent: Googlebot
Disallow: /b
User-agent: Bingbot
Disallow: /a
User-agent: Googlebot
Disallow: /a
Allowed in Robots.txt
The Allow rule explicitly permits specific URLs to be crawled. Crawling is the default for all URLs, so Allow is mainly useful for overriding a Disallow rule: if "/collections" is disallowed, "Allow: /collections/socks" still permits crawling of "/collections/socks" even though its parent path is blocked.
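A sketch of that configuration, using the paths from the example above:
User-agent: *
Disallow: /collections
Allow: /collections/socks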
Robots.txt Prioritization
When several rules allow and disallow a URL, the longest matching rule applies. For the URL “/home/search/shirts,” let’s examine the outcome based on the following rules:
Disallow: /home
Allow: *search/*
Disallow: *shirts
The URL can be crawled in this case because the longest matching rule wins: the Allow pattern (*search/*) is nine characters long, while the longest matching Disallow pattern (*shirts) is only seven.
If you require a particular URL to be allowed or disallowed, you can use * to make the path longer.
Example:
Disallow: *******************/socks
When a URL matches an Allow rule and a Disallow rule of the same length, Google applies the least restrictive rule, which means the Allow wins. With the rules below, the URL "/search/shirts" matches both "Disallow: /search" and "Allow: *shirts" (seven characters each), so it can still be crawled.
Disallow: /search
Allow: *shirts
Robots.txt Directives
Robots.txt directives offer an easy way to manage your site’s crawl budget.
Unlike page-level instructions such as meta robots tags, which only take effect after a page has been fetched, robots.txt rules apply before any crawling happens, helping to streamline the crawling process and save resources.
Robots.txt Noindex
The "robots.txt noindex" directive has never been officially supported by Google, and Google announced in 2019 that it would stop honoring it.
Even if it appears to work, don’t rely on it; always pair it with a supported long-term indexing control such as the noindex meta tag or X-Robots-Tag header.
Example of how you would use robots.txt noindex:
User-agent: *
Noindex: /folder
Noindex: /*?*sort=
Google historically honored a few unofficial indexing directives in robots.txt beyond the standard ones, but not all search engines recognize them and support can be withdrawn at any time, so don’t depend on them.
Robots.txt Problems and Solutions
Monitoring your robots.txt file is essential for your site’s performance.
Use Google Search Console to review your robots.txt file, ensure it stays under the 500KB size limit, and cross-check the Index Status report for disallowed URLs.
Final Word: Get your Robots.txt optimized
Mastering robots.txt is a fundamental skill for every SEO professional. Properly configuring your robots.txt file is vital for effective SEO. Misconfigurations can block important pages from crawling, negatively impacting search rankings and traffic.
Ensure you block only URLs that genuinely don’t need crawling, to optimize your crawl budget, and use noindex directives (via meta tags or HTTP headers, not robots.txt) to prevent indexing.