
What is a robots.txt file?

Written by Lawrence Hitches

13 min read
Posted 20 July 2024

The robots.txt file is a crucial element for any website, serving as a set of instructions for search engine crawlers about which URLs they should avoid. Grasping the essence of the robots.txt file, its function, how to configure it, and how to validate its directives is essential for effective technical SEO.

This post will explain robots.txt in detail, helping you use it to enhance your SEO strategy and improve your search rankings.


A robots.txt file is a straightforward text document in your website’s root directory. Following the robot exclusion standard, it instructs search engine crawlers on which pages to avoid crawling. These instructions are provided using the User-Agent and Disallow directives.

The User-Agent directive specifies the crawler, while the Disallow directive indicates the URLs not to be crawled. For instance, a robots.txt file with “User-Agent: * Disallow: /” prevents all crawlers from accessing any part of the site.

Example:

User-Agent: * 
Disallow: /

This setup blocks all crawlers from accessing the entire site.
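One way to sanity-check a rule set like this is Python's standard-library urllib.robotparser. The following is a minimal sketch, not part of the original setup; the example.com URLs and the crawler names passed in are placeholders chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# The same two lines as the example above.
ROBOTS_TXT = """\
User-Agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# With "Disallow: /" under "User-Agent: *", no path is fetchable for any crawler.
blocked_home = not parser.can_fetch("Googlebot", "https://www.example.com/")
blocked_page = not parser.can_fetch("bingbot", "https://www.example.com/blog/post")
print(blocked_home, blocked_page)
```

Note that urllib.robotparser applies rules in file order rather than by longest match, so it is best suited to simple files like this one.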

When crawlers visit your site, they look for the robots.txt file in the root directory. If it’s absent, they proceed to crawl the entire site. If the file exists, they follow the specified directives.

The primary purpose of the robots.txt file is to manage crawler requests, ensuring they don’t overload your server with excessive requests. However, it’s important to note that robots.txt does not prevent pages from being indexed by Google if other signals, such as external links, point to them.

Common Misconceptions About Robots.txt

A widespread myth is that robots.txt can prevent pages from appearing in Google search results. However, Google can still index pages if other signals are present. Misconfiguring your robots.txt file can have serious consequences, especially for large websites, by unintentionally blocking essential pages from being crawled.

Additionally, not all crawlers respect the robots.txt directives. While legitimate crawlers comply, malicious bots might ignore these instructions, so robots.txt should not be relied upon to protect sensitive information.

How to Use Robots.txt

Search engine crawlers check your robots.txt file before crawling your website. Use the Disallow directive to prevent crawlers from accessing specific pages or sections unsuitable for search engine results.

Crawl budget is the time and resources search engine crawlers allocate to your site. A correctly configured robots.txt keeps crawlers focused on your valuable pages, so that budget is spent where it matters.

Robots.txt Doesn’t Block Indexing

Robots.txt is not a reliable method to prevent indexing. Pages can still be indexed even if they are blocked from being crawled. Instead, use the noindex directive via a meta tag or an HTTP response header, and leave the page crawlable so that search engines can actually see the directive.

Using a Meta Tag: Place a noindex meta tag in your page’s header to prevent indexing:

<meta name="robots" content="noindex">

Using an HTTP Response Header: For a more advanced method, send the X-Robots-Tag header with the response. On Apache, for example, you can set it in your .htaccess file (this requires mod_headers):

Header set X-Robots-Tag "noindex"
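To confirm that a page really carries one of these two signals, a small check along the following lines can help. This is a sketch under stated assumptions: the has_noindex function and its inputs (a dict of response headers and the page HTML as a string) are hypothetical names introduced here, and it does not handle crawler-specific values such as "googlebot: noindex".

```python
from html.parser import HTMLParser


class _RobotsMetaParser(HTMLParser):
    """Collects the content attribute of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        attr_map = {name.lower(): (value or "") for name, value in attrs}
        if tag == "meta" and attr_map.get("name", "").lower() == "robots":
            self.values.append(attr_map.get("content", ""))


def has_noindex(headers, html):
    """Return True if the X-Robots-Tag header or a robots meta tag says noindex."""
    directives = [
        value for name, value in headers.items()
        if name.lower() == "x-robots-tag"
    ]
    meta = _RobotsMetaParser()
    meta.feed(html)
    directives.extend(meta.values)
    # Each value may hold several comma-separated directives.
    return any(
        "noindex" in [part.strip().lower() for part in value.split(",")]
        for value in directives
    )
```

For example, has_noindex({"X-Robots-Tag": "noindex"}, "") and has_noindex({}, '<meta name="robots" content="noindex">') both report the page as excluded from indexing.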

Structure of a Robots.txt File

Not all websites include a robots.txt file by default. To check whether your site has one, append “/robots.txt” to your root domain in the browser (for example, https://www.website.com/robots.txt). Including a robots.txt file is vital for managing search engine behavior on your website.

Example of a Simple Robots.txt File:

User-Agent: * 
Disallow: /wp-admin/ 
Allow: /wp-admin/admin-ajax.php 
Sitemap: https://www.website.com/sitemap.xml

This configuration blocks crawlers from the /wp-admin/ section while allowing access to the required admin-ajax.php file and incorporating a link to the Sitemap.

Additional Robots.txt Directives

Beyond User-Agent and Disallow, robots.txt rules can be refined with additional directives such as Allow, Crawl-delay, and Sitemap. However, not every crawler respects every directive; Google, for example, ignores Crawl-delay.

Example to Allow a Specific File in a Disallowed Directory:

Allow: /folder/examplefile.html 
Disallow: /folder/

Setting Crawl-Delay for Bingbot:

User-Agent: Bingbot 
Crawl-delay: 20
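If you want to see how a parser reads this rule, Python's urllib.robotparser exposes the declared delay. A minimal sketch, assuming Python 3.6 or later; whether a real crawler honors the value is up to that crawler.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-Agent: Bingbot
Crawl-delay: 20
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

bing_delay = parser.crawl_delay("Bingbot")     # delay declared for Bingbot
other_delay = parser.crawl_delay("Googlebot")  # no group applies to Googlebot
print(bing_delay, other_delay)
```

The second call returns None because the file defines no group for Googlebot and no generic “*” group.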

Testing and Validating Robots.txt

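Testing your robots.txt file is crucial before making changes live. Google’s robots.txt Tester, which lived in the old version of Google Search Console, has been retired; the robots.txt report in Search Console now shows the robots.txt files Google found for your site, when they were last crawled, and any warnings or errors.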

Purpose of a Robots.txt File

The robots.txt file informs web crawlers which parts of your website to ignore. Using the robots.txt file helps prevent web crawlers from wasting time on low-quality pages or getting stuck in endless URL loops, such as those created by a daily calendar.

Google’s robots.txt specification expects a plain text file encoded in UTF-8, with lines separated by CR, CR/LF, or LF. Keep an eye on the file’s size as well: Google enforces a limit of 500KB, and content beyond that limit is ignored.
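Both constraints are easy to check before deploying a file. The sketch below is one possible check; the function name check_robots_bytes is hypothetical, and the limit is assumed to mean 500 × 1024 bytes.

```python
MAX_ROBOTS_BYTES = 500 * 1024  # assumption: the 500KB cap read as 500 KiB


def check_robots_bytes(raw: bytes) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if len(raw) > MAX_ROBOTS_BYTES:
        problems.append(
            f"file is {len(raw)} bytes, over the {MAX_ROBOTS_BYTES}-byte cap"
        )
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        problems.append(f"not valid UTF-8 at byte offset {exc.start}")
    return problems
```

Calling check_robots_bytes(open("robots.txt", "rb").read()) on a local copy returns an empty list when the file passes both checks.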

Where Should Robots.txt Be Located?

Robots.txt files need to be located at the domain’s root to ensure proper functionality.

A robots.txt file applies only to the protocol and host it is served from, so a file hosted on https://www.website.com does not affect crawling of either http://www.website.com or https://subdomain.website.com.

Therefore, each domain or subdomain should maintain its own robots.txt file for correct web crawling management.
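The rule can be expressed in a few lines: the robots.txt that governs a page is found by keeping the page's protocol and host and replacing everything else with /robots.txt. A small sketch, with robots_url as a hypothetical helper name:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that applies to the given page URL."""
    parts = urlsplit(page_url)
    # Keep scheme and host (including any subdomain and port); drop the rest.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


for url in (
    "https://www.website.com/shop/item?id=1",
    "http://www.website.com/shop/item?id=1",
    "https://subdomain.website.com/blog/",
):
    print(url, "->", robots_url(url))
```

Each of the three URLs maps to a different robots.txt file, which is why every protocol and subdomain needs its own.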

When to Use Robots.txt Rules

As a rule, rely on robots.txt as little as possible to control crawling.

A clean, accessible site architecture is the better solution.

If that can’t be fixed right away, use robots.txt to block crawlers from less critical sections.

Google recommends using robots.txt primarily when server issues arise or to address crawl efficiency problems, such as when Googlebot spends excessive time on a non-indexable section of a site.

Used this way, robots.txt helps manage server load and keeps crawlers on the pages you want indexed.

If you want to discourage search engines from crawling specific pages with URL parameters, like /page?sort=price, add the rel=nofollow attribute to links pointing at them.

Managing URL parameters and SEO this way helps keep unwanted pages out of reach for crawlers.

Blocking Backlinked URLs

Disallowing URLs in robots.txt prevents link equity from passing through them to the rest of the website.

Because search engines cannot crawl a disallowed URL, they cannot follow the links on it, so the site gains no authority from backlinks pointing at that URL, which can hurt its rankings.

Remove Indexed Pages from Search Engines

Disallowing a page does not get it deindexed.

Search engines can even index a URL they have never crawled, for example when other pages link to it.

This means your disallowed pages might show up in search results unexpectedly.

This occurs because crawling and indexing are largely separate processes.

Blocking Social Network Crawlers

Robots.txt rules are essential for managing search engine crawling, but they should not restrict social networks from accessing pages to create snippets.

It’s important to remember that platforms like Facebook will visit every posted page to generate relevant snippets.

Therefore, when setting robots.txt rules, ensure that social network access is considered.

Blocking Access from Test or Development Sites

Blocking an entire staging site using robots.txt isn’t best practice. Google recommends noindexing the pages while still allowing them to be crawled.

However, it’s generally better to make the staging site inaccessible from the outside world, rather than relying solely on robots.txt for staging site protection.

No Blocking Needed?

Websites with a very clean architecture might not need to block crawlers using a robots.txt file.

In this case, returning a 404 status for a requested robots.txt file is perfectly acceptable and does not pose any issues.

Robots.txt Syntax and Formatting Rules

Understanding what a robots.txt file is and its appropriate usage is crucial.

Now, let’s delve into the standardized syntax and formatting rules that must be followed when writing a robots.txt file.

Comment Lines

A robots.txt file can include comments, which search engines ignore entirely.

Comments, which start with a #, allow you to write notes about each line’s purpose and existence. In general, it is advised to document every line’s purpose in your robots.txt file.

That way, a rule can be removed once it is no longer needed and is not changed by accident while it still is.

Selecting User-Agents to Allow

The “User-agent” directive lets you write rules for specific crawlers.

For example, to apply certain rules to Google, Bing, and Yandex but not to Facebook and ad networks, address those rules to the relevant crawlers’ user-agent tokens.

Each crawler identifies itself with a user-agent token and follows the most specific group that matches it. Googlebot News, for instance, looks for ‘Googlebot-news’ first, then ‘Googlebot,’ and finally falls back to the generic rules defined by ‘*.’
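The fallback order can be sketched as a simple lookup. This is an illustration only, not how any particular search engine implements it; select_group, the groups mapping, and the sample rules are all hypothetical.

```python
def select_group(candidate_tokens, groups):
    """Return the rules of the first matching group.

    candidate_tokens: tokens the crawler looks for, most specific first.
    groups: mapping of user-agent token -> list of rule strings.
    """
    lowered = {token.lower(): rules for token, rules in groups.items()}
    for token in candidate_tokens:
        rules = lowered.get(token.lower())
        if rules is not None:
            return rules
    # Nothing specific matched, so fall back to the generic group (if any).
    return lowered.get("*", [])


groups = {
    "Googlebot": ["Disallow: /private/"],
    "*": ["Disallow: /tmp/"],
}
news_rules = select_group(["Googlebot-News", "Googlebot"], groups)
other_rules = select_group(["Baiduspider"], groups)
print(news_rules, other_rules)
```

Here Googlebot News finds no ‘Googlebot-news’ group and uses the ‘Googlebot’ rules, while a crawler with no matching group gets the ‘*’ rules.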

Common user-agent tokens for SEO include Baidu’s Baiduspider, bingbot for Bing, Googlebot for Google, Yahoo!’s slurp, and Yandex for Yandex.

Strategic implementation of the user-agent directive rules for SEO can significantly impact your website’s visibility and indexing accuracy.

Directive Guidelines

Rules are matched against the URL path only, not the protocol or hostname. A rule that begins with a slash is matched from the start of the URL path.

For instance, the URL path in “Disallow: /starts” would match www.example.com/starts.

Listing your sitemaps in robots.txt helps search engines discover and crawl your website’s URLs. Always use absolute URLs (e.g., https://www.example.com/sitemap.xml) for sitemaps in your robots.txt, avoiding relative URLs like /sitemap.xml.

Robots.txt accommodates sitemaps hosted on different root domains or external domains.

Search engines will crawl sitemaps listed in robots.txt, but manual submission is needed for Google Search Console and Bing Webmaster Tools visibility.
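A quick way to audit the Sitemap lines in a robots.txt file is to pull them out and flag any that are not absolute URLs. The helper below is a sketch; sitemap_entries is a hypothetical name, and the check only looks for a scheme and a host.

```python
from urllib.parse import urlsplit


def sitemap_entries(robots_txt: str):
    """Yield (url, is_absolute) for every Sitemap line in the file."""
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            url = value.strip()
            parts = urlsplit(url)
            yield url, bool(parts.scheme and parts.netloc)


sample = """\
User-agent: *
Disallow: /wp-admin/
Sitemap: https://www.example.com/sitemap.xml
Sitemap: /sitemap.xml
"""
found = list(sitemap_entries(sample))
print(found)
```

In this sample the first entry is flagged as absolute and the second, relative one is not.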

Disallow Rules in Robots.txt

The disallow rule in robots.txt can be applied in several ways for different user agents. In this section, we’ll explore various formatting combinations for disallow rules.

However, it’s crucial to remember that directives in robots.txt are merely guidelines.

Malicious crawlers will often ignore your robots.txt file and access any public part of your site regardless, so the disallow rule should not replace robust security measures for your website.

Handling Multiple User-Agent Blocks

To apply the same disallow rules to both Googlebot and Bing, list both user agents before the set of rules in a block. For instance, the following block applies to Googlebot and Bingbot alike:

User-agent: Googlebot 
User-agent: bingbot 
Disallow: /a

Spacing Between Instruction Blocks

Search engines may ignore blank lines between directives and blocks. In this example, the second rule is still picked up even though a blank line separates it from the first.

User-agent: * 
Disallow: /disallowed/ 

Disallow: /test1/

Merged Blocks

Multiple blocks with the same user agent are combined. In the example below, the first and last blocks are merged, so Googlebot is disallowed from crawling both “/b” and “/a.”

User-agent: Googlebot 
Disallow: /b 

User-agent: bingbot 
Disallow: /a 

User-agent: Googlebot 
Disallow: /a
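The grouping behavior described in this section (several user agents sharing one block, and repeated blocks for the same user agent being combined) can be sketched as a small parser. This is a simplified illustration, not a complete robots.txt parser; parse_groups is a hypothetical name and only Allow and Disallow lines are collected.

```python
from collections import defaultdict


def parse_groups(robots_txt: str) -> dict:
    """Map each lowercased user-agent token to its combined rule list."""
    merged = defaultdict(list)
    current_agents = []
    reading_agents = False  # True while consecutive User-agent lines continue
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank or malformed lines are skipped
        key, value = key.strip().lower(), value.strip()
        if key == "user-agent":
            if not reading_agents:
                current_agents = []  # a new block starts here
            current_agents.append(value.lower())
            reading_agents = True
        elif key in ("allow", "disallow"):
            reading_agents = False
            for agent in current_agents:
                merged[agent].append((key, value))
    return dict(merged)


sample = """\
User-agent: Googlebot
Disallow: /b
User-agent: bingbot
Disallow: /a
User-agent: Googlebot
Disallow: /a
"""
groups = parse_groups(sample)
print(groups["googlebot"])
```

For this input the Googlebot entry ends up holding both the /b and the /a rule, while the bingbot entry holds only /a.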

Allowed in Robots.txt

The robots.txt “allow” rule explicitly permits specific URLs to be crawled. Crawling is allowed by default for all URLs, so the rule is mainly used to override a disallow rule. For instance, if “/collections” is disallowed, you can still permit crawling of “/collections/socks” with the rule “Allow: /collections/socks.”

Robots.txt Prioritization

When several rules allow and disallow a URL, the longest matching rule applies. For the URL “/home/search/shirts,” let’s examine the outcome based on the following rules:

Disallow: /home 
Allow: *search/* 
Disallow: *shirts

The URL can be crawled in this case because the Allow rule has nine characters, whereas the longest matching Disallow rule has only seven.
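The longest-match logic can be reproduced in a short sketch. It assumes that * matches any run of characters and that a trailing $ anchors the end of the path; when an allow and a disallow rule of equal length both match, this sketch lets the allow rule win, which follows RFC 9309 and Google's documentation but is an assumption about any other crawler. The function names are hypothetical.

```python
import re


def _pattern_to_regex(pattern: str):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(path: str, rules) -> bool:
    """rules is a list of (kind, pattern) pairs; kind is "allow" or "disallow"."""
    best_len, allowed = -1, True  # no matching rule means the path is crawlable
    for kind, pattern in rules:
        if not pattern or not _pattern_to_regex(pattern).match(path):
            continue
        length = len(pattern)
        # A longer pattern wins; on a tie, Allow wins (assumption, see above).
        if length > best_len or (length == best_len and kind == "allow"):
            best_len, allowed = length, kind == "allow"
    return allowed


rules = [
    ("disallow", "/home"),
    ("allow", "*search/*"),
    ("disallow", "*shirts"),
]
result = is_allowed("/home/search/shirts", rules)
print(result)
```

With the three rules from the example, the nine-character allow pattern outranks both disallow patterns, so the URL is reported as crawlable.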

If you require a particular URL to be allowed or disallowed, you can use * to make the path longer.

Example:

Disallow: *******************/socks

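When a URL matches an allow rule and a disallow rule of the same length, the outcome depends on the crawler. Google’s documentation and RFC 9309 say the least restrictive rule, the allow rule, should be used; other crawlers may resolve the tie differently. For instance, both rules below are seven characters long and both match the URL “/search/shirts,” so Google treats it as crawlable.

Disallow: /search 
Allow: *shirts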

Robots.txt Directives

Robots.txt directives offer an easy way to manage your site’s crawl budget.

Unlike page-level instructions, which only take effect once a page has been fetched, these directives apply before a URL is requested, helping to streamline the crawling process and save resources.

Robots.txt Noindex

The “robots.txt noindex” directive was never officially supported by Google, and Google stopped honoring it altogether on 1 September 2019.

Do not rely on it, even for short-term use; use a noindex meta tag or X-Robots-Tag header instead.

For reference, this is how the unsupported directive was written:

User-agent: * 
Noindex: /folder 
Noindex: /*?*sort=

Other search engines may treat unofficial directives like this differently, and their behavior can change over time, so don’t depend on them.

Robots.txt Problems and Solutions

Monitoring your robots.txt file is essential for your site’s performance.

Use Google Search Console to review your robots.txt file, ensure it stays under the 500KB size limit, and cross-check the Page indexing report (formerly Index Status) for URLs blocked by robots.txt.

Final Word: Get your Robots.txt optimized

Mastering robots.txt is a fundamental skill for every SEO professional. Properly configuring your robots.txt file is vital for effective SEO. Misconfigurations can block important pages from crawling, negatively impacting search rankings and traffic.

Ensure you block only unnecessary URLs to optimize your crawl budget and use Noindex directives to prevent indexing.


Lawrence is an SEO professional and the General Manager of Australia’s largest SEO agency, StudioHawk. He has been working in search for eight years, starting at Bing Search, where he worked on improving their algorithm, before moving on to SEO for small, medium, and enterprise businesses to help them reach more customers on search engines such as Google. He has won Young Search Professional of the Year at the Semrush Awards and Best Large SEO Agency at the Global Search Awards.

He’s now focused on educating those who want to learn about SEO with the techniques and tips he’s learned from experience and continuing to learn new tactics as search evolves.