What is Crawlability?

Written by Lawrence Hitches

7 min read
Posted 21 August 2024

Crawlability refers to how well a search engine, like Google, can access and navigate your website’s pages and resources. If your site isn’t easily crawlable, it can negatively impact your organic search rankings because search engines may struggle to access and understand your content.

It’s essential to differentiate between crawlability and indexability. While crawlability concerns the search engine’s ability to reach your pages, indexability concerns whether it can analyze and include those pages in its index. For a page to appear in search engine results, it must be both crawlable and indexable, making both factors vital for SEO.

Why is Crawlability Important?

Crawlability is a big deal if you want your website to get noticed through organic search traffic. When a search engine can crawl your pages, it can read and analyze the content, which is the first step to getting that content into the search index.

Crawling allows a page to be indexed appropriately. I say “appropriately” because, in rare cases, Google may index a URL without fully crawling it, relying instead on the URL and anchor text from backlinks. However, when this happens, the page title and description might not display on the search engine results page (SERP).

Overview of Google Crawlers and Fetchers

Google uses a variety of crawlers and fetchers to perform tasks across its products, either automatically or when triggered by user requests.

Crawlers: A “crawler” (also known as a “robot” or “spider”) is a program designed to automatically discover and scan websites by following links from one page to another. Google’s primary crawler for Google Search is called Googlebot.

Fetchers: Fetchers function like browsers, requesting a single URL when prompted by a user. These are essential for retrieving specific pages on demand rather than crawling an entire site.

Understanding Logs and Rules:

  • Referrer Logs: Google provides tables listing the various crawlers and fetchers used across its products, helping you identify these agents in your website’s logs.
  • Robots.txt Rules: The “User-agent” token in the robots.txt file specifies rules for different crawler types. Matching just one of the crawler’s tokens will apply the rule.
  • User-Agent String: This string offers a detailed description of the crawler, appearing in both HTTP requests and web logs. However, because it can easily be spoofed, it’s essential to verify that a visitor claiming to be a given crawler is genuine (see the sketch after this list).
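Google’s documented way to verify Googlebot is a two-step DNS check: reverse-resolve the visiting IP, confirm the hostname belongs to googlebot.com or google.com, then forward-resolve that hostname and make sure it maps back to the same IP. Here’s a minimal Python sketch of that check; the example IP is just a placeholder from a typical Googlebot range.

```python
# Minimal sketch: verify a visitor claiming to be Googlebot via DNS.
import socket

def is_real_googlebot(ip: str) -> bool:
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)  # reverse DNS lookup
    except socket.herror:
        return False  # no PTR record: treat as unverified
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False  # spoofed user agents fail this domain check
    try:
        _, _, forward_ips = socket.gethostbyname_ex(hostname)
    except socket.gaierror:
        return False
    return ip in forward_ips  # forward lookup must map back to the same IP

# Placeholder IP from a typical Googlebot range seen in server logs:
print(is_real_googlebot("66.249.66.1"))
```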

Common Search Crawlers

Besides Googlebot, several other bots regularly crawl websites for various purposes, such as SEO audits, content indexing, and site health monitoring. Here are some of the more commonly known ones:

  1. Bingbot – Microsoft’s search engine crawler, similar to Googlebot, used by Bing to index content.
  2. Yahoo Slurp – Yahoo’s web crawler, though Yahoo’s search results are now primarily powered by Bing.
  3. DuckDuckBot – The crawler used by DuckDuckGo, a privacy-focused search engine.
  4. YandexBot – The web crawler for Yandex, a popular search engine in Russia.
  5. Baidu Spider – The web crawler used by Baidu, the leading search engine in China.
  6. Majestic12 Bot – A crawler from Majestic used to create a backlink database.
  7. SEMrushBot – SEMrush’s crawler, used for SEO audits and competitive analysis.
  8. Moz’s RogerBot – Moz’s crawler, used for SEO data collection and analysis.
  9. Screaming Frog SEO Spider – A popular tool SEO professionals use to audit websites.
  10. Applebot – The crawler Apple uses for its web services like Siri and Spotlight Suggestions.
  11. Facebook’s Facebot – This bot crawls websites to extract content for Facebook’s social network services.
  12. LinkedIn’s LinkedInBot – Used to fetch and index content for LinkedIn’s platform.
  13. Pinterest’s PinterestBot – Crawls websites to gather images and content for Pinterest’s database.

These bots all play different roles depending on the platform or service they’re associated with. Still, they all need access to your site to perform their tasks effectively.

How to Spot Crawlability Issues on Your Site

The simplest way to identify crawlability issues is by using Google Search Console.

Search Console reports how Google crawls and indexes your site, flags any issues it encounters, and categorizes them, letting you quickly see your site’s SEO status and why certain pages may not be getting crawled.

What’s the difference between crawlability and indexability?

Crawlability refers to a search engine’s ability to access and scan a web page’s content, while indexability is about the search engine’s ability to analyze that content and include it in its index. A page can be crawlable but not indexable.

Can Google index a web page without crawling?

Yes, it’s possible, though uncommon, for Google to index a URL without fully crawling it. In such cases, the URL might appear in search results based on anchor text and the URL itself, but the page’s title and description won’t be shown. Google explains in their introduction to robots.txt that while blocked content won’t be crawled or indexed, the URL might still be indexed if it’s linked elsewhere on the web.

Factors Affecting Crawlability

Crawlability is influenced by several key factors that determine how easily search engines can navigate and understand your site; here are common factors that impact crawling:

Page Discoverability

To effectively do its job, a crawler must first be aware of your page’s existence.

Crawlers can’t discover, crawl, or index pages that aren’t in your sitemap or lack internal links—often referred to as orphan pages.

To make sure your pages are discoverable, include them in your sitemap or link to them internally—ideally, do both.

Googlebot skips links marked with the “rel=nofollow” attribute.

If your page is only linked through nofollow links, it’s effectively invisible to crawlers.
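To make orphan detection concrete, here’s a rough, standard-library-only Python sketch that flags sitemap URLs no other sitemap page links to with a followable link. The sitemap location is a placeholder, and a real audit would also handle sitemap indexes, relative canonicals, and pages that live outside the sitemap.

```python
# Rough orphan-page check: sitemap URLs that no crawled page links to.
# SITEMAP_URL is a placeholder; point it at your own sitemap.
import urllib.request
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.parse import urljoin

SITEMAP_URL = "https://example.com/sitemap.xml"

class LinkCollector(HTMLParser):
    """Collects hrefs from <a> tags, skipping rel="nofollow" links."""
    def __init__(self):
        super().__init__()
        self.hrefs = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rel = attrs.get("rel") or ""
        if tag == "a" and attrs.get("href") and "nofollow" not in rel:
            self.hrefs.add(attrs["href"])

# 1. Collect every URL the sitemap declares.
with urllib.request.urlopen(SITEMAP_URL) as resp:
    tree = ET.fromstring(resp.read())
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
sitemap_urls = {loc.text.strip() for loc in tree.iterfind(".//sm:loc", ns)}

# 2. Fetch each page and record the followable links it exposes.
linked = set()
for page in sitemap_urls:
    collector = LinkCollector()
    with urllib.request.urlopen(page) as resp:
        collector.feed(resp.read().decode("utf-8", errors="replace"))
    linked.update(urljoin(page, href) for href in collector.hrefs)

# 3. Anything in the sitemap that nothing links to is an orphan candidate.
for url in sorted(sitemap_urls - linked):
    print("Possible orphan page:", url)
```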

Robots.txt File

The robots.txt file on your site guides crawlers on which areas they can access and which they must avoid.

If a page is disallowed in robots.txt, crawlers won’t be able to access it.

However, blocking a page with robots.txt doesn’t necessarily keep its URL out of search results. If enough other pages link to a blocked URL, Google may still index it based on those links alone, even though it won’t know the page’s content.

For pages you truly want out of search results, a noindex directive (which requires the page to remain crawlable) is the more reliable control, so manage sensitive or irrelevant pages carefully.
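As a quick sanity check on your rules, Python’s standard-library robots.txt parser can report which URLs a given crawler is allowed to fetch. The site and paths below are placeholders.

```python
# Check what your live robots.txt allows, using Python's stdlib parser.
# All URLs below are placeholders; substitute your own site.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()  # fetch and parse the live robots.txt

checks = [
    ("Googlebot", "https://example.com/blog/"),
    ("Googlebot", "https://example.com/admin/"),
    ("*", "https://example.com/search?q=test"),
]
for agent, url in checks:
    verdict = "allowed" if rp.can_fetch(agent, url) else "disallowed"
    print(f"{agent} -> {url}: {verdict}")
```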

Hierarchy of Crawler Checks

When Google discovers a new page, it follows a set sequence to decide whether to crawl and index it.

The process begins by examining the robots.txt file, then reviews the HTTP headers, and finally inspects the meta tags.

This order is vital as it guides the decisions made by Google’s crawler, underscoring the importance of configuring these elements accurately.

By correctly setting up your robots.txt file, HTTP headers, and meta tags, you can effectively control which pages are indexed and which are not.
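Here’s a hedged sketch that mirrors that order for a single URL: robots.txt first, then the X-Robots-Tag response header, then the robots meta tag in the HTML. The URL and user agent are placeholders, and a real crawler would also handle errors and redirects.

```python
# Sketch of the check order above: robots.txt, then HTTP headers, then meta tags.
import re
import urllib.request
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

URL = "https://example.com/some-page"   # placeholder
AGENT = "Googlebot"

# 1. robots.txt: if fetching is disallowed, the crawler never sees the page.
parts = urlsplit(URL)
rp = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
rp.read()
print("robots.txt allows fetch:", rp.can_fetch(AGENT, URL))

# 2. HTTP headers: X-Robots-Tag can forbid indexing even if crawling is allowed.
req = urllib.request.Request(URL, headers={"User-Agent": AGENT})
with urllib.request.urlopen(req) as resp:
    print("X-Robots-Tag header:", resp.headers.get("X-Robots-Tag", "(none)"))
    html = resp.read().decode("utf-8", errors="replace")

# 3. Meta tags: <meta name="robots" content="noindex"> is checked last.
match = re.search(r'<meta[^>]+name=["\']robots["\'][^>]*>', html, re.I)
print("robots meta tag:", match.group(0) if match else "(none)")
```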

HTTP Status Codes Influence on Crawling

The status codes in your HTTP headers directly influence Google’s crawling decisions.

For instance, a status code 200 signals that the page is ready to be crawled, while a redirect status like 307 (a temporary redirect) points the crawler at the target URL instead, so the content at the original URL isn’t crawled.

Correctly configuring your HTTP status codes is crucial for managing crawlability and preventing redirects from unintentionally blocking pages you want Google to crawl.
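A small script can spot-check status codes across a list of URLs; the sketch below disables urllib’s automatic redirect-following so a 301 or 307 shows up as itself rather than as the final destination. The URLs are placeholders.

```python
# Spot-check HTTP status codes without following redirects.
import urllib.error
import urllib.request

class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Surface 3xx responses instead of silently following them."""
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # returning None makes urllib raise the 3xx as HTTPError

opener = urllib.request.build_opener(NoRedirect)

URLS = [  # placeholders; list the pages you want to audit
    "https://example.com/",
    "https://example.com/old-page",
]

for url in URLS:
    try:
        resp = opener.open(urllib.request.Request(url, method="HEAD"))
        print(url, "->", resp.status)  # 2xx: crawlable as-is
    except urllib.error.HTTPError as e:
        target = e.headers.get("Location", "")
        print(url, "->", e.code, f"(redirects to {target})" if target else "")
```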

Access Restrictions

Some web pages have specific restrictions that prevent crawlers from accessing them, such as:

  • Login requirements
  • User-agent blacklisting
  • IP address blacklisting

These restrictions must be carefully managed to ensure that the correct pages are accessible to crawlers while keeping sensitive or irrelevant pages out of search results.
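One simple way to approximate a crawler’s view is to fetch a page with no cookies or session: a login wall typically surfaces as a redirect to a sign-in page or as a 401/403. The URL below is a placeholder.

```python
# Fetch a page "cold" (no cookies, no session) to approximate a crawler's view.
import urllib.error
import urllib.request

URL = "https://example.com/members/reports"  # placeholder

req = urllib.request.Request(URL, headers={"User-Agent": "Mozilla/5.0"})
try:
    with urllib.request.urlopen(req) as resp:
        # geturl() reveals the final URL after any redirects (e.g. a login page).
        print("Anonymous fetch:", resp.status, "final URL:", resp.geturl())
except urllib.error.HTTPError as e:
    print("Anonymous fetch blocked with status", e.code)
```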


Lawrence is an SEO professional and the General Manager of StudioHawk, Australia’s largest SEO agency. He’s been working in search for eight years, starting with Bing Search, where he worked on improving their algorithm, before moving on to SEO for small, medium, and enterprise businesses, helping them reach more customers on search engines such as Google. He’s won Young Search Professional of the Year at the Semrush Awards and Best Large SEO Agency at the Global Search Awards.

He’s now focused on educating those who want to learn SEO, sharing the techniques and tips he’s learned from experience while continuing to pick up new tactics as search evolves.