Why is Crawlability Important?
Crawlability is a big deal if you want your website to get noticed through organic search traffic. When a search engine can crawl your pages, it can read and analyze the content, which is the first step to getting that content into the search index.
Crawling allows a page to be indexed appropriately. I say “appropriately” because, in rare cases, Google may index a URL without fully crawling it, relying instead on the URL and anchor text from backlinks. However, when this happens, the page title and description might not display on the search engine results page (SERP).
Overview of Google Crawlers and Fetchers
Google uses a variety of crawlers and fetchers to perform tasks across its products, either automatically or when triggered by user requests.
Crawlers:
A “crawler” (also known as a “robot” or “spider”) is a program designed to automatically discover and scan websites by following links from one page to another. Google’s primary crawler for Google Search is called Googlebot.
Fetchers:
Fetchers function like browsers, requesting a single URL when prompted by a user. These are essential for retrieving specific pages on demand rather than crawling an entire site.
Understanding Logs and Rules:
- Referrer Logs: Google provides tables listing the various crawlers and fetchers used across its products, helping you identify these agents in your website’s logs.
- Robots.txt Rules: The “User-agent” token in the robots.txt file specifies rules for different crawler types. Matching just one of the crawler’s tokens will apply the rule.
- User-Agent String: This string offers a detailed description of the crawler, appearing in both HTTP requests and web logs. However, because it can be easily spoofed, it’s essential to verify the authenticity of a visitor.
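Because the string can be spoofed, Google recommends a reverse-then-forward DNS check to confirm that a visitor claiming to be Googlebot really is one. Here is a minimal Python sketch of that check; the IP address in the example is only illustrative, and in practice you would feed it addresses pulled from your access logs:

```python
import socket

def verify_googlebot(ip: str) -> bool:
    """Confirm that an IP claiming to be Googlebot really belongs to Google.

    Google's documented check: the reverse DNS lookup must resolve to a
    googlebot.com or google.com hostname, and the forward lookup of that
    hostname must return the original IP.
    """
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)    # reverse DNS lookup
    except (socket.herror, socket.gaierror):
        return False
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        return socket.gethostbyname(hostname) == ip  # forward confirmation
    except socket.gaierror:
        return False

# Example with an illustrative IP; in practice, pass IPs from your access logs.
print(verify_googlebot("66.249.66.1"))
```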
Common Search Crawlers
Besides Googlebot, several other bots regularly crawl websites for various purposes, such as SEO audits, content indexing, and site health monitoring. Here are some of the more commonly known ones:
- Bingbot – Microsoft’s search engine crawler, similar to Googlebot, used by Bing to index content.
- Yahoo Slurp – Slurp is the web crawler for Yahoo, though Yahoo Search is now primarily powered by Bing.
- DuckDuckBot – The crawler used by DuckDuckGo, a privacy-focused search engine.
- YandexBot – The web crawler for Yandex, a popular search engine in Russia.
- Baidu Spider – The web crawler used by Baidu, the leading search engine in China.
- Majestic12 Bot – A crawler from Majestic used to create a backlink database.
- SEMrushBot – An SEMrush crawler used for SEO audits and competitive analysis.
- Moz’s RogerBot – Moz’s crawler, used for SEO data collection and analysis.
- Screaming Frog SEO Spider – A popular tool SEO professionals use to audit websites.
- Applebot – The crawler Apple uses for its web services like Siri and Spotlight Suggestions.
- Facebook’s Facebot – This bot crawls websites to extract content for Facebook’s social network services.
- LinkedIn’s LinkedInBot – Used to fetch and index content for LinkedIn’s platform.
- Pinterest’s PinterestBot – Crawls websites to gather images and content for Pinterest’s database.
These bots all play different roles depending on the platform or service they’re associated with. Still, they all need access to your site to perform their tasks effectively.
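If you want a rough sense of which of these bots are actually visiting your site, a simple pass over your access log is often enough. The sketch below is a minimal Python example: the log path is hypothetical, and the user-agent tokens are the commonly published ones (some, such as Facebook’s facebookexternalhit, differ from the product name), so adjust the list to match what you see in your own logs:

```python
from collections import Counter

# Commonly published user-agent tokens for the crawlers listed above.
BOT_TOKENS = [
    "Googlebot", "bingbot", "Slurp", "DuckDuckBot", "YandexBot",
    "Baiduspider", "MJ12bot", "SemrushBot", "rogerbot",
    "Screaming Frog SEO Spider", "Applebot", "facebookexternalhit",
    "LinkedInBot", "Pinterestbot",
]

def count_bot_hits(log_path: str) -> Counter:
    """Tally how often each known crawler token appears in an access log."""
    hits = Counter()
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            for token in BOT_TOKENS:
                if token.lower() in line.lower():
                    hits[token] += 1
                    break  # count each request once, even if tokens overlap
    return hits

# Example usage (the log path is hypothetical):
# print(count_bot_hits("/var/log/nginx/access.log").most_common())
```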
How to Spot Crawlability Issues on Your Site
The simplest way to identify crawlability issues is by using Google Search Console.
Search Console reports crawl and indexing issues across your entire site and categorizes them, allowing you to quickly understand your site’s SEO status and why certain pages may not be getting crawled.
What’s the difference between crawlability and indexability?
Crawlability refers to a search engine’s ability to access and scan a web page’s content, while indexability is about the search engine’s ability to analyze that content and include it in its index. A page can be crawlable but not indexable.
Can Google index a web page without crawling?
Yes, it’s possible, though uncommon, for Google to index a URL without fully crawling it. In such cases, the URL might appear in search results based on anchor text and the URL itself, but the page’s title and description won’t be shown. Google explains in their introduction to robots.txt that while blocked content won’t be crawled or indexed, the URL might still be indexed if it’s linked elsewhere on the web.
Factors Affecting Crawlability
Crawlability is influenced by several key factors that determine how easily search engines can navigate and understand your site. The most common of these are covered below:
Page Discoverability
To effectively do its job, a crawler must first be aware of your page’s existence.
Crawlers can’t discover, crawl, or index pages that are neither in your sitemap nor linked internally; such pages are often referred to as orphan pages.
To make sure your pages are discoverable, include them in your sitemap, link to them internally, or, ideally, do both.
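One way to audit discoverability is to compare the pages you know exist (for example, a URL export from your CMS) against what your sitemap actually lists. A minimal Python sketch, assuming a standard XML sitemap rather than a sitemap index, with a hypothetical URL in the usage comments:

```python
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(sitemap_url: str) -> set:
    """Return every <loc> URL listed in a standard XML sitemap."""
    with urllib.request.urlopen(sitemap_url) as response:
        tree = ET.parse(response)
    return {
        loc.text.strip()
        for loc in tree.getroot().findall(".//sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    }

# Pages that exist in your CMS but are neither listed here nor linked
# internally are likely orphan pages. (The URL below is hypothetical.)
# known_pages = {...}  # e.g. a URL export from your CMS
# listed = sitemap_urls("https://www.example.com/sitemap.xml")
# print(known_pages - listed)  # candidates to add to the sitemap or link to
```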
Nofollow Links
Googlebot generally skips links marked with the “rel=nofollow” attribute, treating nofollow as a hint rather than a strict directive.
If your page is only linked through nofollow links, it’s effectively invisible to crawlers.
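To see which of your links carry the nofollow hint, you can scan a page’s HTML for rel="nofollow". A small sketch using Python’s built-in parser; the HTML snippet it parses is purely illustrative:

```python
from html.parser import HTMLParser

class NofollowAudit(HTMLParser):
    """Collect the href of every <a> tag whose rel attribute contains 'nofollow'."""

    def __init__(self):
        super().__init__()
        self.nofollow_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").lower()
        if "nofollow" in rel:
            self.nofollow_links.append(attrs.get("href"))

# Illustrative snippet: only the /pricing link is nofollow.
parser = NofollowAudit()
parser.feed('<a href="/pricing" rel="nofollow">Pricing</a> <a href="/blog">Blog</a>')
print(parser.nofollow_links)  # ['/pricing']
```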
Robots.txt File
The robots.txt file on your site guides crawlers on which areas they can access and which they must avoid.
If a page is disallowed in robots.txt, crawlers won’t be able to access it.
However, blocking a page with robots.txt doesn’t necessarily prevent its URL from appearing in search results.
If enough other pages link to a blocked URL, Google may still include it in search results even without knowing the page’s content, so it’s crucial to manage sensitive or irrelevant pages with something stronger than robots.txt alone, such as a noindex directive or authentication.
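You can test how your rules treat a given user agent without waiting for a crawler to visit. Below is a minimal sketch using Python’s built-in robots.txt parser; the domain and paths are hypothetical:

```python
from urllib import robotparser

# Hypothetical site; point this at your own robots.txt to test real rules.
rp = robotparser.RobotFileParser("https://www.example.com/robots.txt")
rp.read()

# can_fetch() answers: may this user agent crawl this URL under the rules?
print(rp.can_fetch("Googlebot", "https://www.example.com/private/report.html"))
print(rp.can_fetch("*", "https://www.example.com/blog/"))
```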
Hierarchy of Crawler Checks
When Google discovers a new page, it follows a set sequence to decide whether to crawl and index it.
The process begins with the robots.txt file, then moves on to the HTTP headers (such as the X-Robots-Tag header), and finally to the meta robots tags.
This order matters: if robots.txt blocks a page, Googlebot never fetches it, so any noindex header or meta tag on that page goes unseen, which is why configuring these elements accurately is so important.
By correctly setting up your robots.txt file, HTTP headers, and meta tags, you can effectively control which pages are indexed and which are not.
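The same order of checks can be mirrored in a quick diagnostic script: consult robots.txt first, then the response headers, then the markup. Here is a rough Python sketch (the URL in the usage comment is hypothetical, and a production audit would handle redirects and edge cases more carefully):

```python
import re
import urllib.request
from urllib import robotparser
from urllib.parse import urlsplit

def crawl_and_index_verdict(url: str, agent: str = "Googlebot") -> str:
    """Mirror Google's order of checks: robots.txt, then headers, then meta tags."""
    base = "{0.scheme}://{0.netloc}".format(urlsplit(url))

    # 1. robots.txt: if crawling is disallowed, the later signals are never seen.
    rules = robotparser.RobotFileParser(base + "/robots.txt")
    rules.read()
    if not rules.can_fetch(agent, url):
        return "blocked by robots.txt (page not crawled, so noindex would go unseen)"

    # 2. HTTP headers: an X-Robots-Tag response header can forbid indexing.
    request = urllib.request.Request(url, headers={"User-Agent": agent})
    with urllib.request.urlopen(request) as response:
        if "noindex" in (response.headers.get("X-Robots-Tag") or "").lower():
            return "crawlable, but X-Robots-Tag says noindex"
        html = response.read(200_000).decode("utf-8", errors="replace")

    # 3. Meta tags: a <meta name="robots" content="noindex"> has the final say.
    meta = re.search(r'<meta[^>]+name=["\']robots["\'][^>]*>', html, re.IGNORECASE)
    if meta and "noindex" in meta.group(0).lower():
        return "crawlable, but the meta robots tag says noindex"
    return "crawlable and indexable as far as these signals go"

# Example (hypothetical URL):
# print(crawl_and_index_verdict("https://www.example.com/blog/post"))
```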
HTTP Status Codes Influence on Crawling
The status codes in your HTTP headers directly influence Google’s crawling decisions.
For instance, a 200 status code signals that the page is available and ready to be crawled, while a redirect status code such as 307 (temporary redirect) sends Googlebot to the target URL instead, so the content at the current URL isn’t crawled.
Correctly configuring your HTTP status codes is crucial for managing crawlability and preventing redirects from unintentionally blocking pages you want Google to crawl.
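To spot problematic status codes before Google does, you can check URLs yourself without following redirects. A minimal sketch using the third-party requests library, with a rough mapping loosely based on how Google documents these codes (the URL in the usage comment is hypothetical):

```python
import requests

def crawlability_status(url: str) -> str:
    """Fetch a URL without following redirects and describe the likely
    effect of its status code on crawling."""
    response = requests.head(url, allow_redirects=False, timeout=10)
    code = response.status_code
    if code == 200:
        return f"{code}: OK, the page itself can be crawled"
    if code in (301, 302, 307, 308):
        return f"{code}: redirect, Googlebot is sent to {response.headers.get('Location')}"
    if code in (401, 403):
        return f"{code}: access restricted, crawlers are turned away"
    if code in (404, 410):
        return f"{code}: gone, nothing here to crawl"
    if code >= 500:
        return f"{code}: server error, crawling is retried and may be slowed"
    return f"{code}: check this status manually"

# Example (hypothetical URL):
# print(crawlability_status("https://www.example.com/old-page"))
```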
Access Restrictions
Some web pages have specific restrictions that prevent crawlers from accessing them, such as:
- Login requirements
- User-agent blacklisting
- IP address blacklisting
These restrictions must be carefully managed to ensure that the correct pages are accessible to crawlers while keeping sensitive or irrelevant pages out of search results.
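A quick way to check for user-agent based blocking on your own site is to request the same URL with a browser-style user agent and a crawler-style one, then compare the responses. This sketch (again using requests, with a hypothetical URL) only covers user-agent rules; IP-level blocks won’t show up unless you test from the affected network:

```python
import requests

BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
CRAWLER_UA = ("Mozilla/5.0 (compatible; Googlebot/2.1; "
              "+http://www.google.com/bot.html)")

def compare_access(url: str) -> None:
    """Request the same URL as a browser and as a crawler; differing status
    codes suggest user-agent based blocking."""
    for label, user_agent in (("browser", BROWSER_UA), ("crawler", CRAWLER_UA)):
        response = requests.get(url, headers={"User-Agent": user_agent},
                                allow_redirects=False, timeout=10)
        print(f"{label:>8}: {response.status_code}")

# Example (hypothetical URL):
# compare_access("https://www.example.com/docs/")
```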