Crawling is the initial step in a search engine finding your website’s content. Without crawling, your site won’t be indexed or ranked, resulting in no traffic.
Google announced in 2016 that it knew of 130 trillion pages on the web, discovered through a process that is remarkably complicated and completely automatic.
However, research from Ahrefs shows that 96.55% of content receives zero traffic from Google. This indicates that search engines like Google only crawl and index select content.
What is Search Engine Crawling?
Web crawling, as Google defines it, is the process of discovering and downloading text, images, and videos from web pages using automated programs called web crawlers, also known as spiders. Crawling is how these bots find your content so it can later be indexed and surfaced in search results.
These crawlers find new pages by travelling through URLs (links). They crawl sitemaps, internal links and backlinks to find additional pages that haven’t been crawled. Once they find a new page, they extract the information to index it in their database.
Different Search Engine Bots & User Agent Strings
Understanding search engine bots, and the user-agent strings they identify themselves with, is essential to understanding how search engine crawling works.
Web crawlers, or bots, are automated tools that find new web pages across the internet. By identifying themselves with user-agent strings, they help index online content for better search engine results.
Blocking these bots' access means no rankings on search engines!
| Search Engine | Bot Name | User Agent String Example | Purpose |
| --- | --- | --- | --- |
| Google | Googlebot | Googlebot/2.1 (+http://www.google.com/bot.html) | Googlebot crawls and indexes web pages to help improve Google Search results and other Google services. |
| Bing | Bingbot | Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) | Bingbot explores and indexes web pages to enhance Bing Search results. |
| DuckDuckGo | DuckDuckBot | DuckDuckBot/1.0; (+http://duckduckgo.com/duckduckbot.html) | DuckDuckBot scans and indexes web pages to make DuckDuckGo Search better. |
| Yandex | YandexBot | Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots) | YandexBot crawls and indexes web pages to improve Yandex Search. |
| Baidu | Baiduspider | Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html) | Baiduspider searches and indexes web pages to enhance Baidu Search. |
| Apple | Applebot | Applebot/1.0 (+http://www.apple.com/go/applebot) | Applebot explores and indexes web pages for services like Spotlight Search. |
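These user-agent names are also what you reference in a robots.txt file when you want to allow or block crawling of parts of your site. Here is a minimal, illustrative robots.txt; the domain, paths, and the "BadBot" name are placeholders rather than recommendations:

```
# Illustrative robots.txt (placeholder domain, paths, and bot name)
User-agent: *
Disallow: /admin/        # all crawlers: stay out of this private section

User-agent: BadBot
Disallow: /              # block one specific (hypothetical) bot from the whole site

Sitemap: https://www.example.com/sitemap.xml
```

Remember: a bot that is blocked here cannot crawl those URLs, so the content behind them won't reach the index through crawling.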
The Stages of Search Engine Crawling
Search engines work through a process called crawling to deliver relevant search results. This involves five steps: discovery, fetching, parsing, rendering, and indexing.
Step One: Discovery
The URL discovery process kicks off web crawling. When starting out, search engines like Google use bots, like Googlebot, to track down new, uncrawled pages.
Optimizing crawlability is crucial because search engines typically have a huge queue of URLs waiting to be crawled; the easier your pages are to discover and fetch, the better the chance they get crawled promptly instead of sitting in that queue.
Search engine crawling involves several methods to discover new content.
Most commonly, search engines re-crawl previously indexed URLs to check for updates or changes. They also crawl a website’s XML sitemap to uncover new pages.
Another approach search engines utilize involves crawling internal and external links to identify new pages.
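For reference, an XML sitemap is simply a structured list of URLs you want search engines to know about. A minimal sitemap, with a placeholder domain and date, looks like this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/blog/new-post/</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
</urlset>
```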
Remember, however, “Crawling is not a guarantee you’re indexed.” – Rand Fishkin
Search engine indexing ensures your web pages are added to search results, but it’s not automatic. Your pages must meet specific criteria for search engines to index them.
Step Two: Fetching
Search engine crawlers start by picking a URL from their queue. They then request the page from the web server and download its content so it can be parsed and, eventually, indexed.
The server responds with the page’s content, usually as HTML code. This HTML describes the page’s structure and text, and it may also reference other resources such as images, CSS stylesheets, and JavaScript files.
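Conceptually, fetching is just an HTTP request and response. Here is a rough Python sketch of fetching a single URL the way a crawler might; the URL and user-agent string are placeholders, and real crawlers also handle retries, redirects, robots.txt rules, and much more:

```python
# Minimal sketch of the "fetch" step: request a URL and read back its HTML.
from urllib.request import Request, urlopen

url = "https://www.example.com/"  # placeholder URL
request = Request(
    url,
    headers={"User-Agent": "ExampleCrawler/1.0 (+https://www.example.com/bot)"},
)

with urlopen(request, timeout=10) as response:
    status = response.status                # e.g. 200 if the page was served
    html = response.read().decode("utf-8")  # the page's HTML as text

print(status, len(html))
```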
Step Three: Parsing
Once a page has been fetched, the crawler parses its HTML content to extract information, a crucial step in web crawling.
The information extracted includes the following (a short parsing sketch follows the list):
- Links, both internal and external, are identified within the HTML code and added to the discovery queue for future crawling.
- Embedded resources, such as images, CSS stylesheets, and JavaScript files, are identified so the crawler can fetch and analyze them separately.
- Metadata, such as the page’s title, description, and keywords, is extracted to help the search engine understand the page’s content and relevance.
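Here is the sketch mentioned above: a simplified Python example, using only the standard library, that pulls links, the title, and meta tags out of fetched HTML. The sample HTML is invented for illustration:

```python
# Minimal sketch of the "parse" step: extract links and metadata from HTML.
from html.parser import HTMLParser

class PageParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []       # hrefs found in <a> tags (internal or external)
        self.metadata = {}    # name -> content from <meta> tags
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])   # candidates for the discovery queue
        elif tag == "meta" and "name" in attrs:
            self.metadata[attrs["name"]] = attrs.get("content", "")
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

sample_html = """<html><head><title>Example Page</title>
<meta name="description" content="A sample page."></head>
<body><a href="/about/">About</a> <a href="https://other.example/">Partner</a></body></html>"""

parser = PageParser()
parser.feed(sample_html)
print(parser.title, parser.links, parser.metadata)
```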
Step Four: Rendering
Web crawlers struggle with accurately reading content on modern websites built using JavaScript. This makes it tricky for them to interpret and index such sites correctly.
To understand a webpage completely, search engines need to render it first. This means they run the JavaScript code on the page, much like how a browser would show it to you. This step is crucial for search engines to grasp the full content and present it correctly.
Rendering gives web crawlers a fuller picture of a page’s content, including elements generated dynamically by JavaScript.
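To get a feel for what rendering involves, here is a rough sketch that uses the third-party Playwright library to load a page in a headless browser and read the DOM after JavaScript has run. This only approximates what search engines do; the URL is a placeholder, and Playwright must be installed separately (pip install playwright, then playwright install chromium):

```python
# Rough illustration of the "render" step: load a page in a headless browser
# and read the HTML after scripts have executed.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://www.example.com/")   # placeholder URL
    rendered_html = page.content()          # HTML after JavaScript has run
    browser.close()

print(len(rendered_html))
```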
Note that when web crawlers encounter server errors such as HTTP 500 responses, crawl efficiency can drop. In those cases, crawlers typically slow their crawl rate to avoid overloading the server.
Step Five: Indexing
Search engine indexing is when a search engine stores information from web pages in its index.
This helps the search engine’s algorithms decide which pages are most relevant to show when you search for something.
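As a toy illustration of the idea, an index maps terms to the pages that contain them, so a search can look up a word directly instead of rescanning every page. Real search engine indexes are vastly more sophisticated; the pages and text below are invented:

```python
# Toy illustration of the "index" step: map each word to the pages containing it.
from collections import defaultdict

pages = {
    "https://www.example.com/": "welcome to our coffee shop",
    "https://www.example.com/menu/": "espresso and filter coffee menu",
}

index = defaultdict(set)
for url, text in pages.items():
    for word in text.split():
        index[word].add(url)

print(sorted(index["coffee"]))  # pages relevant to a search for "coffee"
```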
Crawling vs. Indexing
Crawling and indexing are often confused in SEO, but they are distinct processes. Knowing how each works is key to optimizing your website effectively.
Here’s a side-by-side comparison to help:
| Feature | Crawling | Indexing |
| --- | --- | --- |
| Definition | The process of discovering new or updated web pages | The process of storing and organizing information from web pages |
| How it works | Automated bots, known as crawlers, follow hyperlinks to explore web content | Bots analyze the content of crawled pages and extract essential information |
| Sources | Hyperlinks, sitemaps, and URL submissions | Pages that have been crawled |
| Goal | To collect data from web pages for inclusion in the index | To create a structured and searchable database of web content |
| Control | The robots.txt file tells crawlers which pages they may access | Meta robots tags (such as noindex) and X-Robots-Tag headers tell search engines which pages may be indexed |
| Outcome | A comprehensive list of URLs | A searchable database that facilitates efficient retrieval of search results |
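That control row is worth underlining: robots.txt governs what gets crawled, while a meta robots tag (or an X-Robots-Tag HTTP header) governs what gets indexed. For example:

```html
<!-- Asks search engines not to index this page while still following its links.
     Crawlers must be able to fetch the page in order to see this tag. -->
<meta name="robots" content="noindex, follow">
```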
How Search Engines Discover and Index Web Pages
Now that you know more about crawling, let’s examine how search engines discover and index web pages.
Discovery is the initial phase. Though we’ve touched on it earlier, a brief recap is useful. During this phase, crawlers find pages in a few ways:
- Search engine bots follow links across the web to discover pages they haven’t seen before.
- Crawlers check a site’s XML sitemap to locate new pages that haven’t been crawled yet.
- For faster discovery, you can manually submit individual URLs using search engine console tools like Google Search Console.
Indexing
Once pages have been discovered and crawled, search engines decide whether to add them to their index, the database that search results are drawn from.
When deciding what to index and how to rank it, search engines analyze many factors, such as content quality, relevance, and keyword usage.
SEO also prioritizes title tags, meta descriptions, and header tags (H1, H2, etc.). Internal and external links, image alt text, page load speed, mobile-friendliness, structured data, and social signals play pivotal roles in SEO.
Additionally, the freshness of content, domain authority, trustworthiness, user engagement metrics, and robots.txt directives contribute significantly to SEO.
There are numerous factors that determine how effectively web pages are indexed and ranked; for a comprehensive list, refer to Backlinko’s breakdown of Google’s 200 ranking factors.
FAQ
How often do search engine crawlers revisit web pages?
Search engine crawlers work on their own schedules, so it’s hard to predict how often they’ll revisit. To get them back more frequently, you can manually submit your URL or sitemap and improve your internal and external links.
Can search engine crawlers access non-text files like images and videos?
Yes, they can! While search engine crawlers can’t actually view images or videos, they gather information from file names, alt text, captions, and surrounding text to understand the content. This helps them index these files appropriately.
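For example, a descriptive file name and alt attribute give crawlers context they can’t get from the pixels themselves; the file name and text below are made-up examples:

```html
<!-- Descriptive file name plus alt text helps crawlers understand the image. -->
<img src="/images/espresso-machine-cleaning.jpg"
     alt="Barista cleaning a commercial espresso machine group head">
```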
How do search engines discover new pages?
Search engines find new pages using crawlers. These crawlers track down pages by following links within a site, links from other sites, sitemaps, and sometimes through manual submissions. This way, search engines can efficiently discover new content.
Need Help with Your SEO?
Not sure if your website is optimized for crawling?
Reach out for a free SEO audit to spot issues and boost your chances of getting indexed and ranked on search engines.