How Crawlers Work in SEO

Stages of Work

A search engine does not simply crawl into your website and magically rank it. There are three known stages of work:

  • Crawl - scour the web, fetching pages and following every URL found.
  • Index - sort, organize, and store the content found.
  • Rank - order the stored content by how well it answers a query.

A search engine starts by crawling your website using its proprietary technology (e.g. Google's crawler, Googlebot, often called a spider). It then processes the content and indexes it in the search engine's database (Google's index is known as "Caffeine"). Lastly, the search engine ranks the indexed content by relevancy to the query.
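
As a minimal sketch of the crawl stage (Python standard library only; the seed URL and limit are placeholders), a crawler keeps a queue of URLs, fetches each page, extracts its links, and queues anything it has not seen yet:

from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    """Collects href targets from anchor tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, limit=10):
    seen, queue = set(), [seed]
    while queue and len(seen) < limit:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url).read().decode("utf-8", errors="ignore")
        except (OSError, ValueError):
            continue  # unreachable or malformed URL: skip it
        parser = LinkCollector()
        parser.feed(html)
        # Resolve relative links against the current page and queue them.
        queue.extend(urljoin(url, link) for link in parser.links)
    return seen

# print(crawl("http://www.example.com/"))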

Not All Crawlers Are The Same

Keep in mind that not all crawlers are the same: each engine crawls and ranks differently (e.g. Google vs. DuckDuckGo). Hence, you should optimize for the engine that most of your target audience uses.

Backlinks / Inbound Links For Authority Measurement

To gauge a website's authority, crawlers follow links and map all incoming links (called backlinks). Backlinks work much like "word of mouth":

  • Referral from others: a good sign of authority.
    • Many different people claim Jenny's Coffee is the best in town.
  • Referral from yourself: a biased claim.
    • Jenny claims Jenny's Coffee is the best in town.
  • Referral from irrelevant or low-quality sources: not good at all.
    • Some random people claim Jenny's Coffee is the best in town (or Jenny paid them to say so).
  • No referral: unclear, neutral.
    • Jenny's Coffee may be good; it is up to the reader to judge, with no outside opinions available.

Constant Evolution

Ranking algorithms (e.g. Google's RankBrain) are constantly being upgraded, so SEO is a continuous effort.

Localized Listing

Some search engines like Google offer local listings, which build local search results based on relevance, distance, and prominence. Reviews and citations also help boost local search ranking.

Checking SEO

Many search engines provide tools for validating and optimizing how your site is crawled and indexed. Here are some examples:

  • Google Search Console
  • Bing Webmaster Tools

When Site Fails To Appear In Search

There are various reasons why a website fails to appear in search results:

  • The website hasn't been crawled yet because it is too new (crawling usually takes 7-8 days).
  • The website isn't linked to from any external websites.
  • The website's navigation is inefficient or ineffective, making it hard for the crawler to identify paths.
  • The website contains instructions telling crawlers not to crawl it.
  • The website has been penalized by the search engine for spammy tactics.
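
One quick first check is the site: search operator, which asks the engine what it has already indexed for a domain (a rough diagnostic, not an official report; the domain below is a placeholder):

site:www.example.com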

Instructing Crawlers

Although a crawler does not belong to you, you can still give it instructions about what to crawl. Some search engines like Google operate with a crawl budget, so you want to provide good instructions so the crawler spends that budget effectively.

IMPORTANT: not all crawlers comply with instructions (e.g. email scrapers). Be careful not to assume these instructions work universally; after all, you do not own the crawler.

Control using Sitemap

A sitemap, as the name implies, provides a navigation map for crawlers to discover your pages. Sitemaps can be placed anywhere, but one should be available at the site's root directory.

A sitemap is an XML file conventionally named sitemap.xml. If you use another name, you need to point the crawler to it via robots.txt (a one-line example follows the sitemap below). Here is an example of a sitemap file:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"> 
  <url>
    <loc>http://www.example.com/foo.html</loc>
    <lastmod>2018-06-04</lastmod>
  </url>
</urlset>
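
If your sitemap lives under a different name or path, a single line in robots.txt (the filename below is hypothetical) points crawlers at it:

Sitemap: http://www.example.com/my-custom-sitemap.xml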

Control via robots.txt

One easy way to instruct crawlers is via the robots.txt file. The filename is case-sensitive, so keep it exactly as-is (lowercase, plural robots, with the .txt extension). The file should always live in the root directory, where it applies to all sub-directories.

Inside robots.txt, you address individual robots with groups of instructions; a crawler follows the most specific group that matches its user agent. Example:

# Group 1
User-agent: Unnecessarybot
Disallow: /

# Group 2
User-agent: Googlebot
Disallow: /nogooglebot/

# Group 3
User-agent: *
Allow: /

Sitemap: http://www.example.com/sitemap.xml

Based on the example above, when a crawler called "Unnecessarybot" visits, it matches the first group and is instructed not to crawl the site at all. When Googlebot visits, it is instructed not to crawl anything under /nogooglebot/ but may crawl everything else. Any other crawler is allowed to crawl the entire site.
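
You can sanity-check these rules with Python's built-in urllib.robotparser; a small sketch replaying the example above (URLs are placeholders):

from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Unnecessarybot
Disallow: /

User-agent: Googlebot
Disallow: /nogooglebot/

User-agent: *
Allow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)  # parse the rules directly instead of fetching them

# Googlebot matches the second group: /nogooglebot/ is off-limits.
print(rp.can_fetch("Googlebot", "http://www.example.com/nogooglebot/page.html"))  # False
print(rp.can_fetch("Googlebot", "http://www.example.com/foo.html"))               # True

# Unnecessarybot matches the first group: the whole site is off-limits.
print(rp.can_fetch("Unnecessarybot", "http://www.example.com/foo.html"))          # False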

Control via Meta Tags / Headers

You can also instruct crawlers using page-specific meta tags or HTTP headers. Some valid directives are: all, noindex, nofollow, none, noarchive, nosnippet, max-snippet, max-image-preview, noimageindex, etc. (see https://developers.google.com/search/reference/robots_meta_tag).

Based on the example above, a comparable per-page instruction with meta tags would be (note that meta directives control indexing and link following rather than crawling):

<meta name="googleBot" content="noindex" />
<meta name="unnecessaryBot" content="noindex, nofollow" />

Similarly, one can use the X-Robots-Tag HTTP response header to the same effect. Example:

HTTP/1.1 200 OK
...
X-Robots-Tag: googlebot: nofollow
X-Robots-Tag: unnecessaryBot: noindex, nofollow
...
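
How the header actually gets set depends on your web server. As a minimal, hypothetical sketch using Python's built-in http.server (handler name and port are placeholders):

from http.server import BaseHTTPRequestHandler, HTTPServer

class NoIndexHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        # Ask all compliant robots not to index this page or follow its links.
        self.send_header("X-Robots-Tag", "noindex, nofollow")
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(b"<p>Not for search results</p>")

# HTTPServer(("", 8000), NoIndexHandler).serve_forever()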

Common Caveats For Crawling

Although it is easy to instruct a crawler, optimizers make many common mistakes that cause crawls to fail.

Contents Locked Up

Crawlers will not crawl content shielded behind walls such as login forms or paywalls. A crawler does not know how to log in, and it will not bother to try.

Navigation Relies On Internal Search Form

Crawlers do not discover content by submitting queries to an internal search box; pages reachable only through site search will not be found.

Text Hidden Inside Images

If important text (like a keyword) is embedded in an image, the crawler will not be able to pick it up. Keep key text as real text, and describe images with alt attributes, as in the example below.
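
A minimal illustration (filename and wording are placeholders):

<img src="jennys-coffee.jpg" alt="Jenny's Coffee storefront in downtown" />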

Poor Navigation and Sitemap

The website's navigation is so poor that the crawler cannot work out how to move around the site.

Mobile and Desktop Navigation Are Different

Separate mobile-focused and desktop-focused websites require two different sets of crawl instructions. Hence, it is usually better to build responsive pages instead.

Bad or Missing Sitemap

There is no sitemap available, or the sitemap is broken or out of date.

You Have a Redirect Chain

There are pages that form a redirect chain (e.g. A to B, then B to C). It is best to simplify it to a single hop (A directly to C).
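
A hypothetical way to spot such a chain with Python's standard library: follow Location headers by hand and record every hop (function name and URL are placeholders):

from urllib.parse import urljoin, urlsplit
import http.client

def redirect_chain(url, max_hops=10):
    # Follow redirects manually so every intermediate hop is visible.
    hops = [url]
    for _ in range(max_hops):
        parts = urlsplit(url)
        conn_cls = http.client.HTTPSConnection if parts.scheme == "https" else http.client.HTTPConnection
        conn = conn_cls(parts.netloc)
        conn.request("HEAD", parts.path or "/")
        resp = conn.getresponse()
        location = resp.getheader("Location")
        conn.close()
        if resp.status not in (301, 302, 307, 308) or not location:
            break
        url = urljoin(url, location)  # Location may be relative
        hops.append(url)
    return hops

# More than two entries means an A -> B -> C style chain worth collapsing.
# print(redirect_chain("http://www.example.com/a"))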

Page Was Removed Before Setting Up a 301 Redirect

The page was taken down before a 301 (permanent) redirect was put in place, so crawlers hit a dead end instead of being sent to the new location.

That's all for learning how crawlers work in SEO.