Before optimizing a website to be search engine friendly, one must understand how the crawler works. This section guides you through how a search engine crawler does its work.
A search engine does not just crawl your website and magically rank it. There are 3 known stages of work:
First, the search engine crawls your website using its proprietary crawler (e.g. Googlebot, often called a "spider"). Then, it parses the content and indexes it inside the search engine's database (Google's index is called "Caffeine"). Lastly, the search engine ranks the indexed content based on relevancy to the search query.
Keep in mind that not all crawlers are the same; one engine's crawling method does not necessarily match another's (e.g. Google vs. DuckDuckGo). Hence, you should focus on the engine that most of your targeted audience uses.
To gauge the website's authenticity, the crawler follows links and maps all incoming links (called backlinks). Backlinks work similarly to "word of mouth": the more reputable sites link to your content, the more trustworthy your website appears to the search engine.
The ranking system (e.g. Google's RankBrain) is constantly being upgraded, so SEO is actually a continuous effort.
Some search engines like Google have a local listing feature, which creates local search results based on relevance, distance, and prominence. It also covers reviews and citations, which help boost the local search ranking.
Many search engines provide tools for performing optimization and validation, for example Google Search Console and Bing Webmaster Tools.
There are various reasons a website fails to appear in search results; the most common ones are covered in the list of common mistakes later in this section.
Although the crawler does not belong to you, you can still provide instructions that steer it toward the intended pages. Some search engines like Google impose a crawl budget, so you want to give the crawler good instructions to make it crawl effectively.
IMPORTANT: not all crawlers comply with instructions (e.g. email scrapers). Hence, be careful not to assume that these instructions work universally. After all, the crawler is not owned by you.
A sitemap, as the name implies, provides a navigation map for the crawler to seek out your pages. Sitemaps can be made available anywhere, but one must be available in the root directory.
A sitemap is an XML file conventionally named sitemap.xml. If you are using another name, you need to instruct the crawler via robots.txt. Here is an example of a sitemap file:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/foo.html</loc>
    <lastmod>2018-06-04</lastmod>
  </url>
</urlset>
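If your sitemap uses a non-default name or location, you can point crawlers to it from robots.txt (covered next). A minimal sketch, assuming a hypothetical filename:
Sitemap: http://www.example.com/my-custom-sitemap.xml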
One easy way to instruct the crawler is via the robots.txt
file. The filename is case-sensitive and fixed, so keep it exactly as it is (lowercase, plural "robots", with a .txt extension). The file should always sit in the root directory, where it affects all the sub-directories.
Inside robots.txt, you address each robot with its own group of instructions; a crawler obeys the group whose User-agent matches it most specifically. Example:
# Group 1
User-agent: Unnecessarybot
Disallow: /
# Group 2
User-agent: Googlebot
Disallow: /nogooglebot/
# Group 3
User-agent: *
Allow: /
Sitemap: http://www.example.com/sitemap.xml
Based on the example above, when a crawler called "Unnecessarybot" arrives, it matches the first group and receives the instruction not to crawl the entire site. If the Googlebot crawler arrives, it receives the instruction not to crawl the /nogooglebot/
path but is allowed to crawl all others. Any other crawler is then allowed to crawl the entire site.
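When Disallow and Allow rules overlap for the same crawler, major engines like Google honor the rule with the more specific path. A minimal sketch, assuming a hypothetical /private/ section with one page you still want crawled:
# Block the /private/ section, except one specific page
User-agent: *
Disallow: /private/
Allow: /private/public-page.html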
You can also instruct the crawler using page-specific meta tags or HTTP headers. Some valid directives are: all, noindex, nofollow, none, noarchive, nosnippet, max-snippet, max-image-preview, noimageindex, etc. (see https://developers.google.com/search/reference/robots_meta_tag).
Based on the example above, a roughly equivalent set of meta tags (keeping in mind that robots.txt controls crawling while these directives control indexing) would be:
<meta name="googlebot" content="noindex" />
<meta name="unnecessarybot" content="noindex, nofollow" />
Similarly, one can use the X-Robots-Tag HTTP response header to the same effect. Example:
HTTP/1.1 200 OK
...
X-Robots-Tag: googlebot: nofollow
X-Robots-Tag: unnecessarybot: noindex, nofollow
...
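The header itself is usually set in the web server configuration. A minimal sketch, assuming an Apache server with mod_headers enabled, that keeps every PDF file out of the index:
<Files ~ "\.pdf$">
  Header set X-Robots-Tag "noindex, nofollow"
</Files>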
Although it is easy to instruct the crawler, optimizers make a lot of common mistakes that cause the crawl to fail:
The crawler will not crawl web content shielded behind walls like authentication logins or paywalls. It does not know how to log in, and it would not bother to do so.
The crawler will not crawl by performing search-box submissions, so content reachable only through an on-site search stays invisible.
If important text (like keywords) is embedded inside an image, the crawler will not be able to pick it up.
The website's navigation is so poor that the crawler does not know how to find its way around.
Separate mobile-focused and desktop-focused websites require 2 different sets of instructions. Hence, it is always better to create responsive pages.
There is no sitemap available, or the sitemap is broken or outdated.
There are pages causing a redirect chain (e.g. A to B, then to C). It's best to simplify it to a single hop (A straight to C); see the sketch after this list.
Pages were moved or taken down without setting up a 301 redirect first, leaving previously indexed URLs pointing nowhere.
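As mentioned in the redirect items above, a single 301 keeps both users and crawlers on track when pages move. A minimal sketch, assuming an Apache server and hypothetical paths, that collapses an A-to-B-to-C chain into single hops:
# Send the old URLs straight to the final destination, no chaining
Redirect 301 /page-a.html /page-c.html
Redirect 301 /page-b.html /page-c.html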
That's all for learning how the crawler works in SEO.