C.2 - Searching the web (6 hours)

C.2.1 - Define the term search engine

C.2.2 - Distinguish between the surface web and the deep web

C.2.3 - Outline the principles of searching algorithms used by search engines. 

Note: Students will only be expected to understand the principles of the PageRank and HITS algorithms.

How Google's PageRank Algorithm works

Hubs and Authorities
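The core idea of PageRank — a page is important if important pages link to it — can be sketched in a few lines of Python. This is only an illustration of the principle, not Google's production algorithm; the four-page graph and the damping factor of 0.85 are example values.

```python
# A tiny "web": page -> list of pages it links to (invented example graph).
web = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def pagerank(graph, damping=0.85, iterations=50):
    """Repeatedly share each page's rank out across its outgoing links."""
    n = len(graph)
    ranks = {page: 1.0 / n for page in graph}        # start with equal rank
    for _ in range(iterations):
        new_ranks = {page: (1 - damping) / n for page in graph}
        for page, links in graph.items():
            if links:
                share = ranks[page] / len(links)     # split rank over links
                for target in links:
                    new_ranks[target] += damping * share
            else:
                # Dangling page with no links: spread its rank over all pages.
                for target in graph:
                    new_ranks[target] += damping * ranks[page] / n
        ranks = new_ranks
    return ranks

print(pagerank(web))
```

Here page C ends up with the highest rank because three pages link to it, and the total rank always sums to 1.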

C.2.4 - Describe how a web crawler functions (bots, web-spiders, web-robots)


Google's servers must "crawl" (examine) all the web pages on the entire visible web — at least, that is the goal. The servers run "spider" programs that visit a web page, follow all the links on that page, and then follow all the links on those pages, recursively. Eventually the spider must stop — perhaps after 10 iterations — and then "return home". At each web page, the spider does some or all of the following: it records the page's content for the indexer, extracts the links to follow next, and notes metadata such as the page title.
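The visit-then-follow-links loop described above can be sketched as a toy breadth-first spider. The pages and links here are an invented in-memory "web" so the example is self-contained; a real spider would fetch the pages over HTTP and parse the links out of the HTML.

```python
from collections import deque

# A toy "web": URL -> (page text, list of linked URLs). Invented example data.
PAGES = {
    "home":   ("Welcome",  ["about", "blog"]),
    "about":  ("About us", ["home"]),
    "blog":   ("Posts",    ["home", "about", "secret"]),
    "secret": ("Hidden",   []),
}

def crawl(start, max_depth=10):
    """Breadth-first crawl: visit a page, record it, queue its links."""
    seen = {start}
    queue = deque([(start, 0)])
    index = {}
    while queue:
        url, depth = queue.popleft()
        text, links = PAGES[url]
        index[url] = text              # hand the page content to the indexer
        if depth < max_depth:          # stop after a fixed number of "hops"
            for link in links:
                if link not in seen:   # never queue the same page twice
                    seen.add(link)
                    queue.append((link, depth + 1))
    return index

print(sorted(crawl("home")))
```

Starting from "home", the spider reaches all four pages, including "secret", which is only linked from the blog.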


How Google Works  (Web crawler - Indexer - Query processor)


Tasks

1. Try out this search engine spider simulator  (toolsoftheweb) to see how Google sees a website

2. Try out one of these web-crawling tools - Top 20 web-crawling tools - Octoparse - April 2017. Make a Google Slides presentation on how this crawling program works and share it with the rest of the group.

Extra

C.2.5 - Discuss the relationship between data in a meta-tag and how it is accessed by a web crawler

A meta tag is a special HTML tag that provides information about a web page. Unlike normal HTML tags, meta tags do not affect how the page is displayed.

The title tag, though not strictly a meta-tag, is what the indexer shows as the headline of a search result.

The description meta-tag provides the indexer with a short description of the page.

The keywords meta-tag provides, as the name suggests, keywords describing your page.

Meta-tags once played a direct role in ranking, but the mechanism was so widely abused that most search engines no longer use them that way. Crawlers now mostly use meta-tags by comparing the keywords and description against the actual content of the page and weighting the page accordingly. So while meta-tags no longer play the big role they once did, it is still important to include them.
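A crawler reads meta-tags straight out of the page's HTML source. The sketch below uses Python's standard `html.parser` to extract the description and keywords from a made-up page, and then does the kind of keyword-versus-content comparison described above (the page, its tags, and the matching rule are all invented for illustration):

```python
from html.parser import HTMLParser

# An invented example page with a title tag and two meta-tags.
HTML = """<html><head>
<title>Baking 101</title>
<meta name="description" content="Simple bread recipes">
<meta name="keywords" content="bread, baking, recipes">
</head><body>Fresh bread and baking tips.</body></html>"""

class MetaExtractor(HTMLParser):
    """Collect name/content pairs from <meta> tags as the page is parsed."""
    def __init__(self):
        super().__init__()
        self.meta = {}
    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            if "name" in attrs and "content" in attrs:
                self.meta[attrs["name"]] = attrs["content"]

parser = MetaExtractor()
parser.feed(HTML)
print(parser.meta["description"])

# Compare the declared keywords with the visible page content, as a
# crawler might when deciding how much weight to give the tags.
body = "Fresh bread and baking tips."
keywords = [k.strip() for k in parser.meta["keywords"].split(",")]
matching = [k for k in keywords if k in body.lower()]
print(matching)
```

In this example "bread" and "baking" appear in the body text but "recipes" does not, so only two of the three declared keywords are backed up by the content.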

Meta tags that Google understands - Webmasters

C.2.6 - Discuss the use of parallel web crawling

A parallel crawler is a crawler that runs multiple crawling processes in parallel. The goal is to maximise the download rate while minimising the overhead (the extra computation and bandwidth) that parallelisation introduces. Because the same URL can be discovered by two different crawling processes, the system also needs a policy for assigning newly discovered URLs to processes, so that no page is downloaded more than once.
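One common assignment policy is to hash each URL: every process that discovers a URL computes the same hash, so each URL has exactly one "owner" process and is never downloaded twice. A minimal sketch of that policy (the crawler count and URLs are invented; real systems add more machinery on top):

```python
import hashlib

NUM_CRAWLERS = 4  # example: four crawling processes running in parallel

def owner(url):
    """Deterministically map a URL to exactly one crawler process."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_CRAWLERS

# Two processes might both discover the same URL; hashing sends it to the
# same owner both times, so it is only fetched once.
discovered = ["http://a.example/", "http://b.example/page", "http://a.example/"]
for url in discovered:
    print(url, "-> crawler", owner(url))
```

Because the mapping is a pure function of the URL, the processes need no communication at all to agree on who fetches what.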

C.2.7 - Outline the purpose of web-indexing in search engines

Web indexing (or Internet indexing) refers to various methods for indexing the contents of a website or of the Internet as a whole. Individual websites or intranets may use a back-of-the-book index, while search engines usually use keywords and metadata to provide a more useful vocabulary for Internet or onsite searching. With the increase in the number of periodicals that have articles online, web indexing is also becoming important for periodical websites. (Wikipedia article).
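The data structure at the heart of keyword-based web indexing is the inverted index: a map from each word to the set of pages containing it, built once by the indexer so that queries never have to scan every page. A toy sketch with three invented pages:

```python
# Invented example pages: page id -> page text.
pages = {
    "p1": "the quick brown fox",
    "p2": "the lazy dog",
    "p3": "quick dog tricks",
}

# Build the inverted index: word -> set of pages containing that word.
index = {}
for page, text in pages.items():
    for word in text.split():
        index.setdefault(word, set()).add(page)

def search(*words):
    """Return the pages containing ALL of the query words (an AND query)."""
    results = [index.get(w, set()) for w in words]
    return set.intersection(*results) if results else set()

print(sorted(search("quick")))         # pages p1 and p3
print(sorted(search("quick", "dog")))  # only p3 has both words
```

Answering a query is then just set intersection over precomputed word lists, which is why an indexed search is fast regardless of how many pages exist.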

What is web indexing

How search works

C.2.8 - Suggest how web developers can create pages that appear more prominently in search engine results.

Search engine optimization is a methodology of strategies, techniques and tactics used to increase the number of visitors to a website by obtaining a high-ranking placement on the search engine results page (SERP) of search engines including Google, Bing and Yahoo.

SEO helps to ensure that a site is accessible to a search engine and improves the chances that the site will be found by the search engine.

(Webopedia)

Top 5 data testing tools in 2022

7 advanced SEO techniques - Biztech - Jan 2022

13 of the best SEO tools for auditing and monitoring website performance - Hubspot

Intro to Structured data - Google (2022)

Practical: Test specific queries in a range of search engines and examine (i) the time taken and (ii) the number of hits and the quality of the results returned.

Assignment: How can a web developer improve search engine optimisation in 2022?

C.2.9 - Describe the different metrics used by search engines.

(Note - Make sure you are talking about metrics for measuring SEARCH ENGINE PERFORMANCE, as opposed to metrics for SEO.)
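Two standard measures of search engine performance (as opposed to SEO metrics) are precision and recall: the fraction of returned results that are relevant, and the fraction of all relevant pages that were returned. A minimal sketch with invented example result sets:

```python
# Invented example: the pages a search engine returned for a query,
# and the pages that are actually relevant to that query.
returned = {"p1", "p2", "p3", "p4"}
relevant = {"p2", "p4", "p5"}

true_positives = returned & relevant               # relevant AND returned
precision = len(true_positives) / len(returned)    # 2 of 4 results relevant
recall = len(true_positives) / len(relevant)       # 2 of 3 relevant pages found

print(precision, round(recall, 2))
```

Other commonly measured quantities include response time and index coverage; precision and recall capture the quality of the results themselves.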

C.2.10 - Explain why the effectiveness of a search engine is determined by the assumptions made when developing it. 

(Note-Students will be expected to understand that the ability of a search engine to produce the required results is based primarily on the assumptions used when developing the algorithms that underpin it).

Google and other "normal" search engines assume that web pages are connected by hyperlinks, which means they cannot index the "deep web" — pages that are not reachable by following links, such as content behind logins or generated on demand from database queries.

Some search engines instead focus on specific topics.

C.2.11 - Discuss the use of white hat and black hat search engine optimisation. 

(Note: Developers of search engines should have a moral responsibility to produce an objective page ranking).

Black hat SEO refers to attempts to improve rankings in ways that are not approved by search engines and that involve deception; these techniques go against current search engine guidelines. White hat SEO refers to the use of good-practice methods to achieve high search engine rankings; these comply with search engine guidelines - Diffen explanation

C.2.12 - Outline future challenges to search engines as the web continues to grow. 

(Note: Issues such as error management and the lack of quality assurance of uploaded information).