C.2 - Searching the web (6 hours)
C.2.1 - Define the term search engine
C.2.2 - Distinguish between the surface web and the deep web
How the Deep Web works - Howstuffworks
C.2.3 - Outline the principles of searching algorithms used by search engines.
Note: Students will only be expected to understand the principles of the PageRank and HITS algorithms.
PageRank algorithm - Wikipedia article
Hyperlink-Induced Topic Search (HITS), also called Hubs and Authorities - Wikipedia article
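The principles of both algorithms can be illustrated on a tiny hand-made link graph. The following is a minimal sketch, not the implementation used by any real search engine: the graph `web`, the damping factor of 0.85, the fixed iteration count and the function names are all assumptions made for illustration. PageRank shares each page's score among the pages it links to; HITS gives each page a hub score (it links to good authorities) and an authority score (good hubs link to it).

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to.
    Assumes every link target is also a key of `links`."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            # A page with no outlinks spreads its rank over all pages.
            targets = outlinks if outlinks else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


def hits(links, iterations=50):
    """Returns (hub, authority) score dictionaries for the same graph format."""
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking to it.
        auth = {p: 0.0 for p in pages}
        for page, outlinks in links.items():
            for target in outlinks:
                auth[target] += hub[page]
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub score: sum of the authority scores of pages it links to.
        hub = {p: sum(auth[t] for t in links[p]) for p in pages}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth


# Hypothetical four-page web: A links to B and C, B to C, C to A, D to C.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(web)
hub_scores, auth_scores = hits(web)
print(sorted(ranks, key=ranks.get, reverse=True))
print(max(auth_scores, key=auth_scores.get), max(hub_scores, key=hub_scores.get))
```

In this example page C collects links from three other pages, so it ends up with the highest PageRank and the highest authority score, while page A, which links to the two best authorities, is the strongest hub.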
C.2.4 - Describe how a web crawler functions (bots, web-spiders, web-robots)
Google's servers aim to "crawl" (examine) every web page in the visible web, or at least that is the goal. The servers run "spider" programs that "visit" a web page, follow all the links on that page, then follow all the links on those pages, and so on recursively. Eventually the spider must stop, perhaps after "10 iterations", and then "return home". At each web page, the spider does some or all of the following:
make a list of all the words appearing on that page
save the list of all the words (or important words) in the INDEX servers, along with the URL of the page
make a copy of the page and store it in Google's "cache"
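The steps above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not how Google's crawler is built: instead of downloading real pages, it reads from a hypothetical in-memory dictionary `FAKE_WEB`, it uses a queue with a depth limit rather than true recursion, and the names `crawl`, `LinkParser` and `max_depth` are invented for this example.

```python
import re
from collections import deque
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag and the text found on a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.text.append(data)


def crawl(fetch, seed, max_depth=2):
    """fetch(url) returns the page's HTML, or None if it cannot be loaded."""
    index = {}           # word -> set of URLs whose page contains it
    cache = {}           # URL -> stored copy of the page
    seen = {seed}        # URLs already queued, so no page is visited twice
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        cache[url] = html                        # keep a copy of the page
        parser = LinkParser()
        parser.feed(html)
        words = re.findall(r"[a-z0-9]+", " ".join(parser.text).lower())
        for word in words:                       # list the words, save with URL
            index.setdefault(word, set()).add(url)
        if depth < max_depth:                    # the spider must stop somewhere
            for link in parser.links:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return index, cache


# Hypothetical three-page web used instead of real HTTP requests.
FAKE_WEB = {
    "http://a.example/": '<p>Search engines</p><a href="http://b.example/">B</a>',
    "http://b.example/": '<p>Web crawlers</p><a href="http://a.example/">A</a>'
                         '<a href="http://c.example/">C</a>',
    "http://c.example/": "<p>Deep page</p>",
}
index, cache = crawl(FAKE_WEB.get, "http://a.example/", max_depth=1)
print(sorted(cache))
```

With `max_depth=1` the spider visits the seed page and the pages it links to, but does not follow links any further, so the third page is never reached. A real crawler would also need to resolve relative links, respect robots.txt and limit its request rate.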
How Google Works (Web crawler - Indexer - Query processor)
Tasks
1. Try out this search engine spider simulator (toolsoftheweb) to see how Google sees a website
2. Try out one of these web-crawling tools - Top 20 web-crawling tools - Octoparse - April 2017. Make a Google slide presentation on how this crawling program works and share with the rest of the group.
Extra
Python programming tutorial 1 - How to build a Web Crawler - thenewboston
Python programming tutorial 2 - How to build a Web Crawler - thenewboston
Python programming tutorial 3 - How to build a Web Crawler - thenewboston
C.2.5 - Discuss the relationship between data in a meta-tag and how it is accessed by a web crawler
A meta-tag is a special HTML tag that provides information about a web page. Unlike normal HTML tags, meta-tags do not affect how the page is displayed.
The title tag, not strictly a meta-tag, is what is shown in the results, through the indexer.
The description meta-tag provides the indexer with a short description of the page.
The keywords meta-tag provides a list of keywords that describe the page.
Meta-tags used to play a role in ranking, but many pages abused them, so most search engines no longer consider them for ranking. Crawlers now mostly use meta-tags to compare the keywords and description with the content of the page and give it a certain weight. For this reason, although meta-tags no longer play the big role they used to, it is still important to include them.
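As a rough illustration of how a crawler could read this data, the sketch below pulls the title, description and keywords out of a page and checks which declared keywords actually occur in the page text. The sample page, the class name `MetaParser` and the function `keyword_check` are assumptions for this example, and the simple substring comparison is only a stand-in for whatever weighting a real search engine applies.

```python
from html.parser import HTMLParser


class MetaParser(HTMLParser):
    """Collects the <title> text, named <meta> tags and the remaining page text."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self.body = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = attrs.get("name")
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and name:
            self.meta[name.lower()] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        else:
            self.body.append(data)


def keyword_check(html):
    """Returns (title, description, declared keywords, keywords found in the text)."""
    parser = MetaParser()
    parser.feed(html)
    declared = [k.strip().lower()
                for k in parser.meta.get("keywords", "").split(",") if k.strip()]
    text = " ".join(parser.body).lower()
    found = [k for k in declared if k in text]
    return parser.title, parser.meta.get("description", ""), declared, found


# Hypothetical page: the keyword "casino" is declared but never used in the text.
PAGE = """<html><head><title>Web Crawlers</title>
<meta name="description" content="How crawlers read pages">
<meta name="keywords" content="crawler, index, casino">
</head><body><p>A crawler builds an index of pages.</p></body></html>"""

title, description, declared, found = keyword_check(PAGE)
print(title, "|", description, "|", declared, "|", found)
```

A keyword that is declared in the meta-tag but missing from the page text, like "casino" here, is the kind of mismatch that would lower the trust placed in the page's meta-tags.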
Meta tags that Google understands - Webmasters
C.2.6 - Discuss the use of parallel web crawling
A parallel crawler is a crawler that runs multiple processes in parallel. The goal is to maximise the download rate while minimising the extra computational time and bandwidth from parallelisation and to avoid repeated downloads of the same page. To avoid downloading the same page more than once, the crawling system requires a policy for assigning the new URLs discovered during the crawling process, as the same URL can be found by two different crawling processes.
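One possible assignment policy, shown here only as a sketch, is to hash the host name of each discovered URL and use the result to pick the crawling process responsible for it. The same URL then always goes to the same process, however many processes discover it. The function names and the choice of SHA-256 are assumptions for this example; real crawlers may use other policies, such as a central coordinator.

```python
import hashlib
from urllib.parse import urlsplit


def assign_worker(url, num_workers):
    """Maps a URL to a worker number based on a stable hash of its host name."""
    host = urlsplit(url).netloc.lower()
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


def partition(urls, num_workers):
    """Splits discovered URLs into one queue per worker, dropping duplicates."""
    queues = [set() for _ in range(num_workers)]
    for url in urls:
        queues[assign_worker(url, num_workers)].add(url)
    return queues


# The same URL discovered twice (by two different processes) is stored once.
discovered = [
    "http://a.example/page1",
    "http://a.example/page2",
    "http://b.example/",
    "http://a.example/page1",
]
queues = partition(discovered, num_workers=3)
for worker, queue in enumerate(queues):
    print(worker, sorted(queue))
```

Hashing the host rather than the full URL keeps all pages of one website with one process, which also makes it easier to avoid overloading that site with simultaneous requests.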
C.2.7 - Outline the purpose of web-indexing in search engines
Web indexing (or Internet indexing) refers to various methods for indexing the contents of a website or of the Internet as a whole. Individual websites or intranets may use a back-of-the-book index, while search engines usually use keywords and metadata to provide a more useful vocabulary for Internet or onsite searching. With the increase in the number of periodicals that have articles online, web indexing is also becoming important for periodical websites. (Wikipedia article).
What is web indexing
How search works
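The purpose of an index can be shown with a small inverted index: a table that maps each word to the pages containing it, so that a query is answered by a lookup instead of re-reading every page. This is a minimal sketch with made-up documents and function names; real search engine indexes also store word positions, metadata and ranking information.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Splits text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """documents maps URL -> page text; returns word -> set of URLs."""
    index = defaultdict(set)
    for url, text in documents.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Returns the URLs that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical page texts standing in for crawled pages.
documents = {
    "http://a.example/": "Search engines index the web",
    "http://b.example/": "A web crawler visits pages",
    "http://c.example/": "Crawler and index work together",
}
inverted = build_index(documents)
print(sorted(search(inverted, "web")))
print(sorted(search(inverted, "crawler index")))
```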
C.2.8 - Suggest how web developers can create pages that appear more prominently in search engine results.
Search engine optimization (SEO) is a set of strategies, techniques and tactics used to increase the number of visitors to a website by obtaining a high-ranking placement in the search engine results page (SERP) of Google, Bing, Yahoo and other search engines.
SEO helps to ensure that a site is accessible to a search engine and improves the chances that the site will be found by the search engine.
Top 5 data testing tools in 2022
7 advanced SEO techniques - Biztech - Jan 2022
13 of the best SEO tools for auditing and monitoring website performance - HubSpot
Intro to Structured data - Google (2022)
Practical: Test specific data in a range of search engines and examine (i) the time taken (ii) number of hits and quality of returns.
Assignment: How can a web developer improve search engine optimisation in 2022?
C.2.9 - Describe the different metrics used by search engines.
(Note - Make sure you are talking about metrics for measuring SEARCH ENGINE PERFORMANCE, as opposed to metrics for SEO.)
Beginners Guide to measuring search engine performance metrics - Tortoise & Hare software - Feb 2021
10 Google analytics metrics you absolutely must track - 29 March 2021
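Two widely used measures of how well a search engine answers a query are precision (what fraction of the returned pages are relevant) and recall (what fraction of the relevant pages were returned). The sketch below computes both, plus their harmonic mean (F1), for one query. The page names and the relevance judgements are invented for illustration; the linked articles above may emphasise other metrics.

```python
def precision_recall(retrieved, relevant):
    """Returns (precision, recall, f1) for one query.

    retrieved: pages the search engine returned
    relevant:  pages a human judged to be relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical query: 4 pages returned, 3 pages are actually relevant.
returned_pages = ["p1", "p2", "p3", "p4"]
relevant_pages = ["p1", "p3", "p5"]
precision, recall, f1 = precision_recall(returned_pages, relevant_pages)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here two of the four returned pages are relevant (precision 0.5), and two of the three relevant pages were found (recall about 0.67).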
C.2.10 - Explain why the effectiveness of a search engine is determined by the assumptions made when developing it.
(Note: Students will be expected to understand that the ability of a search engine to produce the required results is based primarily on the assumptions used when developing the algorithms that underpin it.)
Google and other "normal" search engines assume that websites are connected by links that a crawler can follow, which means they cannot index the "deep web", whose pages cannot be reached that way.
Some search engines focus on topics:
scholar.google.com is a search engine that concentrates on academic papers
WAP search engines focus on results for Smartphone users
http://www.social-searcher.com/ concentrates on social-networking sites
C.2.11 - Discuss the use of white hat and black hat search engine optimisation.
(Note: Developers of search engines should have a moral responsibility to produce an objective page ranking).
Black hat SEO refers to attempts to improve rankings in ways that are not approved by search engines and that involve deception. These methods go against current search engine guidelines. White hat SEO refers to the use of good-practice methods to achieve high search engine rankings. These methods comply with search engine guidelines - Diffen explanation
C.2.12 - Outline future challenges to search engines as the web continues to grow.
(Note: Issues such as error management, lack of quality assurance of information uploaded).
Future challenges to search engines (Google doc)
Google outlines the future of its search engine - Financial Times - 18 Aug 2021