If information can be compared to a superhighway, then the search engine is the king of the road. Search engines have become an indispensable tool in the short time since the Internet became immensely popular.
A search engine is a software system designed to search for information on the World Wide Web. The information may be a mix of web pages, images, and other types of files.
In fact, people now find it a chore to consult a reference book or make phone calls to gather information. The present and future generations will instead bank on Google, Bing, Yahoo!, MSN, AltaVista, or other search engine websites to collect information and acquire knowledge. When users enter a word in the search bar, the website returns numerous search results almost instantly. Since this technology is now so easily accessible, the library and the Yellow Pages are beginning to look primitive.
Google is presently the dominant search engine, handling more than 5 billion searches per day; Bing and Yahoo! follow. The ultimate objective of any business is to reach the top of the search results and thereby gain the greatest marketing exposure on the Internet. Businesses therefore promote themselves on Google and other search engines and, in return, move closer to the top results.
Search engines allow users to search the internet for content using keywords. Although the market is dominated by a few, there are many search engines that people can use. When a user enters a query into a search engine, a search engine results page (SERP) is returned, ranking the found pages in order of their relevance. How this ranking is done differs across search engines.
Search engines often change their algorithms (the programs that rank the results) to improve user experience. They aim to understand how users search and give them the best answer to their query. This means giving priority to the highest quality and most relevant pages.
The first search engine to come into existence was Archie, whose name was derived from the word "archive". It was created by Alan Emtage, a student at McGill University in Montreal. Archie was a simple search engine that kept an index of the file listings of all the public FTP servers it could find, so that users could locate publicly available files and download them. This was a much better way to find files, as previously people could learn about files only by word of mouth. The drawback of the Archie search engine was that, due to limited space, only the file listings were available, not the contents of each site.
The next generation of search engines, Veronica and Jughead, appeared in 1991. They were built on the Gopher protocol, a TCP/IP application-layer protocol designed for distributing, searching, and retrieving documents over the internet. Jughead differed from Veronica in that it searched a single server at a time, while the Veronica database could be searched from most major Gopher menus. Both lost relevance when the Gopher protocol was displaced by the Web and its HTML (HyperText Markup Language).
Archie and Veronica gave way to VLib, Excite, and ALIWEB, which arrived in 1992-93, when the World Wide Web (WWW) was taking its final shape. VLib, the WWW Virtual Library, was set up by Tim Berners-Lee as a virtual online library hosted on the CERN web server in Switzerland. Excite was created by undergraduates from Stanford University. But the first major breakthrough was achieved by Martijn Koster, who created ALIWEB. This was an advanced search tool that indexed metadata, and users could submit the pages they wanted listed.
Most of the significant search engines, such as Infoseek, AltaVista, WebCrawler, Yahoo! Search, and Lycos, were developed in 1994. These engines had many features that produced meaningful listings based on the keywords searched. WebCrawler was the first crawler to index entire pages; it was later bought by the internet giant AOL. Lycos went public with a catalog of 54,000 documents in July 1994. By August 1994 it had identified 394,000 documents, a figure that expanded to 60 million documents by 1996.
Despite indexing a smaller set of about 20 million web pages, AltaVista was more popular than Lycos. AltaVista changed hands several times, passing from Digital Equipment Corporation to Compaq to Overture before finally being bought by Yahoo!. After being eclipsed by Google, AltaVista was finally shut down on 8 July 2013.
Like Excite, Yahoo! was also a Stanford University initiative. It was founded in January 1994 by Jerry Yang and David Filo, both electrical engineering graduate students at Stanford. It started with a bang but could not sustain the momentum against competition from Google, and its core business was eventually sold to Verizon for $4.83 billion in 2017.
The search engine that has become eponymous with search, its name now a generic word, is Google. Sergey Brin met Larry Page at Stanford University in the summer of 1995. Larry was doing doctoral research with Terry Winograd as his adviser, and Sergey was a second-year graduate student in the university's computer science department. Stanford in those days was a hotbed of internet entrepreneurs who were developing exciting technologies, applications, and portals and becoming millionaires by incubating their ventures. Larry and Sergey embarked on a venture to crawl, rank, and index web pages, which would ultimately take the shape of the now famous Google search engine.
Microsoft, although predominantly an operating system and application package developer, always had its eye on the search engine market. It initially tried to capture the market with MSN (Microsoft Network) Search, which could not stand up to the competition from Google and Yahoo!. Microsoft poured more resources into research and development to build a better search engine, and MSN Search metamorphosed into Windows Live Search, then Live Search, and finally Bing. Microsoft was initially interested in buying Yahoo!, but after the deal fell through it concentrated on making Bing a better alternative. The effort finally paid off, and Bing is now the second most popular search engine after Google.
There are more than a thousand search engines in both specialized and generalized segments; the exact number is difficult to determine, as there is no official registry of search engines. Google, Bing, Yahoo!, Baidu, and Ask lead the search engine market, while the others handle minuscule shares. Google alone handles about 115 billion searches a month and holds the lion's share (65.4 per cent) of the market, although Microsoft's Bing (15.8 per cent) is fast catching up. Relatively new search engines from Quora, Yandex, SlideShare, and Vimeo are also gaining popularity. But Google has so far sustained its position at the top through its huge R&D effort and by keeping its search engine updated with relevant information.
1. Crawling and Data Mining
Crawling the web is done by bots, also called spiders: small programs sent out from a central computer to collect data. They are pre-programmed to start at one website and collect all of its information and links. Those links are recorded, and the list of links then becomes the order in which the bot continues its path of data collection. So a spider might start at lifepacific.edu, but the links on the homepage to the Foursquare denomination and to WASC become the next places the spider goes after processing everything under the lifepacific.edu domain. After the spider is full, or after a set time, it returns and uploads the content of the web pages and all the links back to the central computer.
Data mining is the collection of all the data the bot returned. Entire web pages, preserved in HTML, are stored on the search engine's servers. The stored version is not the live version of the web page (what you see when you enter the URL in your browser) but a historical version called the cached version.
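As a concrete illustration, here is a minimal sketch of that crawl-and-cache loop in Python, using only the standard library. The seed URL and page limit are placeholders; a real crawler would also obey robots.txt, deduplicate more carefully, and schedule revisits, none of which is shown here.

# A minimal sketch of the breadth-first "collect links, then follow them"
# loop described above. Real crawlers are distributed systems; this only
# illustrates the idea.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href targets of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=10):
    """Breadth-first crawl; returns {url: cached_html}."""
    cache = {}                      # the "cached version" of each page
    frontier = deque([seed_url])    # links recorded for later visits
    while frontier and len(cache) < max_pages:
        url = frontier.popleft()
        if url in cache:
            continue
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except (OSError, ValueError):
            continue                # unreachable or malformed URLs are skipped
        cache[url] = html           # store a snapshot of the page as fetched
        parser = LinkExtractor()
        parser.feed(html)
        # Resolve relative links and queue them in discovery order.
        frontier.extend(urljoin(url, link) for link in parser.links)
    return cache

pages = crawl("https://example.com", max_pages=5)
print(list(pages))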
Bots can be told to return to web pages frequently if the content changes often. A website like BBC News, for example, would request frequent return visits because its content changes so often.
Bots will not find everything on the web. If there are no links to a page, it is essentially invisible to search engines. If a page requires a password, or is generated in response to a query, it will never be stored in a search engine. The web pages that will never be searched this way are called the deep web or the invisible web.
2. Indexing
Indexing is the process of recording EVERY word and character in a web page, along with its location. The same concept is found in the back of a book, where major words are listed with the pages on which they occur. The search engine version records where each word occurs within a page, for EVERY occurrence in EVERY website that has been crawled. Google's index, the largest known internet index, is stored in a system called Bigtable and is so large that it needs indices to its indices; the amount of data is enormous.
The indexing process not only records locations but also converts everything into numbers. Computers work with 1s and 0s, not with the English alphabet or any other. Converting the words to numbers matters because the search process itself is based not on words and letters but on math.
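The idea of recording every occurrence and converting words to numbers can be sketched as a toy positional inverted index in Python. The two sample "documents" below are invented for illustration and are nothing like the scale of a real index.

# A toy version of the indexing step: every word is mapped to a numeric
# term ID, and the index records each document and position where that
# term occurs (a positional inverted index).
import re

term_ids = {}   # word -> integer ID ("converting words to numbers")
index = {}      # term ID -> {doc_id: [positions]}

def index_document(doc_id, text):
    for position, word in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
        tid = term_ids.setdefault(word, len(term_ids))
        index.setdefault(tid, {}).setdefault(doc_id, []).append(position)

index_document("page1", "Search engines index every word on every page.")
index_document("page2", "An index maps each word to the pages it occurs on.")

word = "index"
tid = term_ids[word]
print(word, "-> term ID", tid, "-> occurrences", index[tid])
# index -> term ID 2 -> occurrences {'page1': [2], 'page2': [1]}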
3. Query Processing
The query, what you type in the search box, has to be converted into numbers so that the engine can process your request. Before conversion, though, the search engine discards several terms. Most search engines keep a list of stop words, words that will not be searched: the, and, it, be, will, and so on. Those short words are just filler to the computer. If you absolutely need those words in the search, you must enclose them in quotation marks (or, in Google, add a plus sign before the term). Once the terms are converted to numbers, the engine calculates which indexed terms are closest, mathematically, to what you asked for. The algorithm is complex, but it returns items based on how close each is mathematically to your query; the closer ones are listed higher on the results list. Some engines even show a percentage relevance score.
Higher relevance scores are shaped by whether the words appear in the title rather than just in the text, whether a word occurs in bold or italics on the page, how many times the word occurs on a page, the number and quality of links to that page, and whether the words occur in the page's head section (the invisible tags created by the web programmer).
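Putting the stop-word list and the relevance signals together, here is a hedged sketch of the query-processing step, building on the toy index above. The stop-word list and the weights for title, bold, and frequency matches are invented numbers purely for illustration; real engines combine hundreds of undisclosed signals.

# A sketch of query processing over the toy index above: drop stop words,
# look up the remaining terms, and score each matching document.
STOP_WORDS = {"the", "and", "it", "be", "will", "a", "an", "of", "to", "on"}

# Per-document metadata a crawler might have recorded (made-up sample data).
docs = {
    "page1": {"title": "search engine index", "bold": {"index"}},
    "page2": {"title": "about this site",     "bold": set()},
}

def score(query):
    terms = [w for w in query.lower().split() if w not in STOP_WORDS]
    scores = {}
    for word in terms:
        tid = term_ids.get(word)
        if tid is None:
            continue
        for doc_id, positions in index[tid].items():
            s = scores.setdefault(doc_id, 0.0)
            s += len(positions)                 # frequency on the page
            if word in docs[doc_id]["title"]:
                s += 5.0                        # word appears in the title
            if word in docs[doc_id]["bold"]:
                s += 2.0                        # word appears in bold
            scores[doc_id] = s
    # Higher scores are listed first, like a results page.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(score("the word index"))
# [('page1', 9.0), ('page2', 2.0)]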
Something to keep in mind: you are not searching the entire internet when you search; you are only searching an index of the internet. Google has the largest index and will return billions of hits, while Yahoo!'s is smaller and returns fewer. The difference is not just in how many hits but in which hits: each search engine sent its bots in different directions, so they have indexed different parts of the web. On top of that, the results lists differ because the engines use different algorithms (many exist, and some are closely guarded secrets).
The World Wide Web (WWW) went live to the world in 1991. From then until today, an enormous amount of information has been uploaded, shared, and accessed online. The moment you search on your favourite search engine (be it Google, Yahoo!, or Bing) using a specific keyword or phrase, a set of results (the most significant ones) is listed before you. It certainly is not magic! So who collects, sorts, and displays all of this for you? It is the work of the search engines. To better understand what search engines do, we shall look into their major functionalities.
1. Crawling – discovering information on web pages
Every search engine has a vital software component called the crawler (also known as a spider or bot) that goes through (or "crawls") each of your web pages, storing their contents in the search engine's database. A crawler can pick up new content on a web page as well as revisit older data. The bots crawl all the contents of web pages, often several websites at a time, following each and every hyperlink, both internal and external, until they can no longer find any more information.
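The internal-versus-external distinction a crawler makes can be illustrated in a few lines of Python; the URLs below are placeholders.

# A link is "internal" if it resolves to the same host as the page on
# which it was found; otherwise it points to an external site.
from urllib.parse import urljoin, urlparse

def classify_link(page_url, href):
    target = urljoin(page_url, href)   # resolve relative links
    same_host = urlparse(target).netloc == urlparse(page_url).netloc
    return ("internal" if same_host else "external"), target

print(classify_link("https://example.com/a", "/about"))
# ('internal', 'https://example.com/about')
print(classify_link("https://example.com/a", "https://other.org/"))
# ('external', 'https://other.org/')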
2. Indexing – creation of index database
After the contents of the web pages are crawled, they have to be indexed based on the occurrences of various keywords. This increases the efficiency with which the search engine can accurately fetch information corresponding to a particular query in a short span of time. Every query given by a user consists of a few phrases or keywords. While indexing the contents of a web page, common articles such as "a", "an", and "the" are skipped, and the indexed information is stored in an organized manner.
Search engine designers develop search algorithms that look for a match between the keywords entered by the user and those found within the web page content, using the index. If there is a good match, the search engine considers the page as one of the results.
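As a minimal illustration of this matching step, the sketch below (reusing the toy term_ids and index structures from the earlier indexing sketch) treats a page as a candidate result only when every query keyword appears in it; real matching is far more forgiving than this strict intersection.

# A page qualifies as a candidate result only if it contains all query
# keywords: intersect the document sets found for each term in the index.
def matching_documents(query_terms):
    candidates = None
    for word in query_terms:
        tid = term_ids.get(word)
        docs_with_term = set(index.get(tid, {})) if tid is not None else set()
        # Keep only documents containing every term seen so far.
        candidates = docs_with_term if candidates is None else candidates & docs_with_term
    return candidates or set()

print(matching_documents(["search", "index"]))   # {'page1'}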
3. Results – fetching of the relevant data
The hyperlinks displayed after you search for a particular key phrase are the results. Every search engine has its own algorithm that sorts and displays the most relevant data as results, so you may not get the same website rankings for a single keyword across different search engines. As mentioned earlier, the algorithms compare keywords for matches using the index.
The whole working of a search engine is a complex process that depends on the algorithms developed. And since none of the search engines fully reveals its algorithms, it is not really possible to understand exactly how everything works. But we now know for sure that the crawlers, or bots, have a huge role to play!