In this session we will review several of the thousands of search websites available, use some of them, compare their results, and practice filtering by date and using Boolean-style operators. First we will review how the results you get when you search are gathered by "spiders" (also called web crawlers) and what these programs do on the Web to produce your results.
Before the session, you may email me what you would like us to search for using the various search engines. What you send need not be related to any of your cases or legal work; topics of interest to you that might be fun for us to research are welcome.
What are "spiders" and what do they look for when indexing websites?
To find information on the trillions of Web pages that exist, a search engine employs special software robots, called spiders, to build lists of the words found on Web sites. When a spider is building its lists, the process is called Web crawling. We will look at the HTML code of some websites to see where the site's developers have placed meta keywords for the spiders to capture and put in the index.
Here is an example of the code in the head section of the Organization of Legal Professionals home page:
<meta name="Keywords" content="CLE, continuing legal education, E-Discovery certification, EDD, electronic data discovery, litigation webinars, professional development, law education & support, training & certification exams, CLE requirements, online continuing legal education, law seminars & training, The Organization of Legal Professionals, OLP, ediscovery, law vendors, compliance courses, in-house legal department, CPE, continuing professional education, lawyer assistant, litigation support survey, ALSP, Association of Legal Support Professionals, e-discovery training, Chere Estrin"/>
<meta name="Description" content="The Organization of Legal Professionals is an organization dedicated to higher continuing legal education and certification exams. We offer comprehensive webinars, online courses, and training for litigation support, trial presentation, and E-Discovery. The OLP also offers the first online litigation support salary & utilization survey designed to give employees inside information to move your career forward." />
<link rel="search" type="application/opensearchdescription+xml" title="theolp.wildapricot.org" href="/opensearch.ashx" /></head>
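To see what a spider sees in tags like these, here is a minimal sketch (not any search engine's actual code) that pulls the name/content pairs out of meta tags using only Python's standard-library html.parser. The sample page below is a shortened, hypothetical stand-in for a real head section:

```python
from html.parser import HTMLParser

class MetaTagParser(HTMLParser):
    """Collects <meta name="..." content="..."> pairs from an HTML document."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            name = attrs.get("name", "").lower()
            if name and "content" in attrs:
                self.meta[name] = attrs["content"]

# Hypothetical, shortened sample page
page = ('<head><meta name="Keywords" content="CLE, continuing legal education"/>'
        '<meta name="Description" content="Continuing legal education and '
        'certification exams."/></head>')

parser = MetaTagParser()
parser.feed(page)
print(parser.meta["keywords"])     # CLE, continuing legal education
print(parser.meta["description"])
```

A real spider would fetch the page over the network first; here the HTML is supplied as a string so the parsing step stands on its own.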
Before a search engine can tell you where a file or document is, the file must first be found. In order to build and maintain a useful list of words, a search engine's spiders have to look at a great many pages.
How does any spider start its travels over the Web? The usual starting points are lists of heavily used servers and very popular pages. The spider program will begin with a popular site, indexing the words on its pages and following every link found within the site. In this way, the spidering system quickly begins to travel, spreading out across the most widely used portions of the Web.
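The process just described — start from a popular seed page, index it, and follow every link outward — can be sketched as a breadth-first traversal. This is a simplified illustration, not any engine's actual crawler; the link graph below is hypothetical in-memory data standing in for the live Web:

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to
LINKS = {
    "popular-portal.example": ["news.example", "law-blog.example"],
    "news.example": ["law-blog.example", "archive.example"],
    "law-blog.example": ["popular-portal.example"],
    "archive.example": [],
}

def crawl(seed):
    """Breadth-first crawl: visit the seed, then every page reachable from it."""
    visited = []
    frontier = deque([seed])     # pages waiting to be crawled
    seen = {seed}
    while frontier:
        page = frontier.popleft()
        visited.append(page)     # a real spider would index the page here
        for link in LINKS.get(page, []):
            if link not in seen:     # never queue the same page twice
                seen.add(link)
                frontier.append(link)
    return visited

print(crawl("popular-portal.example"))
```

Note how the "seen" set keeps the spider from looping forever when pages link back to each other, as they constantly do on the real Web.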
Google began as an academic search engine. In the paper that describes how the system was built, Sergey Brin and Lawrence Page give an example of how quickly their spiders can work. They built their initial system to use multiple spiders, usually three at one time. Each spider could keep about 300 connections to Web pages open at a time. At its peak performance, using four spiders, their system could crawl over 100 pages per second, generating around 600 kilobytes of data each second.
Keeping everything running quickly meant building a system to feed necessary information to the spiders. The early Google system had a server dedicated to providing URLs to the spiders. Rather than depending on an Internet service provider for the domain name server (DNS) that translates a server's name into an address, Google had its own DNS, in order to keep delays to a minimum.
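The reasoning behind running your own DNS — answer repeat lookups locally instead of paying a network round-trip each time — can be sketched with a simple cache. Here slow_resolve and its fixed answer are hypothetical stand-ins for a real DNS query:

```python
import functools
import time

def slow_resolve(hostname):
    """Stand-in for a remote DNS query (hypothetical fixed answers)."""
    time.sleep(0.01)  # simulate a network round-trip
    return {"example.com": "93.184.216.34"}.get(hostname, "0.0.0.0")

@functools.lru_cache(maxsize=None)
def cached_resolve(hostname):
    """First lookup pays the round-trip; repeats are answered from memory."""
    return slow_resolve(hostname)

print(cached_resolve("example.com"))  # first call: slow path
print(cached_resolve("example.com"))  # repeat: answered from the cache
```

For a spider resolving the same popular hostnames thousands of times, removing that round-trip on every repeat lookup is exactly the delay Google's dedicated DNS was meant to avoid.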
When the Google spider looked at an HTML page, it took note of two things:
· The words within the page
· Where the words were found
Words occurring in the title, subtitles, meta tags and other positions of relative importance were noted for special consideration during a subsequent user search. The Google spider was built to index every significant word on a page, leaving out the articles "a," "an" and "the." Other spiders take different approaches.
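That indexing step can be sketched in a few lines, assuming a simple word-to-positions index and skipping the three articles mentioned above:

```python
import re

STOP_WORDS = {"a", "an", "the"}  # the articles the early Google spider left out

def index_page(text):
    """Map each significant word to the list of positions where it appears."""
    index = {}
    for position, word in enumerate(re.findall(r"[a-z']+", text.lower())):
        if word not in STOP_WORDS:
            index.setdefault(word, []).append(position)
    return index

print(index_page("The spider indexes the words on a page"))
# {'spider': [1], 'indexes': [2], 'words': [4], 'on': [5], 'page': [7]}
```

Keeping positions (not just the words) is what lets a search engine later give extra weight to words found in titles, headings, or early in the page.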
These different approaches usually attempt to make the spider operate faster, allow users to search more efficiently, or both. For example, some spiders will keep track of the words in the title, sub-headings and links, along with the 100 most frequently used words on the page and each word in the first 20 lines of text. Lycos is said to use this approach to spidering the Web.
Other systems, such as AltaVista, go in the other direction, indexing every single word on a page, including "a," "an," "the" and other "insignificant" words. The push to completeness in this approach is matched by other systems in the attention given to the unseen portion of the Web page, the meta tags.
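The Lycos-style selective approach — keep only the N most frequently used words on the page — can be sketched with Python's collections.Counter. N is 100 in the text; a smaller n is used here so the output stays readable:

```python
import re
from collections import Counter

def top_words(text, n=100):
    """Return the n most frequently used words on the page, most common first."""
    words = re.findall(r"[a-z']+", text.lower())
    return [word for word, count in Counter(words).most_common(n)]

page_text = "spiders index pages and spiders follow links on pages"
print(top_words(page_text, n=3))   # ['spiders', 'pages', 'index']
```

An AltaVista-style full-text index would instead keep every word, trading a much larger index for the ability to match even "insignificant" words.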
· Google Scholar
· Bing
· DogPile
· DuckDuckGo
· Yahoo
· BeenVerified
· TruthFinder
· CheckMate
· PeopleChecker
· Intelius
· Swoogle
Alexa is a global ranking system that uses web traffic data to compile a list of the most popular websites, the Alexa Rank. The lower a site's Alexa rank, the more popular it is: a site ranked 1 has the most visitors on the Internet. Note that many of the search sites listed below no longer exist.
Based on Ask.com (formerly)
· iWon
· Lycos
· Teoma
Based on Bing
· A9.com
· AOL, from 2015
· Ciao!
· Ecosia
· Egerin
· Facebook Search, until 2014
· HotBot
Based on Google
· AOL Search, until 2015
· Groovle
· MySpace Search
· Netscape
· Ripple
· BOL
Based on Yahoo!
· Ecocho
· Everyclick (formerly based on Ask.com)
· Forestle (an ecologically motivated site supporting sustainable rain forests – formerly based on Google)
· Rectifi
Mobile/handheld
· Taganode Local Search Engine
· Taptu: taptu mobile/social search
Semantic
See also: Semantic search
Accountancy
· IFACnet
Business
· GenieKnows (United States and Canada)
Computers
· Ahmia
· Grams
Education
General:
· Chegg
Enterprise
See also: Enterprise search
· Funnelback: Funnelback Search
· Jumper 2.0: Universal search powered by Enterprise bookmarking
· Oracle Corporation: Secure Enterprise Search 10g
· Q-Sensei: Q-Sensei Enterprise
· TeraText: TeraText Suite
· Swiftype: Swiftype Search
Events
· TickX
Food/recipes
· RecipeBridge: vertical search engine for recipes
· Yummly: semantic recipe search
Genealogy
· Mocavo.com: family history search engine
Job
Main article: Job search engine
· Adzuna (UK)
· CareerBuilder (USA)
· Craigslist (by city)
· Dice.com (USA)
· Eluta.ca (Canada)
· Glassdoor (USA)
· Incruit (Korea)
· Indeed (USA)
· JobStreet.com (Southeast Asia, Japan and India)
· Monster.com (USA), (India)
· Naukri.com (India)
· Rozee.pk (Pakistan)
· Yahoo! HotJobs (Countrywise subdomains, International)
Legal
· Quicklaw
· WestLaw
Medical
· CiteAb (antibody search engine for medical researchers)
· EB-eye EMBL-EBI's Search engine
· Healia
· Nextbio (Life Science Search Engine)
· PubGene
· Quertle (Semantic search of the biomedical literature)
· WebMD
News
· Daylife
· Nexis (Lexis Nexis)
· Trapit
People
· FindFace
· PeekYou
· Spock
· Spokeo
· ZoomInfo
Real estate/property
· Redfin
· Trulia
· Zillow
· Zoopla
Video Games
· Wazap
By data type
Maps
· MapQuest
· Wikiloc
Multimedia
See also: Multimedia search
· blinkx
· Munax's PlayAudioVideo
· Pixsta
· Podscope
· SeeqPod
· Songza
· TinEye
· Veveo
Price
· Google Shopping (formerly Google Product Search and Froogle)
· Kelkoo
· MySimon
· TickX
Source code
· Koders
· Krugle