Algorithms

07

Search

Search is an important area of study in computer science. Just think of how often you search for information on the Internet using Google or some other search engine. It's remarkable how much information Google's algorithms search through and how fast they deliver the results.

As the video describes, when you do a Google search, you aren't actually searching the Web, you're searching Google's index of the Web. Google's spider programs are constantly traversing the web, collecting millions of web pages and organizing them into an index. When you do a Web search Google's algorithms are searching that index.

What's the best algorithm for searching an index? An index is an ordered collection. Think of the index that comes at the back of a textbook. It is organized in alphabetical order. Each entry in the index refers to some page in the book.

Linear (or Sequential) Search  vs Binary Search

I'm thinking of a secret number between 1 and 100. Can you guess the number? I'll tell you whether your guesses are too high or too low or just right. What would be the optimal way to guess? Start at 1, and then 2,3,4 ..., Or is there a better way?

Picking the middle is a great way to divide and conquer, as you remove half the data from consideration each time you guess. But this only works if the data is already sorted, i.e. either numerically or alphabetically. If the data is not sorted, then you do have to start with checking the first item, then the second item, etc. So if the data is unsorted, then a linear (sequential) search is required. If the data is sorted, then a binary search is ideal; even though a linear search will still get the correct answer it's just not as efficient as binary search.

Binary search is more complex but it is much faster. However, the list must be in a sorted order for a binary search to work. Linear search is slower but works with any list in any order. Below are implementations of linear and binary searches. Both searches are set up return the position of target in list if found, otherwise -1 if not found. It will also return the count of the number of comparisons made (guesses). You do not need to understand how binary works, only that is checking the middle position (mid), and then searching either the top or bottom remaining half, and that the data must be sorted for binary to work.

Linear search algorithms check each element of a list, in order, until the desired value is found or all elements in the list have been checked.

The binary search algorithm starts at the middle of a sorted data set of numbers and eliminates half of the data; this process repeats until the desired value is found or all elements have been eliminated. 

Sorting

Sorting is a very important area of study in computer science, even though it will not be on the AP CSP exam. As we saw above, on Search, it is much more efficient to search a collection if its elements are in order. Sorting is the process of putting objects in order. Sorting algorithms have been studied extensively by computer scientists. If you would like to see more info on search and sort, you can check this webpage, which will also be covered in AP CS A.

TClark's overall hint on sorting: just use the list.sort() function that have already been developed for the programming language you are using. The work has already been done in years past, and these functions should be as optimized as possible.

Analyzing Algorithms - Video

Algorithms and heuristic.webm

Analyzing Algorithms - Slides

D07-Analyzing Search Algorithms

Reasonable vs Intractable

The term "reasonable time" describes any algorithm that runs in polynomial time or faster. Exponential time algorithms are not considered reasonable, they are intractable.


If a problem can be solved in exponential time, there might be a different algorithm that can do it in polynomial time (that is, more quickly), but some problems can't possibly be answered in polynomial time. It's important to recognize that an exponential time algorithm still solves a problem correctly; it just takes a unreasonably long time (perhaps even hundreds of years for some inputs, for example).

Exponential time algorithms can sometimes be replaced by heuristics, which are polynomial-time algorithms that don't solve the problem exactly, but give a good enough approximation. But heuristics are useful only for certain kinds of problems.

Travelling Salesperson Problem - Solve Heuristically

An example of a problem for which a heuristic solution is useful is the traveling salesperson problem: For a group of cities, what is the shortest route for a salesperson to visit every city and return to their home city? This is good case for heuristics because:

This heuristic is called "hill climbing" because you'll find the best nearby path (the top of a hill), but there might be a higher hill (a better path) further away. I.e. it is a close enough option for practical use, without necessarily being the perfectly best option. A heuristic is an approach to a problem that produces a solution that is not guaranteed to be optimal but may be used when techniques that are guaranteed to always find an optimal solution are impractical. 

Undecidable Problems

As it turns out, not all decision problems (true/false questions) can be solved with an algorithm.


Parallel Processing

One way to speedup an algorithm's execution time is to create a parallel solution: where multiple processors execute different tasks at the same time. In CMU Graphics, you are accustomed to seeing a bunch of functions that all run independently, which may or may not be associated with different objects (i.e. the event handlers onMousePress, onStep, onKeyPress, etc.). This is kind of like parallel computing. So, if we had a different computer for each event, that would be true parallelism. As it is, there is only one computer, and it divides its attention among the processes by running a little bit of one and then running a little bit of the next one. Specifically, it switches at the bottom of loops (forever, repeat, etc.).

Because every computation includes a sequential portion, there is a limit to how much you can speed up a task by adding processors.

As an example, suppose you want to know the average word length in list of 100,000 words. You can divide the task among several computers (one for each starting letter). Each computer adds the lengths of all the words assigned to it (all the "A" words, all the "B" words, etc). Then one computer has to add the 26 partial results and divide by the total number of words to find the average. To calculate the total run time of this parallel solution, you would add the run time of the longest parallel portion (the run time for the letter with the most words) to the run time of the sequential portion (adding the 26 partial results and dividing the sum by the total number of words).

Still Curious?