Algorithms
07
Search
Search is an important area of study in computer science. Just think of how often you search for information on the Internet using Google or some other search engine. It's remarkable how much information Google's algorithms search through and how fast they deliver the results.
As the video describes, when you do a Google search, you aren't actually searching the Web; you're searching Google's index of the Web. Google's spider programs are constantly traversing the web, collecting millions of web pages and organizing them into an index. When you do a Web search, Google's algorithms search that index.
What's the best algorithm for searching an index? An index is an ordered collection. Think of the index that comes at the back of a textbook. It is organized in alphabetical order. Each entry in the index refers to some page in the book.
Linear (or Sequential) Search vs Binary Search
I'm thinking of a secret number between 1 and 100. Can you guess the number? I'll tell you whether each guess is too high, too low, or just right. What would be the optimal way to guess? Start at 1, then 2, 3, 4, ...? Or is there a better way?
Picking the middle is a great way to divide and conquer, because you remove half the data from consideration with each guess. But this only works if the data is already sorted, either numerically or alphabetically. If the data is not sorted, you have to check the first item, then the second item, and so on. So if the data is unsorted, a linear (sequential) search is required. If the data is sorted, a binary search is ideal; a linear search will still find the correct answer, it's just not as efficient as a binary search.
Binary search is more complex, but it is much faster. However, the list must be in sorted order for a binary search to work. Linear search is slower but works with any list in any order. Below are implementations of linear and binary search. Both are set up to return the position of the target in the list if found, or -1 if not found, along with a count of the number of comparisons made (guesses). You do not need to understand exactly how binary search works, only that it checks the middle position (mid), then searches either the top or bottom remaining half, and that the data must be sorted for binary search to work.
Linear search algorithms check each element of a list, in order, until the desired value is found or all elements in the list have been checked.
The binary search algorithm starts at the middle of a sorted data set of numbers and eliminates half of the data; this process repeats until the desired value is found or all elements have been eliminated.
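Here is a minimal Python sketch of the two searches described above (one possible implementation, not necessarily the exact code used in class). Each function returns the position of the target (or -1) along with the number of comparisons it made.

import random  # not required; only used if you want to experiment with other lists

def linear_search(lst, target):
    """Check each element in order until target is found or the list ends."""
    comparisons = 0
    for i in range(len(lst)):
        comparisons += 1
        if lst[i] == target:
            return i, comparisons
    return -1, comparisons

def binary_search(lst, target):
    """Repeatedly check the middle of a SORTED list, discarding half each time."""
    comparisons = 0
    low, high = 0, len(lst) - 1
    while low <= high:
        mid = (low + high) // 2
        comparisons += 1
        if lst[mid] == target:
            return mid, comparisons
        elif lst[mid] < target:
            low = mid + 1      # target must be in the upper half
        else:
            high = mid - 1     # target must be in the lower half
    return -1, comparisons

numbers = list(range(1, 101))        # 1 to 100, already sorted
print(linear_search(numbers, 73))    # (72, 73) -- 73 comparisons
print(binary_search(numbers, 73))    # (72, 6)  -- only 6 comparisons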
Sorting
Sorting is a very important area of study in computer science, even though it will not be on the AP CSP exam. As we saw above with search, it is much more efficient to search a collection if its elements are in order. Sorting is the process of putting objects in order. Sorting algorithms have been studied extensively by computer scientists. If you would like more info on searching and sorting, you can check this webpage; the topic is also covered in AP CS A.
TClark's overall hint on sorting: just use the list.sort() function that has already been developed for the programming language you are using. The work has already been done in years past, and these built-in functions are about as optimized as possible.
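For example, in Python (the language used with CMU Graphics), sorting a list takes one line. This quick sketch just illustrates the built-in functions:

fruits = ["pear", "apple", "banana"]
fruits.sort()                  # sorts the list in place, alphabetically
print(fruits)                  # ['apple', 'banana', 'pear']

numbers = [42, 7, 19, 3]
print(sorted(numbers))         # sorted() returns a new sorted list: [3, 7, 19, 42]
print(numbers)                 # the original list is unchanged: [42, 7, 19, 3]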
Reasonable vs Intractable
The term "reasonable time" describes any algorithm that runs in polynomial time or faster. Exponential time algorithms are not considered reasonable, they are intractable.
Sequential search takes linear time: O(N), in green
Binary search takes logarithmic time: O(log₂ N), in blue
Quicksort / mergesort take linearithmic time: O(N log₂ N), in orange
Bubble sort / insertion sort take polynomial time: O(N²), in red
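To get a rough feel for these growth rates, here is a small Python sketch (illustrative only) that prints the approximate number of steps for each class as N grows; constant factors are ignored, so treat the values as orders of magnitude.

import math

print(f"{'N':>8} {'log2 N':>8} {'N log2 N':>12} {'N^2':>14}")
for n in [10, 100, 1000, 100000]:
    print(f"{n:>8} {round(math.log2(n)):>8} {round(n * math.log2(n)):>12} {n ** 2:>14}")

# Exponential time is in a different league entirely: even for N = 100,
# 2**100 is about 1.3 * 10**30 steps, far beyond any reasonable run time.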
If a problem can be solved in exponential time, there might be a different algorithm that can solve it in polynomial time (that is, much more quickly), but some problems can't possibly be answered in polynomial time. It's important to recognize that an exponential-time algorithm still solves a problem correctly; it just takes an unreasonably long time (perhaps even hundreds of years for some inputs).
Exponential time algorithms can sometimes be replaced by heuristics, which are polynomial-time algorithms that don't solve the problem exactly, but give a good enough approximation. But heuristics are useful only for certain kinds of problems.
A decision problem is a problem with a true/false answer (for example, "is 5,825,496,221 a prime number?").
An optimization problem is one with the goal of finding the best solution among many (for example, "what's the best school schedule to place every student into as many of their requested classes as possible?").
Travelling Salesperson Problem - Solve Heuristically
An example of a problem for which a heuristic solution is useful is the traveling salesperson problem: for a group of cities, what is the shortest route for a salesperson to visit every city and return to their home city? This is a good case for heuristics because:
It's clear that there must actually be a shortest route.
There is no known polynomial time algorithm for this problem. (Most computer scientists think that it's not possible to write one.)
There is a possible heuristic: pick a path at random, then try to improve it by swapping two cities repeatedly until you can't make the path better with such small changes.
This heuristic is called "hill climbing" because you'll find the best nearby path (the top of a hill), but there might be a higher hill (a better path) farther away. In other words, it produces a route that is close enough for practical use without necessarily being the best possible one. A heuristic is an approach to a problem that produces a solution that is not guaranteed to be optimal, but may be used when techniques that are guaranteed to find an optimal solution are impractical.
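Here is a minimal Python sketch of that hill-climbing idea, using a small set of made-up city coordinates; a real solver would be more careful about restarts and swap strategies.

import random, math

# Hypothetical city coordinates (x, y) -- any set of points would do.
cities = [(0, 0), (3, 1), (6, 0), (7, 5), (4, 7), (1, 6), (2, 3)]

def tour_length(order):
    """Total distance of visiting the cities in this order and returning home."""
    total = 0
    for i in range(len(order)):
        here = cities[order[i]]
        there = cities[order[(i + 1) % len(order)]]   # wrap around to the start
        total += math.dist(here, there)
    return total

def hill_climb():
    order = list(range(len(cities)))
    random.shuffle(order)                 # start with a random route
    improved = True
    while improved:
        improved = False
        for i in range(len(order)):
            for j in range(i + 1, len(order)):
                new_order = order[:]
                new_order[i], new_order[j] = new_order[j], new_order[i]  # swap two cities
                if tour_length(new_order) < tour_length(order):
                    order = new_order     # keep the better route
                    improved = True
    return order, tour_length(order)

print(hill_climb())   # a good route, but not guaranteed to be the shortest one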
Undecidable Problems
As it turns out, not all decision problems (true/false questions) can be solved with an algorithm.
A decidable problem is a decision problem for which it's possible to write an algorithm that will give a correct output for all inputs.
The question "Is the integer even?" is an example of a decidable problem because it's possible to write an algorithm that will determine whether any integer is even.
An undecidable problem is the opposite. It's not possible to write an algorithm that will give a correct output for all inputs—even though it might be possible for some of them.
The question "Will this computer program that takes an input always eventually report a result?" is an undecidable problem. It's possible to write such a checking algorithm that would be able to say for some programs with some inputs whether they will report a result or get stuck in an infinite loop and never report. But it turns out that it's not possible write a checking algorithm that will work for any program with any input.
Note that an undecidable problem cannot be solved correctly for all inputs, no matter how much time is available. That makes it different from an intractable problem, which can be solved given enough (possibly unreasonable) time.
Parallel Processing
One way to speed up an algorithm's execution is to create a parallel solution, where multiple processors execute different tasks at the same time. In CMU Graphics, you are accustomed to seeing a bunch of functions that all run independently, which may or may not be associated with different objects (i.e., the event handlers onMousePress, onStep, onKeyPress, etc.). This is somewhat like parallel computing. If we had a different computer for each event, that would be true parallelism. As it is, there is only one computer, and it divides its attention among the processes by running a little bit of one and then a little bit of the next. Specifically, it switches at the bottom of loops (forever, repeat, etc.).
In sequential computing, operations are performed in order one at a time.
In parallel computing, the program is broken into smaller steps, some of which are performed at the same time. Modern computers have multiple processors (2, 4, or 8) in a single machine, so you can do small-scale parallel processing on the machine on your desk.
Distributed computing is a form of parallel computing that uses multiple computers (perhaps even spread out around the world).
Because every computation includes a sequential portion, there is a limit to how much you can speed up a task by adding processors.
As an example, suppose you want to know the average word length in a list of 100,000 words. You can divide the task among several computers (one for each starting letter). Each computer adds the lengths of all the words assigned to it (all the "A" words, all the "B" words, etc.). Then one computer has to add the 26 partial results and divide by the total number of words to find the average. To calculate the total run time of this parallel solution, you add the run time of the longest parallel portion (the run time for the letter with the most words) to the run time of the sequential portion (adding the 26 partial results and dividing the sum by the total number of words).
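Here is a small Python sketch of that divide-and-combine idea, run sequentially for simplicity (a real parallel version might hand each letter's words to a separate processor, for example with Python's multiprocessing module). The word list and the timing numbers in the final comment are made up for illustration.

words = ["apple", "avocado", "banana", "berry", "cherry", "corn", "date"]

# "Parallel" portion: each worker would handle the words for one starting letter.
partial = {}
for word in words:
    letter = word[0]
    length_sum, count = partial.get(letter, (0, 0))
    partial[letter] = (length_sum + len(word), count + 1)

# Sequential portion: one worker combines the partial results (26 letters at most).
total_length = sum(s for s, c in partial.values())
total_count = sum(c for s, c in partial.values())
print(total_length / total_count)    # average word length

# Parallel run time = time of the slowest worker + time of the combining step.
# For example, if the busiest letter takes 40 time units and combining takes 5,
# the parallel total is 45 units, versus roughly 100 units done sequentially.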
Still Curious?
Try the How secure is my password site.
Do some online research to explore alternatives to password schemes, for example two-factor authentication, biometrics, and virtual tokens. What are their relative advantages and disadvantages?
One field of computer science that makes extensive use of heuristics is Artificial Intelligence (AI). You've probably heard of it. The field of AI traditionally tackles problems that humans are good at but computers are not (yet) good at -- for example, vision, speech recognition, natural language understanding, planning, driving, and so on. However, great progress is being made in these various areas -- just think for a moment about how well Siri and similar intelligent digital assistants work today. In fact, try asking Siri "Hey Siri, how do you solve the traveling salesman problem?". AI is a vast field.