This project analyzes the time complexity of three well-known sorting algorithms in computer science.
Although some sorting algorithms share the same time complexity, their performance in practice varies depending on the order of the data. This project uses analytical tools such as Pandas, Seaborn, and Matplotlib to compare the performance of bubble, insertion, and selection sort on a case-by-case basis.
The three sorting algorithms were programmed in Java and the tests were performed on an older AMD A6-3400M processor. The test results were written to a CSV file.
The test comprised sorting integer arrays of varying sizes and orders. The arrays contained 10 000, 25 000, 50 000, and 100 000 elements. The possible orderings of the array elements, or the array types, include random, mostly ascending, ascending, equal, and descending. For each combination of algorithm, array type, and array size, a test sorted the array into ascending order.
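As a rough illustration of the test setup, the five array types could be generated along these lines. This is a sketch only: the class name, method names, seed, and value ranges are assumptions, not the project's actual Java source.

```java
import java.util.Arrays;
import java.util.Random;

// Illustrative generators for the five array types used in the tests.
// All names, the seed, and the "mostly ascending" perturbation rate are
// assumptions for this sketch.
class ArrayTypes {
    static final Random RNG = new Random(42); // fixed seed for reproducibility

    static int[] random(int n) {
        return RNG.ints(n, 0, 1_000_000).toArray();
    }

    static int[] ascending(int n) {
        int[] a = new int[n];
        for (int i = 0; i < n; i++) a[i] = i;
        return a;
    }

    static int[] descending(int n) {
        int[] a = new int[n];
        for (int i = 0; i < n; i++) a[i] = n - i;
        return a;
    }

    static int[] equal(int n) {
        int[] a = new int[n];
        Arrays.fill(a, 7); // every element identical
        return a;
    }

    static int[] mostlyAscending(int n) {
        int[] a = ascending(n);
        // Swap roughly 1% of positions so the array is nearly, but not
        // fully, sorted.
        for (int k = 0; k < Math.max(1, n / 100); k++) {
            int i = RNG.nextInt(n), j = RNG.nextInt(n);
            int tmp = a[i]; a[i] = a[j]; a[j] = tmp;
        }
        return a;
    }
}
```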
The data encompasses a total of 540 unique tests across 60 rows. Each row represents an observation of nine tests and their corresponding time average. The data and source code are available on GitHub.
Shape
The data is tabular (or rectangular). The sort type and array type columns are qualitative, while the remaining columns are quantitative.
The average-time column and the sequential test-time columns are continuous, while the array size is discrete.
Granularity
The granularity of the data is that each row is a set of nine tests, one set for each unique combination of sort type, array type, and array size. Each row is coarse-grained in the sense that it groups nine tests, although the individual test times are still listed in their own numbered columns and could be separated into finer-grained rows. For the purpose of this project, however, the average of the test times is the variable of interest.
Scope and Completeness
The goal of the project is to analyze and compare the time performance of three sorting algorithms on data in different orders. The data covers five possible ways a data structure could be ordered: random, ascending, descending, equal, and mostly ascending. Each test (row) is a set of nine test times and their average. Hence, the data is within the scope of the project's goal and sufficient to examine the time performance of each respective algorithm.
Temporality
The results of the tests were recorded and written to a CSV file on the day this project was published. It is also worth mentioning that time does not have a significant impact on the data. This is because the data describes general trends in the time performance of these three algorithms when sorting data in different orders, as opposed to their performance in specific use cases. Thus, unless a major breakthrough occurs in computer science, the general trends described by the data remain true even as more powerful computers are created.
Faithfulness
There are no problems with the faithfulness of the data. The data contains no outliers, missing or unrealistic values, misspellings, or falsified results. Additionally, the data aims to capture reality by describing general trends in the time performance of the three sorting algorithms on data in different orders. In practice the absolute times may differ, but allowing for some degree of variation, the shape of the graphs should look similar in real-world scenarios. Anyone is free to try to reproduce the results observed in this project; the source code is freely available on GitHub.
Cleanliness
Since the test results were formatted specifically for this project, the data does not need cleaning.
Distribution of test times and their average
The histograms show the distributions of all the test times and their averages, respectively. It should be noted that although some of the test times in the histograms appear to be outliers, they are, in fact, well within the scope of the data: they capture crucial test times from the algorithms' worst-case scenarios. These test times are important for evaluating the strengths and weaknesses of each individual algorithm. More on this in the following sections.
As illustrated by the line plots, the time performance of the sorting algorithms is heavily affected by the order of the data. The line shadow represents the confidence interval, or the range of values by which the average time may vary on subsequent tests. This project explores ways to take advantage of the confidence interval and the variability of the test times to obtain lower averages. As the plot demonstrates, the variability is higher for the random and descending array types because these orders represent the worst-case scenarios of the sorting algorithms. Also, since the relationship between average time and array size is quadratic, the variability grows with the array size as well, although the confidence interval remains small relative to the average time. This data can be used to capitalize on that window of variation by playing to the strengths of each respective algorithm to yield lower average times.
The bar chart shows the average time performance of the algorithms on data in different orders. As expected, all the algorithms performed well on arrays in equal and ascending orders, as those arrays were already sorted.
It is worth noting that the random array type had the worst average time, by a few hundred milliseconds. Since the algorithms sort arrays into ascending order, the descending array type should be the worst case: the algorithms must swap every element, executing the maximum number of operations. The bar chart is therefore surprising, because in a randomly ordered array one would expect most elements to be out of order on average, yet at least some elements should already be in sorted order. Hence, I expected the random and descending bars in the chart to trade places.
However, on closer inspection, my implementation of the descending array type used a for-loop to decrement an initial random value by one, array-size number of times. Hence, the array was in strictly sequential descending order. I suspect this is the reason for the unexpected results: although the sorting algorithms swapped every element, the sequential order meant that the number of comparisons per iteration followed a regular, linear pattern. No such pattern exists for the random array type.
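A sketch of such a generator, with illustrative names and seed (not the project's actual source), decrements a random starting value once per element:

```java
import java.util.Random;

// Sketch of the decreasing-array generator described above: a for-loop
// decrements a random initial value once per element, so the result is in
// strictly sequential descending order.
class DecreasingArray {
    static int[] decreasing(int n, long seed) {
        int[] a = new int[n];
        int value = new Random(seed).nextInt(1_000_000);
        for (int i = 0; i < n; i++) {
            a[i] = value--; // each element is exactly one less than the previous
        }
        return a;
    }
}
```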
In this particular test, selection and insertion sort performed approximately 6 to 10 times better than bubble sort when sorting data arranged in the various orders.
Let's look at a side-by-side comparison of the time performance of each algorithm.
As you might have noticed, bubble and insertion sort performed particularly well on the equal and ascending array types, while selection sort did not. There is a good reason for this. Due to the nature of bubble and insertion sort, the sorting operation can terminate as soon as the array is sorted. In the case of bubble sort, the algorithm tracks whether a swap occurred during each pass; if none did, the array is sorted and the algorithm terminates immediately. Insertion sort benefits similarly: on already sorted data its inner loop exits immediately, so each element is handled in constant time.
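A minimal sketch of bubble sort with this swap flag (a sketch of the technique, not the project's exact Java source):

```java
// Bubble sort with the early-termination flag described above: the algorithm
// tracks whether a pass performed any swap; if not, the array is already
// sorted and the loop stops.
class BubbleSort {
    static int[] sort(int[] a) {
        for (int pass = a.length - 1; pass > 0; pass--) {
            boolean swapped = false;
            for (int i = 0; i < pass; i++) {
                if (a[i] > a[i + 1]) {
                    int tmp = a[i]; a[i] = a[i + 1]; a[i + 1] = tmp;
                    swapped = true;
                }
            }
            if (!swapped) break; // no swaps: the array is sorted, stop early
        }
        return a; // returned for convenience; the sort is in place
    }
}
```

On an already sorted array the outer loop runs only a single pass, which is why the equal and ascending bars are so low for bubble sort.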
Moreover, insertion sort performed considerably better than selection sort on sorted and mostly sorted arrays, while selection sort did better on descending arrays. Their performance was comparable on the random array type.
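The contrast can be seen by sketching the two algorithms side by side (illustrative code, assuming standard textbook versions rather than the project's exact implementations):

```java
// Why insertion sort wins on sorted and mostly sorted input: its inner loop
// exits immediately when the next element is already in place. Selection
// sort's inner loop always scans the entire unsorted suffix, so its running
// time is the same regardless of the input's order.
class CompareSorts {
    static int[] insertionSort(int[] a) {
        for (int i = 1; i < a.length; i++) {
            int key = a[i];
            int j = i - 1;
            while (j >= 0 && a[j] > key) { // exits at once on sorted input
                a[j + 1] = a[j];
                j--;
            }
            a[j + 1] = key;
        }
        return a;
    }

    static int[] selectionSort(int[] a) {
        for (int i = 0; i < a.length - 1; i++) {
            int min = i;
            for (int j = i + 1; j < a.length; j++) { // full scan, every pass
                if (a[j] < a[min]) min = j;
            }
            int tmp = a[i]; a[i] = a[min]; a[min] = tmp;
        }
        return a;
    }
}
```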
On average, the time complexity of all three sorting algorithms can be described as O(N²). The time it takes to sort a data structure grows quadratically with the size of the data.
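To make the quadratic growth concrete: selection sort, for example, always performs exactly N·(N−1)/2 comparisons, so doubling the array size roughly quadruples the work. A back-of-the-envelope sketch (the other two algorithms follow the same trend in their worst cases):

```java
// Comparison count for selection sort: the inner loop compares the current
// minimum against every remaining element, giving (n-1) + (n-2) + ... + 1
// = n * (n - 1) / 2 comparisons in total, regardless of the input's order.
class ComparisonCount {
    static long selectionSortComparisons(long n) {
        return n * (n - 1) / 2;
    }
}
```

For the test sizes above, going from 50 000 to 100 000 elements roughly quadruples the count (about 1.25 billion versus 5 billion comparisons), matching the quadratic shape of the plots.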
The results of this analysis demonstrate that despite all three algorithms sharing the same time complexity, their performance varies considerably depending on the order of the data. Each algorithm has strengths and weaknesses, and although on average their performance can be described as O(N²), it is possible to obtain lower average times by taking each algorithm's strengths and weaknesses into consideration, as illustrated by the bar charts.