External merge sort

Introduction

Laymen explanation

Data to sort may be very big. For example, Database containing very big list of records. Complete data can't be loaded to main memory.

Assume that data is scattered among many machines. In this case, although each machine can have reasonable size of data, but sum of data is likely to be very big.

Technical explanation

External sorting is a term for a class of sorting algorithms that can handle massive amounts of data. External sorting is required when the data being sorted do not fit into the main memory of a computing device (usually RAM) and instead they must reside in the slower external memory (usually a hard drive). External sorting typically uses a hybrid sort-merge strategy. In the sorting phase, chunks of data small enough to fit in main memory are read, sorted, and written out to a temporary file. In the merge phase, the sorted subfiles are combined into a single larger file.

Algorithm

One example of external sorting is the external merge sort algorithm, which sorts chunks that each fit in RAM, then merges the sorted chunks together.[1][2] For example, for sorting 900 megabytesof data using only 100 megabytes of RAM:

Read 100 MB of the data in main memory and sort by some conventional method, like quicksort.
Write the sorted data to disk.
Repeat steps 1 and 2 until all of the data is in sorted 100 MB chunks (there are 900MB / 100MB = 9 chunks), which now need to be merged into one single output file.
Read the first 10 MB (= 100MB / (9 chunks + 1)) of each sorted chunk into input buffers in main memory and allocate the remaining 10 MB for an output buffer. (In practice, it might provide better performance to make the output buffer larger and the input buffers slightly smaller.)
Perform a 9-way merge and store the result in the output buffer. Whenever the output buffer fills, write it to the final sorted file and empty it. Whenever any of the 9 input buffers empties, fill it with the next 10 MB of its associated 100 MB sorted chunk until no more data from the chunk is available. This is the key step that makes external merge sort work externally -- because the merge algorithm only makes one pass sequentially through each of the chunks, each chunk does not have to be loaded completely; rather, sequential parts of the chunk can be loaded as needed.

Additional passes

The above example is a two-pass sort: first sort, then merge. Note that there was a single k-way merge, rather than, say, a series of 2-way merge passes as in a typical in-memory merge sort. This is because each merge pass reads and writes every value from and to disk.

However, there is a limitation to single-pass merging. As the number of chunks increases, we divide memory into more buffers, so each buffer is smaller, so we have to make many smaller reads rather than fewer larger ones. Thus, for sorting, say, 50 GB in 100 MB of RAM, using a single merge pass isn't efficient: the disk seeks required to fill the input buffers with data from each of the 500 chunks (we read 100MB / 501 ~ 200KB from each chunk at a time) take up most of the sort time. Using two merge passes solves the problem. Then the sorting process might look like this:

Run the initial chunk-sorting pass as before.
Run a first merge pass combining 25 chunks at a time, resulting in 20 larger sorted chunks.
Run a second merge pass to merge the 20 larger sorted chunks.

Like in-memory sorts, efficient external sorts require O(n log n) time: exponential increases in data size require linear increases in the number of passes. If one makes liberal use of the gigabytes of RAM provided by modern computers, the logarithmic factor grows very slowly: under reasonable assumptions, one could sort at least 500 GB of data on a hard disk using 1 GB of main memory before a third pass became advantageous, and could sort many times that before a fourth pass became useful.[4] Low-seek-time media like SSDs also increase the amount that can be sorted before additional passes help.

RAM size is important here: doubling memory dedicated to sorting halves the number of chunks and the number of reads per chunk, reducing the number of seeks required by about three-quarters. The ratio of RAM to disk storage on servers often makes it convenient to do huge sorts on a cluster of machines[5] rather than on one machine with multiple passes.

Examples

Example of Two-Way Sorting:

N = 14, M = 3 (14 records on tape Ta1, memory capacity: 3 records.)

Ta1: 17, 3, 29, 56, 24, 18, 4, 9, 10, 6, 45, 36, 11, 43

1. Sorting of runs:
  1. Read 3 records in main memory, sort them and store them on Tb1:
    1. 1. 17, 3, 29 -> 3, 17, 29
      2. Tb1: 3, 17, 29
  2. Read the next 3 records in main memory, sort them and store them on Tb2

- - - 1. 56, 24, 18 -> 18, 24, 56
      2. Tb2: 18, 24, 56

1. Read the next 3 records in main memory, sort them and store them on Tb1

- - 1. 4, 9, 10 -> 4, 9, 10
      1. Tb1: 3, 17, 29, 4, 9, 10

- 1. Read the next 3 records in main memory, sort them and store them on Tb2
    1. 1. 6, 45, 36 -> 6, 36, 45
      2. Tb2: 18, 24, 56, 6, 36, 45
  2. Read the next 3 records in main memory, sort them and store them on Tb1

1. (there are only two records left)

- - 1. 11, 43 -> 11, 43
      1. Tb1: 3, 17, 29, 4, 9, 10, 11, 43

- 1. At the end of this process we will have three runs on Tb1 and two runs on Tb2:
  2. Tb1: 3, 17, 29 | 4, 9, 10 | 11, 43
  3. Tb2: 18, 24, 56 | 6, 36, 45 |

1. Merging of runs
2. B1. Merging runs of length 3 to obtain runs of length 6.
3. Source tapes: Tb1 and Tb2, result on Ta1 and Ta2.

- 1. Merge the first two runs (on Tb1 and Tb2) and store the result on Ta1.
  2. Tb1: 3, 17, 29 | 4, 9, 10 | 11, 43
  3. Tb2: 18, 24, 56 | 6, 36, 45 |

- 1. Thus we have the first two runs on Ta1 and Ta2, each twice the size of the original runs:
  2. Next we merge the third runs on Tb1 and Tb2 and store the result on Ta1. Since only Tb1 contains a third run, it is copied onto Ta1:

Reference

https://en.wikipedia.org/wiki/External_sorting

http://faculty.simpson.edu/lydia.sinapova/www/cmsc250/LN250_Weiss/L17-ExternalSortEX1.htm