Homework 1

Deadline: 30th of October 2016.

Submit: by web form

General rules: read them

Description of the problem

Create a program that sorts a file stored on a computer's hard drive. Assume that the input file is stored in the current directory and named data.txt. The file data.txt is a text file. Each row consists of a single 63-bit unsigned value -- a key. In the output file data.out the keys should be sorted in ascending order without duplicates. Also for each key write the number of the first row on which it occurred. The number of the first row of the data file is 1. You have to read the input data.txt, write the output data.out, and keep all temporary files in the current directory.

Your solution has to read the data from the file data.txt and write the sorted data into the file data.out. Assume that the files are in the current directory.

The memory of the test computer is 8 GiB, OS Linux.

Example

Input

96313

6024613951808161023

8417427497858712892

3894

96313

6437842131676518652

Output

3894 4

96313 1

6024613951808161023 2

6437842131676518652 6

8417427497858712892 3

Implementation challenges

Memory

Most of the running time of your program will be consumed by reading and writing the input and output data, so-called I/O. Therefore it is vital to minimize the number of I/O operations. For example, compare the time necessary to compare two integers, which is in the order of nanoseconds, with the time required to read an integer from the hard disk drive, which is in milliseconds [1]. When reading from the hard disk drive it is necessary to wait until the plate rotates under the head (rotational latency) and/or the head is moved into the right position (seek time). Hence it is convenient to read subsequent data at once [2]. The page cache of the operating system and the I/O buffer addresses this problem [3]. Some of the disadvantages of hard disk drives are improved by SSDs.

It is also necessary to choose a fast I/O library. You can find many comparisons and blogs on the Internet dealing with this problem [4]. The only reasonable advice is that you should test and measure the performance of your approach yourself. In this task the basic functions of your programming language should suffice.

To pre-sort input data it is also convenient to know the size of the operating memory of your computer in advance. For the testing environment it is 8 GiB. Pay attention to allocating a reasonable amount of memory. If you allocate more than available, then paging may occur. Paging (using swap space in Linux) requires disk I/O operations and thus significantly slows down your program [5]. When measuring your running time, always carefully setup the size of the allocated memory with respect to the used computer. For our purposes a constant in the source code is fine.

  1. http://www.agner.org/optimize/instruction_tables.pdf, https://en.wikipedia.org/wiki/Instructions_per_second

  2. 6. course of Principles of Computers, Wikipedia - HDD performance characteristics

  3. Page Cache Basics in Linux, Java: BufferedInputStream

  4. Faster IO in C, Why does java read a big file faster than C, C++ performance vs. Java/C#, C++ Streams vs. C-style IO?, When to use printf/scanf vs cout/cin?, Using scanf() in C++ programs is faster than using cin?

  5. 16. course of Principles of Computers, Principles of Computers - paging, archive, Wikipedia: paging

Hardware

To estimate the speed of your solution, use computers in the unix laboratory Rotunda u-pl1 to u-pl37. These computers are of two types. The first one is faster and has 16 GiB of operating memory and also SSDs. The second one has only 4 GiB of memory and smaller HDDs. The evaluation environment will have 8 GiB of memory and an HDD. Set up your solution so that it works efficiently in this environment. The common memory state of the test computer is:

$ free -m total used free shared buff/cache available Mem: 7476 139 132 81 7204 6985 Swap: 7985 0 7984

Data generator and testing

Data generator

To generate data use the generator available at the unix Rotunda laboratory /afs/ms.mff.cuni.cz/u/b/babkm5am/ds1/hw1/gen-data.

Generator options

Without any option the generator outputs a file with the length used for the evaluation on the test computer.

-s XX

XX is the seed of the pseudo-random generator. When debugging, use several seeds and verify that the speed of your solution is consistent.

-l LL

LL is the length of the output, i.e. the number of keys generated.

--half

Shortens the length by half. The output size will be around 25 GiB.

--short

Sets the length to 1/16 compared to the standard length. The output size is around 3 GiB.

--super-short

Sets the length to 1/256 compared to the standard length. Tho output size is around 200 MiB.

Examples

$/afs/ms.mff.cuni.cz/u/b/babkm5am/ds1/hw1/gen-data -s XX >/tmp/data.txt

generates the testing data, approximately 44 GiB.

$/afs/ms.mff.cuni.cz/u/b/babkm5am/ds1/hw1/gen-data -s XX --short >/tmp/data.txt

data useful for debugging, around 3 GiB.

Testing

Use the unix program /afs/ms.mff.cuni.cz/u/b/babkm5am/ds1/hw1/gen-data available at Rotunda laboratory to generate the input file data.txt. Just redirect the standard output of the generator to the file data.txt. Then measure the running time of your solution using one of the computers from u-pl to u-pl37. For measuring use the time utility. The following table gives example running times needed to obtain all points (at most 13 and 28 minutes on the fast and slow computers, respectively), and the times for which you would receive no points (more than 19 and 45 minutes on the fast and slow computers, respectively).

Your program should fulfil time limits on both types of computers. Our testing computer has 8 GiB RAM and an HDD disk. Your source code may contain a constant determining the amount of data processed at once. This constant needs to be set appropriately for the testing computer before submission.

When running your program, save the input file data.txt, the output file data.out, and the temporary files into the /tmp directory. You have to read the input data.txt, write the output data.out, and keep all temporary files in the current directory. After finishing your work, delete all your files in /tmp so that other students can run their programs. If you do not delete your files you consume almost the whole disk space. Especially, do not forget about the input and output files. Also remember that others cannot delete your files and thus the computer will be blocked.

Also be careful when you see that someone else uses the computer. Use commands who and top or htop. If you see that someone else is performing computations use another computer or wait until he or she is finished. Please report any intolerant behaviour or a blocked computer to ds1@kam.mff.cuni.cz. The lab administrator can clear the /tmp directory, if necessary.

Submitting your solution

Submit your solution (the source code) before the deadline at https://kam.mff.cuni.cz/~ds1/. Send us your compilation command in the note, e.g. "compilation: g++ -O3 sorter-55973318.cpp -o sorter-55973318". The description is a text file in which you should describe your solution. Please submit only correct and fast enough solutions by 30th of October 2016.