Processing a large number of files

When trying to process large numbers of files, there are several things you should do to improve performance.

  • Divide and conquer. Split the input into large chunks of data and spawn a program to process each chunk. On an SMP system you take advantage of the multiple CPUs, and each chunk can be serviced by an individual CPU. The xargs utility works well for that task. For example:

cat list_file | xargs -n number-of-args -P number-of-concurrent-procs some-program

This command line will parse the input (list_file) and pass the specified number of arguments to the program. The -P option will spawn the specified number of instances of the program, each executed with the specified number of arguments. Of course this expects that the program is capable of processing more than one argument; if the program can only accept a single argument, then you're bottlenecked on the program. A small C sketch of the same fan-out idea appears at the end of this item.

As an example, let's say you need to cat all the files in a given directory and redirect the output to a single file. Let's assume the directory has 120k files in it. A logical progression would be:

cd some-directory

cat * > ../saved_output

Looks good. But here's the catch. When you execute cat * you get this:

pm1:~/prog/work/datadir# cat *

-bash: /bin/cat: Argument list too long

You have exceeded the kernel's limit on the total length of an argument list (ARG_MAX), so the shell cannot even start cat with that many file names. Here's where xargs comes to the rescue:

cd some-directory

ls | xargs -n 32 -P 8 cat >> ../saved_output

The xargs utility will take the output of ls, break it into chunks of 32 arguments, and spawn up to 8 concurrent cat processes. Notice that I used the append operator (>>) instead of just a plain redirect. This is necessary because we're running 8 concurrent cat processes, and cat will be respawned as long as there is data coming from ls. The append ensures that each invocation of cat writes its data to the end of the output file, instead of rewriting the output file from the beginning.
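If you would rather build this fan-out into a program of your own instead of relying on xargs, the same idea can be expressed directly in C with fork() and execvp(). The sketch below is only illustrative and uses assumptions that are not part of the example above: it hard-codes the chunk size and the concurrency limit, and it runs a hypothetical worker program named process_chunk on each chunk of its command-line arguments.

// fanout.c - minimal sketch of xargs-style fan-out: split the argument list
// into chunks and run a worker program on each chunk, keeping a bounded
// number of child processes running at once. The worker name "process_chunk",
// CHUNK and MAXPROC are illustrative assumptions, not part of the article.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

#define CHUNK   32   /* arguments per worker invocation (like xargs -n 32) */
#define MAXPROC  8   /* concurrent workers (like xargs -P 8) */

int main(int argc, char *argv[]) {
    int i, running = 0;

    // argv[1..argc-1] is the full list of file names to process.
    for (i = 1; i < argc; i += CHUNK) {
        int n = argc - i;
        if (n > CHUNK)
            n = CHUNK;

        // Build the worker's argument vector: program name, then one chunk.
        char **wargv = malloc((n + 2) * sizeof(char *));
        if (wargv == NULL) {
            perror("malloc");
            exit(1);
        }
        wargv[0] = "process_chunk";          /* hypothetical worker program */
        memcpy(&wargv[1], &argv[i], n * sizeof(char *));
        wargv[n + 1] = NULL;

        // If we already have MAXPROC children running, wait for one to finish.
        if (running >= MAXPROC) {
            wait(NULL);
            running--;
        }

        pid_t pid = fork();
        if (pid < 0) {
            perror("fork");
            exit(1);
        }
        if (pid == 0) {
            execvp("process_chunk", wargv);  /* child: run the worker */
            perror("execvp");
            _exit(127);
        }
        running++;
        free(wargv);
    }

    // Reap the remaining children.
    while (running-- > 0)
        wait(NULL);

    return 0;
}

Note that a program invoked as ./fanout * is still subject to the same argument-list limit described above, so a production version would read the file names from stdin, exactly as xargs does.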

  • Use script for simple, short tasks or prototyping. Use C for heavyweight, high-performance processing. Rather than running a while read loop in shell script, it is much faster to move that logic into a C program. Given our example above with a directory containing 120k files, here is a simple C program that will read all the files in the directory:

// Name: doit3.c
// This is a program to demonstrate the process of reading and processing
// thousands of files in a given directory. The structure of each of the
// data files is:
//
//     file-number value1 value2 value3 value4
//
// Each value is a decimal.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <dirent.h>
#include <unistd.h>

struct dirent *dptr;

int main(int argc, char *argv[]) {
    int a, b, c, d, e;
    DIR *dirp;
    FILE *fpin, *fpout;

    // open the directory with all the data files in it.
    if ((dirp = opendir("./datadir")) == NULL) {
        fprintf(stderr, "cannot open directory ./datadir\n");
        exit(1);
    }

    // open our output file.
    if ((fpout = fopen("./data_out", "w")) == NULL) {
        fprintf(stderr, "cannot open ./data_out\n");
        exit(1);
    }

    // change into the data directory
    if (chdir("./datadir") < 0) {
        fprintf(stderr, "cannot change to ./datadir\n");
        exit(1);
    }

    // loop to read the filenames in the data directory
    while ((dptr = readdir(dirp)) != NULL) {

        // we do not want to process the "." and ".." entries
        if (strcmp(".", dptr->d_name) == 0 || strcmp("..", dptr->d_name) == 0) {
            continue;
        }

        // open the data file; skip any entry we cannot read
        if ((fpin = fopen(dptr->d_name, "r")) == NULL) {
            fprintf(stderr, "cannot open %s\n", dptr->d_name);
            continue;
        }

        // read the contents of the data file
        if (fscanf(fpin, "%d %d %d %d %d", &a, &b, &c, &d, &e) == 5) {
            // write the file number and the sum of its values to the output file
            fprintf(fpout, "file: %d data: %d\n", a, b + c + d + e);
        }

        // close our input file
        fclose(fpin);
    }

    // close our directory handle
    closedir(dirp);

    // close our output file
    fclose(fpout);

    // all done.
    exit(0);
}
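The article does not show how the program was built; a plain gcc -o doit3 doit3.c is all it needs. It expects to be started from the directory that contains the datadir subdirectory, and it writes its results to ./data_out.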

  • Use the Cache. Since we're accessing so many files, it is to our advantage if those files are cached before we run our program. The way you cache a file is to access it. By having the file in cache, the fopen() call in our program returns much faster, as there's no disk access necessary. The following simple script will cache all the files in the given directory:

cd ./datadir

ls | xargs -n 32 -P 8 cat > /dev/null

It will take several seconds for the above command to complete. Now when you run the C program, it will execute at lightning speed.
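The load_cache program timed in the performance figures below is not listed in the article; the one-liner above does the same job from the shell. Purely as an illustration, a minimal C sketch of the same idea, reading every file in ./datadir and throwing the data away so that it merely ends up in the page cache, could look like this (the buffer size and the hard-coded ./datadir path are assumptions):

// warm_cache.c - sketch: read every entry in ./datadir and discard the data,
// so the kernel's page cache is populated before the real run.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <dirent.h>
#include <unistd.h>

int main(void) {
    DIR *dirp;
    struct dirent *dptr;
    FILE *fp;
    char buf[65536];

    if ((dirp = opendir("./datadir")) == NULL) {
        fprintf(stderr, "cannot open directory ./datadir\n");
        exit(1);
    }
    if (chdir("./datadir") < 0) {
        fprintf(stderr, "cannot change to ./datadir\n");
        exit(1);
    }

    while ((dptr = readdir(dirp)) != NULL) {
        if (strcmp(".", dptr->d_name) == 0 || strcmp("..", dptr->d_name) == 0)
            continue;
        if ((fp = fopen(dptr->d_name, "r")) == NULL)
            continue;                       /* skip anything we cannot read */
        // read to EOF; the data itself is not used, only the side effect
        // of pulling the file into the page cache matters.
        while (fread(buf, 1, sizeof(buf), fp) > 0)
            ;
        fclose(fp);
    }

    closedir(dirp);
    return 0;
}

Unlike the xargs one-liner, this sketch reads the files one at a time rather than eight at once.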

  • Performance. The program listed above is called doit3.c. The execution environment is a paravirtualized Xen guest running Debian 4.0 with 512 MB of RAM. The host system is an eMachines box with a 1.8 GHz Sempron processor and 2 GB of RAM, running openSUSE 11.0. When executed without pre-loading the cache, the timings are:

pm1:~/prog/work# time ./doit3

real 18m26.528s

user 0m0.148s

sys 0m0.668s

pm1:~/prog/work#

The total execution time was 18 minutes, without pre-loading the cache.

When pre-loading the cache is performed, the timings are as follows:

pm1:~/prog/work# time ./load_cache

real 0m33.183s

user 0m1.796s

sys 0m6.640s

pm1:~/prog/work# time ./doit3

real 0m13.715s

user 0m0.484s

sys 0m5.988s

pm1:~/prog/work#

You can see that the total execution time for both the cache load and the application is just under 47 seconds, a big difference from the 18-minute execution time without the cache pre-loading.

The only caveat when pre-loading the cache is the consumption of memory, as indicated below:

pm1:~/prog/work# free
             total       used       free     shared    buffers     cached
Mem:        524436     519456       4980          0      17012     365276
-/+ buffers/cache:     137168     387268
Swap:       131064         20     131044
pm1:~/prog/work#

You see an increase in used memory, most of it in the cached column, as well as a small amount of swap usage.