Processing a large number of files
When processing a large number of files, there are several things you can do to improve performance.
Divide and conquer. Split the input into large chunks of data and spawn a program to process each chunk. On an SMP system you take advantage of the multiple CPUs, and each chunk can be serviced by an individual CPU. The xargs utility works well for this task. For example:
cat list_file | xargs -n number-of-args -P number-of-concurrent-procs some-program
This command line parses the input (list_file) and passes the specified number of arguments to the program. The -P option spawns the specified number of concurrent instances of the program, each executed with the specified number of arguments. Of course, this assumes the program can process more than one argument; if it can only accept a single argument, the program itself becomes the bottleneck.
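To see how -n and -P interact, here is a hypothetical toy run (the numbers are made up for illustration): ten arguments are handed to echo, three at a time, with up to two echo processes running concurrently, so echo is invoked four times in total.

```shell
# Hand ten arguments to echo, 3 per invocation, up to 2 at once.
# With -P greater than 1, the order of the output lines is not guaranteed.
seq 1 10 | xargs -n 3 -P 2 echo
```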
As an example, let's say you need to cat all the files in a given directory and redirect the output to a single file. Let's assume the directory has 120k files in it. A logical progression would be:
cd some-directory
cat * > ../saved_output
Looks good. But here's the catch. When you execute cat * you get this:
pm1:~/prog/work/datadir# cat *
-bash: /bin/cat: Argument list too long
You have not exceeded a limit in cat itself; the shell expanded * into a longer argument list than the kernel allows for a single command (the ARG_MAX limit). Here's where xargs comes to the rescue:
cd some-directory
ls | xargs -n 32 -P 8 cat >> ../saved_output
The xargs utility takes the output of ls, breaks it into chunks of 32 arguments, and spawns up to 8 concurrent cat processes. Notice that I used the append operator (>>) instead of just a plain redirect. This is necessary because we're running 8 concurrent cat processes, AND cat will be respawned as long as there is data coming from ls. The append ensures that each invocation of cat writes its data to the end of the output file instead of truncating the file and writing from the beginning. Note that with concurrent processes, the order of the files in the output is not guaranteed.
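The pattern can be exercised end to end on a small scale. This is a hypothetical demonstration (the names demo_dir and demo_out are made up for illustration): six one-line files are created, then concatenated two filenames per cat with up to two concurrent cats, so the output file should contain all six lines, in no guaranteed order.

```shell
# Build a small demo directory with six one-line files.
rm -f demo_out
mkdir -p demo_dir
for i in 1 2 3 4 5 6; do
    echo "line from file $i" > "demo_dir/f$i"
done
cd demo_dir
# 2 filenames per cat, up to 2 concurrent cats, appending to the output.
ls | xargs -n 2 -P 2 cat >> ../demo_out
cd ..
wc -l < demo_out
```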
Use scripts for simple short tasks or prototyping. Use C for heavyweight, high-performance processing. Rather than running a while read loop in shell code, it is much faster to move that logic into a C program. Given our example above of a directory with 120k files, here is a simple C program that reads all the files in the directory:
// Name: doit3.c
// This is a program to demonstrate the process of reading and processing
// thousands of files in a given directory. The structure of each of the
// data files is:
//
// file-number value1 value2 value3 value4
//
// Each value is a decimal.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <dirent.h>
#include <unistd.h>
#include <errno.h>

int main(int argc, char *argv[]) {
    int a, b, c, d, e;
    DIR *dirp;
    struct dirent *dptr;
    FILE *fpin, *fpout;

    // open the directory with all the data files in it.
    if ((dirp = opendir("./datadir")) == NULL) {
        fprintf(stderr, "cannot open directory ./datadir: %s\n", strerror(errno));
        exit(1);
    }
    // open our output file.
    if ((fpout = fopen("./data_out", "w")) == NULL) {
        fprintf(stderr, "cannot open ./data_out: %s\n", strerror(errno));
        exit(1);
    }
    // change into the data directory
    if (chdir("./datadir") < 0) {
        fprintf(stderr, "cannot change to ./datadir: %s\n", strerror(errno));
        exit(1);
    }
    // loop to read the filenames in the data directory
    while ((dptr = readdir(dirp)) != NULL) {
        // we do not want to process the "." and ".." entries
        if (strcmp(".", dptr->d_name) == 0 || strcmp("..", dptr->d_name) == 0) {
            continue;
        }
        // open the data file
        if ((fpin = fopen(dptr->d_name, "r")) == NULL) {
            fprintf(stderr, "cannot open %s: %s\n", dptr->d_name, strerror(errno));
            continue;
        }
        // read the contents of the data file; skip files that do not
        // match the expected five-integer layout.
        if (fscanf(fpin, "%d %d %d %d %d", &a, &b, &c, &d, &e) == 5) {
            // write to the output file
            fprintf(fpout, "file: %d data: %d\n", a, b + c + d + e);
        }
        // close our input file
        fclose(fpin);
    }
    // close our directory handle
    closedir(dirp);
    // close our output file
    fclose(fpout);
    // all done.
    exit(0);
}
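To try the program out without 120k real files, a small data set in the expected layout can be generated first. This is a hypothetical helper (the file names and the count of 100 are made up for illustration); the commented build line is a suggestion, not part of the original article.

```shell
# Populate ./datadir with files in the expected
# "file-number value1 value2 value3 value4" layout.
mkdir -p datadir
for i in $(seq 1 100); do
    echo "$i 1 2 3 4" > "datadir/file$i"
done
# Then build and run the program, e.g.:
#   cc -O2 -o doit3 doit3.c && ./doit3
#   head -3 data_out
```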
Use the cache. Since we're accessing so many files, it is to our advantage if those files are cached before we run our program. The way you cache a file is to access it. With the file in cache, the fopen() call and subsequent reads in our program return much faster, as there's no disk access necessary. The following simple script will cache all the files in the given directory:
cd ./datadir
ls | xargs -n 32 -P 8 cat > /dev/null
It will take several seconds for the above command to complete. Now when you run the C program, it will execute at lightning speed.
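The preload command above can be wrapped into a small reusable function. This is a hypothetical sketch (the name warm_cache is made up for illustration); it uses find -print0 with xargs -0, which also handles file names containing spaces, unlike the bare ls pipeline.

```shell
# warm_cache: read every regular file under the given directory,
# pulling its contents into the page cache.
warm_cache() {
    find "$1" -type f -print0 | xargs -0 -n 32 -P 8 cat > /dev/null
}

# Example: create a small directory and warm it.
mkdir -p demo_cache
printf '1 2 3 4 5\n' > demo_cache/f1
warm_cache demo_cache && echo "cache warmed"
```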
Performance. The above program is called doit3.c. The execution environment is a para-virtualized Xen guest running Debian 4.0 with 512MB of RAM. The host system is an eMachines box with a 1.8GHz Sempron processor and 2GB of RAM, running openSUSE 11.0. When executed without pre-loading the cache, the timings are:
pm1:~/prog/work# time ./doit3
real 18m26.528s
user 0m0.148s
sys 0m0.668s
pm1:~/prog/work#
The total execution time was 18 minutes, without pre-loading the cache.
When pre-loading the cache is performed, the timings are as follows:
pm1:~/prog/work# time ./load_cache
real 0m33.183s
user 0m1.796s
sys 0m6.640s
pm1:~/prog/work# time ./doit3
real 0m13.715s
user 0m0.484s
sys 0m5.988s
pm1:~/prog/work#
You can see that the total execution time for both the cache load and the application is around 47 seconds, a big difference from the 18-minute execution time without cache pre-loading.
The only caveat when pre-loading the cache is the consumption of memory, as indicated below:
pm1:~/prog/work# free
             total       used       free     shared    buffers     cached
Mem:        524436     519456       4980          0      17012     365276
-/+ buffers/cache:     137168     387268
Swap:       131064         20     131044
pm1:~/prog/work#
You see an increase in used memory as well as some swap space usage.