Map/Reduce Monte-Carlo

April 2020 -- archived, ... now superseded by

https://github.com/patrickbwarren/map-reduce-monte-carlo

This page describes some ideas around using the Map/Reduce paradigm for Monte-Carlo (MC) simulations. The basic concept behind MC is to calculate an average property by sampling a large set of N >> 1 states of some system. For example the system might be a box containing particles, and the property of interest might be the pressure, which can be calculated from the particle positions using the virial theorem. A little thought shows that this is a natural candidate for Map/Reduce: the mapper generates pressure measurements for n >= 1 configurations, and the reducer does the statistical averaging. To solve the original problem, N/n instances of the mapper are launched. And the rest is detail... Actually, for my own calculations I do not use a proper Map/Reduce harness (like Apache Hadoop), but the principle is the same.

"In Hartree's opinion [ca 1951] all the calculations that would ever be needed [in the UK] could be done on three digital computers." -- B. V. Bowden

At some point we've all realised that Monte-Carlo (MC) simulations which depend on statistical sampling can be parallelised trivially, but efficiently, by generating samples independently. This is sometimes referred to as high-throughput parallelisation. The approach mainly just requires careful bookkeeping since there is no interprocess communication to worry about. It is ideally suited to loosely-coupled clusters, such as those managed by the Condor high-throughput computing environment. With a little organisation (described in more detail below) one can launch such a parallel job with a single Condor script, and have the results combined automatically afterwards.

Such was more-or-less the state of play with one of my kinetic Monte-Carlo codes in the fall of 2009. I had been running on 5-10 Condor nodes, with each job lasting half an hour or so. At some point I noticed that a large proportion of the nodes on the Condor cluster had become free. So I launched the next job on 100 nodes, with a wall-clock turnaround time of 1-2 minutes. I figured that I would be out of the way before anyone noticed -- and indeed no-one did. I ran the next job with a wall-clock turnaround time of about 30 seconds, and so on. Of course at some point (after lunch!) other people's jobs started appearing in the queue again, and being polite I stopped. But the period where I had the cluster to myself to use in this unconventional way was a revelation. It changed my way of working, and the way I think about high-performance computing. That's the main purpose of this web page -- to convince you that this is a different but valid way of doing things.

If you think about it, this really is very unconventional. Aside from the fact that high-throughput parallelisation is usually a bit frowned upon, since it's not 'real' parallel computing, batch jobs are usually expected to run for hours, not fractions of a minute. Indeed compute queues are often set up this way. The high priests of HPC would have a fit with my double abuse of the system! But consider: a job which runs for 30 seconds on each of a couple of hundred nodes is equivalent to burning a couple of hours of processor time. This is often enough for accurate statistical sampling in a Monte-Carlo code, where typically you might need 10^4 samples for 1% accuracy, so each node only needs to generate a few tens of samples. The difference is that now we have the answer in 30 seconds, rather than 2 hours. And what a difference this makes! It means I can just play around if I want, changing parameter settings for instance, since I'm getting feedback almost in real time, and I don't care if I screw things up a few times.

You might of course complain that the overhead in submitting a Condor job eats into the speed-up. This is certainly true, and it means that there's not much point in submitting a job which lasts only a few seconds. Another thing you could complain about is that the initialisation overhead in the simulation is now quite important. But although the initialisation and scheduling may account for a significant fraction of the elapsed time, like so much else in this alternate universe of 30-second jobs, I don't care. Efficiency is not the end-game here, and this realisation is all part of the paradigm shift.

In trying to explain all this to my work colleagues I was casting around for analogies, and the best one I hit upon was web search -- running Monte-Carlo simulations with the immediacy of Google. I soon realised this was more than just an analogy, and what I was playing around with was a kind of poor-man's Map/Reduce paradigm for scientific computing. So that's the other purpose of this web page -- to share some tricks I figured out on the way.

"Any sufficiently advanced perl script is virtually indistinguishable from magic." -- with apologies to Arthur C. Clarke

For a long time I wrote monolithic simulation codes, which grew to have extravagant input file parsers. Then I realised what I needed was a scriptable interface to the underlying simulation engine, so that the work could be split between a kernel number-crunching effort, written in a low-level language like C, and the peripheral but necessary bookkeeping activities, which could be handled by a high-level scripting language. Thus I came across SWIG, and not for the first time changed my way of working. (Later I came across f2py, which does a similar job for FORTRAN.) My scripting language of choice at the time was, and still is, perl, though I wish I had started instead with python.

My current way of working is as follows. The core simulation engine is a library of C functions which take on the job of initialising storage, making Monte-Carlo sweeps, taking measurements, etc. The central feature is a C struct which holds all the simulation details. When passed through SWIG this becomes a perl object. A perl script can instantiate the object, eg with $sim = MonteCarlo::new_system(...) where 'new_system' is a C function in my library that takes care of memory allocation. Parameter initialisation looks like, eg, $sim->{Lx} = 10.0; where 'Lx' is a member of the struct. Function calls look like, eg, $sim->mc_sweep(...); where 'mc_sweep' is a C function which makes a Monte-Carlo sweep (in such a function the first parameter is a pointer to the simulation struct). Measurements look like, eg, $press = $sim->virial_pressure; where 'virial_pressure' is a C function which measures the pressure and returns the result as a double. The perl script itself could be controlled by an input file, but I prefer to use Getopt::Long to specify parameters and simulation control variables. Usually I just have one (monolithic!) C library, but there may be a variety of perl scripts, with fine control provided by command-line options.
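To make this concrete, here is a minimal sketch of such a driver script. The module name and the functions new_system, mc_sweep and virial_pressure follow the description above; the argument lists, the init_rng call and the command-line options are illustrative assumptions rather than my actual code.

  #!/usr/bin/perl
  # Minimal driver sketch: the module and the functions new_system, mc_sweep
  # and virial_pressure follow the text; the init_rng call and the
  # command-line options shown here are illustrative assumptions.
  use strict;
  use warnings;
  use Getopt::Long;
  use MonteCarlo;    # .pm and .so files generated by SWIG from the C library

  my ($lx, $nsweeps, $seed) = (10.0, 1000, 12345);   # defaults
  GetOptions("lx=f" => \$lx, "nsweeps=i" => \$nsweeps, "seed=i" => \$seed);

  my $sim = MonteCarlo::new_system();   # C function: allocates the simulation struct
  $sim->{Lx} = $lx;                     # struct members appear as object fields
  $sim->init_rng($seed);                # hypothetical C function which seeds the RNG

  for my $i (1 .. $nsweeps) {
      $sim->mc_sweep();                      # one Monte-Carlo sweep, done in C
      my $press = $sim->virial_pressure();   # instantaneous pressure, done in C
      printf "%d %g\n", $i, $press;          # one measurement per line of output
  }

The measurements go to standard output here; in the Condor setting described below, each process would instead write them to its own uniquely-labelled file.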

"Use a DAG, man!"

This leads me to Condor. Let's return to the fundamental idea behind high-throughput parallelisation: configurations can be generated in parallel, and statistical averaging can be done afterwards. It's a matter of choice whether you save an entire stack of configurations for post-processing, or perform some processing on the fly to cut down file sizes. I usually go for the latter since for most of my problems it's relatively inexpensive to generate configurations. For instance if I'm interested in the pressure, I calculate (on the fly, in parallel) the instantaneous pressure for each configuration, and write out its value to a file. Then the post-processing is a simple statistical average which gives the result I am interested in (for the purposes of error estimation I find that block-averaging works well). A slightly more complicated example is the structure factor. Here I would compute the instantaneous structure factor for each configuration, and write out the [k, S(k)] pairs to a file. Then the post-processing consists of statistical averaging using the k value as a label (I have a general purpose perl script which does this for me). If I am generating huge numbers of configurations (eg > 10^6) then I may well also do the block averaging on the fly. The key to this approach is that it meshes quite nicely with Condor. By having each job place its data in a separate file, the files can be combined afterwards, usually by simple file concatenation, and post-processed into the answers I'm interested in.
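As an illustration, here is a minimal sketch of label-based averaging in the spirit of the general purpose perl script just mentioned (the real script is not reproduced here). It reads concatenated files of whitespace-separated [k, S(k)] pairs and reports a naive standard error per k value, rather than the block average I would normally prefer.

  #!/usr/bin/perl
  # Sketch of label-based averaging: input is concatenated "k S(k)" pairs,
  # output is k, the mean of S(k), and a naive standard error, for each k.
  use strict;
  use warnings;

  my (%sum, %sumsq, %count);
  while (<>) {
      next if /^\s*(#|$)/;        # skip comments and blank lines
      my ($k, $s) = split;
      $sum{$k} += $s;  $sumsq{$k} += $s * $s;  $count{$k}++;
  }
  for my $k (sort { $a <=> $b } keys %sum) {
      my $n = $count{$k};
      my $mean = $sum{$k} / $n;
      my $var = $n > 1 ? ($sumsq{$k} / $n - $mean**2) * $n / ($n - 1) : 0;
      $var = 0 if $var < 0;       # guard against round-off
      printf "%g %g %g\n", $k, $mean, sqrt($var / $n);
  }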

With the scripting approach, submitting a Condor job amounts to asking it to run the corresponding perl script. The associated run-time files should be transferred (with transfer_input_files) -- these are the perl script itself, the perl module (.pm) and shared object (.so) files generated by SWIG, and any other associated input files such as a table of random number seeds as described below. The actual executable is perl itself, and the argument list starts off with the name of the perl script, followed by a list of command-line options. I make use of the ability of Condor to submit many copies of the same job (with the queue command). Each copy gets its own value of the Condor variable $(Process), which is passed on to the perl script as a command-line option. The perl script looks out for this option and uses its value to select a random number seed (see below) and to label the output files. Thus each process generates its own unique output files, which are then concatenated and post-processed into the result of the run. As a further twist I use the Condor meta-scheduler (DAGMan) to launch the (multiple) Condor jobs; after these have all completed, the DAGMan job runs a suitable post-processing clean-up script.
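Putting these pieces together, a submit description might look something like the sketch below. The file names (mc_run.pl, seeds.dat, and so on) and the command-line options are illustrative, not the actual files from my runs.

  # Sketch of a Condor submit description along the lines described above
  universe = vanilla
  executable = /usr/bin/perl
  arguments = mc_run.pl --process=$(Process) --seed-file=seeds.dat
  transfer_input_files = mc_run.pl, MonteCarlo.pm, MonteCarlo.so, seeds.dat
  should_transfer_files = YES
  when_to_transfer_output = ON_EXIT
  output = mc_run_$(Process).out
  error = mc_run_$(Process).err
  log = mc_run.log
  queue 100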

Now there are certainly several ways one might organise the above tasks. The way I have chosen is like an operating system fork -- the initial call to the perl script has a command-line parameter --condor=n which causes the script to generate the job submission files and clean-up script automatically, submit the Condor DAGMan job, then exit. The DAGMan job then launches the Condor job, with queue n, which runs the same perl script but without the --condor command-line parameter, so that this time the script takes the other branch and does the required calculations. Additional command-line parameters supplied in the first call of the perl script are incorporated into the Condor job description, and thus passed on to the 'forked' script. Writing things this way keeps everything self-contained in a single script.
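A minimal sketch of this fork-like pattern is given below, assuming the same illustrative file names as in the submit description above; the clean-up script and the option handling are likewise assumptions rather than my actual code.

  #!/usr/bin/perl
  # Sketch of the fork-like pattern: with --condor=n the script writes its own
  # submit and DAG files, submits the DAGMan job, and exits; without --condor
  # it falls through to the calculation itself.
  use strict;
  use warnings;
  use Getopt::Long;

  my @passthru = grep { !/^--condor/ } @ARGV;   # options forwarded to the jobs (assumes --condor=n form)
  my ($ncondor, $process) = (0, 0);
  GetOptions("condor=i" => \$ncondor, "process=i" => \$process);

  if ($ncondor) {    # 'parent' branch: write submit + DAG files, submit, exit
      open my $sub, '>', 'mc_run.sub' or die "cannot write submit file: $!";
      print $sub "universe = vanilla\n";
      print $sub "executable = /usr/bin/perl\n";
      print $sub "arguments = mc_run.pl --process=\$(Process) @passthru\n";
      print $sub "transfer_input_files = mc_run.pl, MonteCarlo.pm, MonteCarlo.so, seeds.dat\n";
      print $sub "queue $ncondor\n";
      close $sub;
      open my $dag, '>', 'mc_run.dag' or die "cannot write DAG file: $!";
      print $dag "JOB run mc_run.sub\n";            # the Condor job proper
      print $dag "SCRIPT POST run cleanup.sh\n";    # concatenate and post-process afterwards
      close $dag;
      system("condor_submit_dag mc_run.dag") == 0 or die "condor_submit_dag failed";
      exit 0;
  }

  # 'child' branch: carry on with the calculation, using --process to select
  # the RNG seed and to label the output files, as described in the text.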

"Random numbers should not be generated with a method chosen at random" -- Donald E. Knuth

Or to put it another way: where random numbers are concerned, nothing should be left to chance. Actually these days there is no excuse not to use a modern high-quality pseudo-random number generator (RNG), such as the general purpose algorithm recommended in Numerical Recipes 3rd edition (and be sure it is the third edition!), or the Mersenne Twister generously made freely available by Makoto Matsumoto (松本 眞) and Takuji Nishimura (西村 拓士). So that's solved, right? Not quite -- the issue we have to deal with is seeding the chosen RNG, since in the Map/Reduce paradigm we need perhaps hundreds of independent RNG seeds. Also, quite often we want to repeat a simulation run with exactly the same set of RNG seeds (for example, for regression testing). The way I have implemented this is to have a table of, say, 1000 8-digit RNG seeds as a flat text file. In the Condor run described above, the name of this seed file is passed to the perl script as a parameter, and each Condor job selects the RNG seed from the entry in the table corresponding to the value of the $(Process) variable, which is also passed as a parameter. It is the perl script which does the job of reading in the table of seeds and selecting the appropriate value; the C function that initialises the RNG is then called with this value (a sketch is given below). If I want to repeat the run I use the same seed file; if I want an independent run, I use a different seed file. Now all that remains is to source the RNG seeds themselves, and there are several options for building the seed files.
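Here is a minimal sketch of the seed selection step, assuming one seed per line in the seed file; the option names and the init_rng call are illustrative, consistent with the earlier driver-script sketch.

  #!/usr/bin/perl
  # Sketch of seed selection: read the flat text file of seeds (one per line)
  # and pick the entry indexed by the Condor $(Process) value.
  use strict;
  use warnings;
  use Getopt::Long;

  my ($seed_file, $process) = ('seeds.dat', 0);
  GetOptions("seed-file=s" => \$seed_file, "process=i" => \$process);

  open my $fh, '<', $seed_file or die "cannot open $seed_file: $!";
  chomp(my @seeds = <$fh>);
  close $fh;

  die "seed table too short\n" if $process > $#seeds;
  my $seed = $seeds[$process];

  # $sim->init_rng($seed);   # hypothetical C initialiser, as in the earlier sketch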

Oh, and by the way, once you've seeded your RNG, make sure you throw away the first 10^6 (or so) random numbers. This typically adds a minuscule amount of overhead, and it eliminates any relic correlations that might be present from initialising with seeds which are much smaller than the internal state of the generator. And finally, remember:

"Anyone who attempts to generate random numbers by deterministic means is, of course, living in a state of sin." -- John von Neumann