\begin{document}
\title[Using RngStreams with C++ and R] {Using RngStreams for Parallel Random Number Generation in C++ and R}
%\subtitle{}
%\titlerunning{Using RngStreams with C++ and R}
\author[Karl, Eubank, Milovanovic, Reiser, and Young]{Andrew T. Karl}
\address{Adsurgo LLC, Denver, CO}
\author{Randy Eubank and Jelena Milovanovic and Mark Reiser and Dennis Young
}
\address{Arizona State University, Tempe, AZ}
\begin{abstract}
The \texttt{RngStreams} software package provides one viable solution
to the problem of creating independent random number streams for
simulations in parallel processing environments. Techniques are
presented for effectively using \texttt{RngStreams} with \texttt{C++}
programs that are parallelized via \texttt{OpenMP} or \texttt{MPI}.
Ways to access the backbone generator from \texttt{RngStreams} in
\texttt{R} through the \texttt{parallel} and \texttt{rstream} packages
are also described. The ideas in the paper are illustrated with both a
simple running example and a Monte Carlo integration application.
\end{abstract}
\maketitle
\section*{NOTICE}
This is a preprint of an article appearing in \textit{Computational Statistics} (in press). The final publication is available at \url{http://dx.doi.org/10.1007/s00180-014-0492-3}
\section{Introduction}
\label{intro}
Simulation studies are in many respects the ideal application for
parallel processing. They are ``naturally parallel'' in the sense
that computations may generally be conducted without the need for
inter-processor communication or modification of the base algorithm.
Thus, near linear speedup can be realized in many cases that occur in
practice. However, in order for a division of labor across multiple
processors to be productive, the random number streams being used by
each processor must behave as if they are independent in a
probabilistic sense. Fortunately, methods now exist that, when used
properly, assure the requisite stream independence. We explore one
such approach in this paper that is available through the
\texttt{RngStreams} software package.
The effect of stream dependence is easy to demonstrate as seen in the
output from an \texttt{R} session exhibited below.
\lstset{language=R, basicstyle={\ttfamily\footnotesize}}
\begin{lstlisting}
> x <- NULL
> for (i in 1:200) {
+   set.seed(123)
+   .Random.seed[3:626] <- as.integer(.Random.seed[3:626] +
+                                     (i - 1) * 100)
+   zTemp <- rnorm(1000)
+   x <- c(x, sqrt(1000) * mean(zTemp))
+ }
> set.seed(123)
> y <- NULL
> for (i in 1:200) {
+   zTemp <- rnorm(1000)
+   y <- c(y, sqrt(1000) * mean(zTemp))
+ }
> cat(shapiro.test(x)$p.value, shapiro.test(y)$p.value, "\n")
0.0001438936 0.4466189
\end{lstlisting}
Following along the lines of \citet{hech}, we have generated sets of
1000 random numbers from the standard normal distribution using 200
different random number streams deriving from \texttt{R}'s default
Mersenne Twister (MT) random uniform generator. The streams are
constructed by using seeds (vectors of length 624) that are spaced 100
units apart in each component. The mean from each stream is then
rescaled to produce numbers that have ostensibly been sampled from a
standard normal distribution. However, the Shapiro-Wilk test firmly
rejects the normality hypothesis. In contrast, data from a similar
process that permits the normal evolution of the MT states easily pass
the Shapiro-Wilk evaluation. Although the relationship between
streams will likely be much more complex in situations that arise in
practice, the message remains the same: inter-stream dependence can
sabotage a parallel random number generation scheme.
\citet{hill} reviews available methods for parallel random number
generation, including random spacing (workers are initialized to
randomly spaced positions on the period of the same generator by
assigning different, randomly generated seeds), sequence splitting
(dividing a sequence into non-overlapping contiguous blocks), cycle
division (a more involved technique for sequence splitting where the
period of a generator is deterministically divided into segments) and
parametrization (parameters associated with a generator are varied to
produce different streams). The random spacing method carries a risk
of overlap between random streams on different processors with an
unlucky sampling of seeds, although the risk is small when using
generators with large periods.
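A rough bound helps to quantify the risk: if $s$ streams of length $L$ are launched from independent, uniformly chosen positions on a cycle of period $\rho$, then the probability that at least one pair of streams overlaps is at most about
\[
P(\text{overlap}) \le \frac{s(s-1)L}{\rho} \approx \frac{s^2L}{\rho},
\]
which is negligible whenever $\rho$ dwarfs $s^2L$.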
MT provides one example of a long-period generator with the period
$2^{19937}-1$ \citep{MT}. This feature along with its widespread
availability has made MT a popular choice for parallel generation via
random spacing. In spite of this, there are reasons that make the
development and use of other parallel generation methods
worthwhile. For example, it is generally advisable to repeat
simulations with different generators to ensure that the results are
not an artifact of the structure of a particular generator
\citep{gentle}. So, there is value in having other available generator
options for this purpose. In addition, a method such as cycle division
comes with a guarantee that the streams will not collide, whereas
random spacing only renders such events unlikely. \citet{lemieux}, for
example, advocates against random spacing and recommends using
generators that are guaranteed not to overlap. As a case in point, she
directs us to the work of \citet{failure} who show that many modern
random number generators exhibit correlations if their initial states
are chosen using another linear generator with a similar modulus.
Issues that arise when using linear congruential generators for
parallel random number generation have been well documented. Random
spacing and sequence splitting may result in random number streams
that exhibit undesirable dependence properties (e.g.,
\citet{matteis1}, \citet{matteis2}, \citet{anderson}, and
\citet{etacher}). Similar results have been quantified for some
nonlinear generators. For example, \citet{matteis3} show that long
range correlation is present in random number streams produced by
sequence splitting with any generator of the form $x_n = f(x_{n-1})
\bmod 2^m$ if the generator is such that the period halves when the
modulus halves.
Many of the problems with congruential generators can be avoided by
using combined multiple recursive generators (CMRGs) with cycle
division. A CMRG is roughly equivalent to a single multiple
recursive generator with a period that can be as large as the product
of the periods of the generators used in the combination
\citep{crganal}. When combined appropriately (e.g., as in
(\ref{eq:combform}) below), the lattice structure inherent to multiple
recursive generators is effectively blurred \citep{crglattice}. With a
good choice for the parameters (e.g., \citet{crgparameters}), these
combined generators have also fared well when subjected to statistical
tests as in \citet{testart}. As a case in point, the MRG32k3a
generator that provides the backbone generator for the
\texttt{RngStreams} package developed by \citet{rngstream} is known to
have good theoretical and statistical properties. It is cited by
\citet{lemieux}, along with MT, as a generator that can be ``safely
used''.
Of course, cycle division becomes an option only if it can be done
efficiently. For multiple recursive generators \citet{lecuyer} has
shown that there are computable matrices that can be used to advance
the state of the generator to any specified point in its associated
random number stream. This feature is what makes the combination of
generators easy to use in a cycle division paradigm. The
\texttt{RngStreams} package gives an implementation of this approach
in the context of its MRG32k3a generator. \citet{mtjump} have recently
developed a jump ahead method for MT. The number generation step is
two to three times faster for each stream than with
\texttt{RngStreams}; but, the sequence splitting itself is roughly
1000 times slower.
Another noteworthy package that provides functionality for parallel
random number generation is the \texttt{SPRNG} package of
\citet{sprng}. In contrast to \texttt{RngStreams}, it produces
different random number sequences for the processes via
parametrization of a single type of generator. For example, one of the
package's featured generators is a multiplicative lagged Fibonacci
generator whose states fall into $2^{1007}$ different equivalence
classes of generators (each of period $2^{81}$) that provide the
different streams.
In this paper we focus on the use of the \texttt{RngStreams}
package. There are several reasons for this. For example,
\citet{rngstream} in reference to the \texttt{SPRNG} package state
that it is ``not supported by the same theoretical analysis for the
quality and independence of the different streams'' as
\texttt{RngStreams}. There are also practical difficulties that arise
in using \texttt{SPRNG} due to its ties to \texttt{MPI} and a rather
complex implementation that necessitates the creation and linking of a
compiled library. In contrast, the \texttt{RngStreams} package is
quite compact with all its source code available from
\url{http://www.iro.umontreal.ca/~lecuyer/myftp/streams00/}.
The \texttt{C++} implementation of \texttt{RngStreams} provides an
object-oriented framework in which each generator object supplies a
method \texttt{RandU01} that produces uniform random
deviates. The overall result is that \texttt{RngStreams} is easy to
incorporate into \texttt{C/C++} programs using \texttt{OpenMP} or
\texttt{MPI} through simple include directives. This property extends
its applications from cluster down to desktop environments.
General treatments of parallel computing in \texttt{R} are provided by
\citet{schmid} and \citet{eugster}. In contrast to these papers, the
\texttt{R} segments of this article are directed toward the specific
task of parallel random number generation in the language. In this
regard, the \texttt{RngStreams} generators become available in
\texttt{R} through the \texttt{rstream} package \citep{rstream}. They
are also the default generator for the \texttt{parallel} package that
provides multithreading capabilities and now comes as part of the
\texttt{R} \texttt{base} software.
The goal of this article is to illustrate how easily
\texttt{RngStreams} can be employed in parallel settings in
\texttt{C++} with \texttt{OpenMP} and \texttt{MPI} and in \texttt{R}
\citep{R} through the \texttt{rstream} and \texttt{parallel}
packages. Our initial presentation involves only the most elementary
features of these APIs. However, it is our belief that these simple
programs contain fundamental techniques that can provide a foundation
for safe parallel random number generation in much more complex code
that might evolve in practice. An application of Monte Carlo
integration in the context of a latent trait model is used to support
this contention.
In the next section we describe the basic features of the
\texttt{RngStreams} generator. Then, in subsequent sections we address
its use in \texttt{C++} and \texttt{R}. These sections are connected
through a common example that allows us to illustrate both the
differences and similarities in using \texttt{RngStreams} under the
different APIs.
\section{RngStreams}
\label{sec:rngstreams}
The ``backbone'' generator for \texttt{RngStreams}, often referred to
as MRG32k3a, is a combination of the two multiple recursive generators
that produce the states
\begin{eqnarray}
\label{eq:rng1}
x_{1,n}&=&(1403580\times x_{1,n-2}-810728\times x_{1,n-3})\bmod (4294967087),\\
\label{eq:rng2}
x_{2,n}&=&(527612\times x_{2,n-1}-1370589\times x_{2,n-3})\bmod (4294944443)
\end{eqnarray}
at the $n$th step of the recursion given initial seeds
$\tilde{x}_{i,0} = (x_{i,-2},x_{i,-1},x_{i,0})^T, i = 1,2.$ The two
states are combined to produce the uniform random deviate $u_n$ via
the rule
\begin{eqnarray}
\label{eq:combform}
z_n&=&(x_{1,n}-x_{2,n}) \bmod (4294967087),\\
u_n&=&\left\{\begin{array}{l l}
z_n/4294967088,&\text{ if } z_n>0, \\
4294967087/4294967088,&\text{ if } z_n=0.
\end{array}
\right.
\nonumber
\end{eqnarray}
This particular generator is capable of producing a random number
stream of period length (roughly) $2^{191}$ under the condition that
$x_{1,0},x_{1,-1},x_{1,-2}$ are not all 0 and are all less than
4294967087 and $x_{2,0},x_{2,-1},x_{2,-2}$ are all less than
4294944443 and not all 0. For use in a parallel context this long
cycle is divided into $2^{64}$ non-overlapping streams each of length
$2^{127}$.
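To make these formulas concrete, the short program below (a minimal sketch of ours, not the optimized implementation found in the package) applies (\ref{eq:rng1})-(\ref{eq:rng2}) and the combination rule (\ref{eq:combform}) a single time, starting from the seed that will be used throughout the rest of the paper. Every intermediate product is smaller than $2^{53}$, so plain \texttt{double} arithmetic (and hence \texttt{std::fmod}) is exact here, and the program should print the first deviate, 0.166374, that reappears in the sections that follow.
\lstset{language=C++,basicstyle={\ttfamily\footnotesize},showstringspaces=false}
\begin{lstlisting}
//one step of MRG32k3a computed directly from the recursions
#include <cmath>
#include <iostream>
int main(){
  const double m1 = 4294967087., m2 = 4294944443.;
  //seed components (x_{1,-2}, x_{1,-1}, x_{1,0}) and
  //(x_{2,-2}, x_{2,-1}, x_{2,0})
  double x1[3] = {1806547166., 3311292359., 643431772.};
  double x2[3] = {1162448557., 3335719306., 4161054083.};
  //advance each component generator one step
  double p1 = std::fmod(1403580.*x1[1] - 810728.*x1[0], m1);
  if(p1 < 0) p1 += m1;
  double p2 = std::fmod(527612.*x2[2] - 1370589.*x2[0], m2);
  if(p2 < 0) p2 += m2;
  //combine the two states into a uniform deviate
  double z = std::fmod(p1 - p2, m1);
  if(z < 0) z += m1;
  std::cout << ((z > 0) ? z/(m1 + 1.) : m1/(m1 + 1.))
            << std::endl; //prints 0.166374
  return 0;
}
\end{lstlisting}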
The key to accessing different streams produced by
(\ref{eq:rng1})-(\ref{eq:rng2}) is the technique described in
\citet{lecuyer} that allows movement between streams through linear
transformations of the initial seeds using known
matrices. Specifically, if $\tilde{x}_{1,0}, \tilde{x}_{2,0}$ are the
initial seeds/states for the two generators, at the $n$th step of the
recursion the state for generator $i$ is $\tilde{x}_{i,n} = (A_i^n
\bmod m_i)\tilde{x}_{i,0}\bmod m_i, i = 1, 2,$ for known $3 \times 3$
matrices $A_1, A_2$ with $m_1 = 4294967087, m_2 = 4294944443$. The
powers of the matrices $A_1, A_2$ can be computed explicitly and, in
particular, the matrices $A_1^{127}, A_2^{127}$ that are needed for
movement between the different streams of the generator have been
calculated and are contained in an anonymous namespace in the
\texttt{RngStream.cpp} file. This feature is used in the \texttt{MPI}
section of our paper.
The initial state of the package is set using the function
\texttt{SetPackageSeed} that has prototype
\lstset{language=C++,basicstyle={\ttfamily\footnotesize},showstringspaces=false}
\begin{lstlisting}
static bool SetPackageSeed (const unsigned long seed[6])
\end{lstlisting}
with \texttt{seed} containing the six initial seeds for the generator
that must be supplied by the user. The first \texttt{RngStream} object
that is created will use this initial seed. Subsequent objects will
then be started on the next stream; i.e., they will have initial
``seeds'' that have been advanced $2^{127}$ states from the initial state of
the previous object's generator. Our aim in the sequel is to
perform some simple experiments that will illustrate how the seeds are
changed for the different processes that are working concurrently in a
parallel computing environment.
\section{OpenMP}
\label{sec:openmp}
\texttt{OpenMP} represents the standard API for shared memory parallel
processing. An overview of the various \texttt{OpenMP} functions is
provided, e.g., in \citet{chapman}. Here we will use only the API
features that allow us to parallelize a block of code.
The listing below illustrates the use of \texttt{RngStreams} in an
\texttt{OpenMP} program.
\lstset{language=C++,basicstyle={\ttfamily\footnotesize},
showstringspaces=false,numbers=none}
\begin{lstlisting}[frame=lines,caption={ranOpenMP.cpp}, label=ranopenmp]
//ranOpenMP.cpp
#include <omp.h>
#include "RngStream.h"
#include <iostream>
int main(){
  int nP = omp_get_num_procs();
  omp_set_num_threads(nP);//set number of threads
  unsigned long seed[6] = {1806547166, 3311292359,
                           643431772, 1162448557,
                           3335719306, 4161054083};
  RngStream::SetPackageSeed (seed);
  RngStream RngArray[nP];//array of RngStream objects
  int myRank;
#pragma omp parallel private(myRank)
  {
    myRank = omp_get_thread_num();
#pragma omp critical
    {
      std::cout << RngArray[myRank].RandU01() << " ";
    }
  }
  std::cout << std::endl;
  return 0;
}
\end{lstlisting}
The first step is to include the \texttt{OpenMP} header file
\texttt{omp.h} via an include directive. The number of processors
\texttt{nP} is determined and used to set the number of threads (with
\texttt{omp\_get\_num\_procs} and \texttt{omp\_set\_num\_threads},
respectively). The \texttt{SetPackageSeed} function is then used to
set the seeds for the \texttt{RngStreams} package. This particular
choice of seeds was made to provide results that can be compared to
those obtained from \texttt{R} in Section \ref{sec:r}.
The desired random deviates are produced with an array of
\texttt{RngStream} objects: one object for each thread. The
\texttt{RngStream} class default constructor is designed so that the
first object will have the seed designated with
\texttt{SetPackageSeed} while subsequent objects will have initial
states/seeds that have been advanced $2^{127}$ states from that of their
predecessor.
An alternative approach that one might try here is to simply create a
unique \texttt{RngStream} object for each thread with a
\texttt{private} specification. In such instances the default
constructor will also be called. However, we have experienced cases
where a race condition can lead to threads with duplicate seeds. The
array approach ensures that all objects are created correctly while
effectively exploiting the shared memory feature that allows all
threads access to the array elements.
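For concreteness, the risky pattern takes roughly the following shape (a sketch only; whether the race actually manifests depends on the compiler, runtime and thread scheduling).
\lstset{language=C++,basicstyle={\ttfamily\footnotesize},showstringspaces=false}
\begin{lstlisting}
//A sketch of the risky alternative: each thread default-constructs
//a private RngStream object inside the parallel region. Every such
//constructor reads and then advances the package-level "next
//stream" state, and these unsynchronized updates can race.
#pragma omp parallel
{
  RngStream myRng;//private: declared inside the region
#pragma omp critical
  {
    std::cout << myRng.RandU01() << " ";
  }
}
\end{lstlisting}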
A random deviate is generated from each of the \texttt{RngStream}
objects using the class method \texttt{RandU01} that invokes the
backbone generator. These values will serve as subsequent comparison
benchmarks. The actual generation of random numbers in the parallel
region is quite simple with each thread using the object in
\texttt{RngArray} that corresponds to its unique thread number
assigned by the \texttt{OpenMP} API. The resulting values are printed
out from within a \texttt{critical} region to avoid garbled output.
We compile and execute Listing \ref{ranopenmp} using the GNU compiler
suite on a four-processor machine in the following manner.
\lstset{language=csh,basicstyle=\ttfamily\footnotesize}
\begin{lstlisting}
$ g++ -fopenmp ranOpenMP.cpp RngStream.cpp -o ranOpenMP
$ ./ranOpenMP
0.341106 0.312399 0.166374 0.149433
\end{lstlisting}
\section{MPI}
\label{sec:mpi}
The \texttt{Message Passing Interface} or \texttt{MPI} provides a
number of functions that can be used for inter-processor
communication; see, e.g., \citet{quinn}. Our development here will use
only the simple communication paradigm in which the master and worker
processes transmit and receive messages via the \texttt{MPI}
\texttt{Send} and \texttt{Recv} functions.
As noted in Section \ref{sec:rngstreams}, the seeds/states of
\texttt{RngStream} objects are advanced by multiplication involving
known matrices that are available as part of the package. A function
that can be used to carry out these multiplications is
\texttt{MatVecModM}, which has the prototype
\lstset{language=C++,basicstyle={\ttfamily\footnotesize},
showstringspaces=false,numbers=none}
\begin{lstlisting}
void MatVecModM(const double A[3][3], const double s[3],
double v[3], double m)
\end{lstlisting}
This function will perform multiplication of a $3 \times 3$ array
\texttt{A} times a three-element array \texttt{s} modulo \texttt{m}
and return the result in the three-element array \texttt{v}. For our
purposes we will take \texttt{m} to be either $m_1 = 4294967087$ or
$m_2 = 4294944443$, $A$ to be either $A_1^{127}$ or $A_2^{127}$ and
\texttt{s} to be a seed vector that was allocated to a previous
process. The moduli and matrices (as well as \texttt{MatVecModM}) are
contained in an anonymous namespace within \texttt{RngStream.cpp} under
the names \texttt{m1}, \texttt{m2}, \texttt{A1p127} and
\texttt{A2p127}. Thus, the following function will manually advance
the input array \texttt{seedIn} $2^{127}$ states to \texttt{seedOut},
which occupies the same relative location in the next stream.
\lstset{language=C++,basicstyle={\ttfamily\footnotesize},
showstringspaces=false,numbers=none}
\begin{lstlisting}
static void AdvanceSeed(unsigned long seedIn[6],
                        unsigned long seedOut[6]){
  //RngStream stores generator states in double arrays
  double tempIn[6]; double tempOut[6];
  for(int i = 0; i < 6; i++) tempIn[i] = seedIn[i];
  //advance each component generator 2^127 states
  MatVecModM (A1p127, tempIn, tempOut, m1);
  MatVecModM (A2p127, &tempIn[3], &tempOut[3], m2);
  for(int i = 0; i < 6; i++)
    seedOut[i] = tempOut[i];
}
\end{lstlisting}
We wrapped this function with the necessary supporting code from
\texttt{RngStream.cpp} in a supplemental class \texttt{RngStreamSupp}
whose header file \texttt{RngStreamSupp.h} appears in the code below.
Our \texttt{MPI} implementation of the \texttt{RngStreams} methodology
now takes the following form.
\lstset{language=C++,basicstyle={\ttfamily\footnotesize},
showstringspaces=false,numbers=none}
\begin{lstlisting}[frame=lines,caption={ranMPI.cpp}, label=ranmpi]
//ranMPI.cpp
#include <iostream>
#include "mpi.h"
#include "RngStream.h"
#include "RngStreamSupp.h"
int main(){
  unsigned long seed[6] = {1806547166, 3311292359,
                           643431772, 1162448557,
                           3335719306, 4161054083};
  MPI::Init();//start MPI
  int myRank = MPI::COMM_WORLD.Get_rank();//get process rank
  int nP = MPI::COMM_WORLD.Get_size();//get number of processes
  if(myRank == 0){
    //generate deviate for master process
    RngStream::SetPackageSeed(seed);
    RngStream Rng;
    double U = Rng.RandU01();
    //now send seeds to the other processes
    unsigned long tempSeed[6];
    for(int i = 1; i < nP; i++){
      RngStreamSupp::AdvanceSeed(seed, tempSeed);
      MPI::COMM_WORLD.Send(tempSeed, 6, MPI::UNSIGNED_LONG,
                           i, 0);
      for(int j = 0; j < 6; j++)
        seed[j] = tempSeed[j];
    }
    std::cout << U << " ";
    //collect the other random deviates
    for(int i = 1; i < nP; i++){
      MPI::COMM_WORLD.Recv(&U, 1, MPI::DOUBLE, i, 0);
      std::cout << U << " ";
    }
    std::cout << std::endl;
  }
  else{
    unsigned long mySeed[6];
    //receive your seed if you are not the master process
    MPI::COMM_WORLD.Recv(mySeed, 6, MPI::UNSIGNED_LONG, 0, 0);
    RngStream::SetPackageSeed(mySeed);
    RngStream Rng;
    double U = Rng.RandU01();
    //send deviate to master process
    MPI::COMM_WORLD.Send(&U, 1, MPI::DOUBLE, 0, 0);
  }
  MPI::Finalize();//end MPI
  return 0;
}
\end{lstlisting}
The idea behind Listing \ref{ranmpi} is straightforward. The master
process uses an initial seed in conjunction with \texttt{AdvanceSeed}
to determine seeds that will produce ``independent'' random number
streams for the other processes. These seeds are communicated to the
processes using the \texttt{Send} and \texttt{Recv} functions. The
processes then use them to initialize their particular
\texttt{RngStream} objects and generate a random uniform value. The
communication process is then reversed and the worker processes send
their random numbers back to the master for printing.
A typical compile and run sequence for our \texttt{MPI} program might
look like
\lstset{language=csh,basicstyle={\ttfamily\footnotesize}}
\begin{lstlisting}
$ mpicxx ranMPI.cpp RngStream.cpp RngStreamSupp.cpp -o ranMPI
$ mpiexec -np 4 ranMPI
0.166374 0.341106 0.312399 0.149433
\end{lstlisting}
This output agrees with our prior results from the \texttt{OpenMP}
program.
An alternative approach that leads to the same result as in
\texttt{ranMPI.cpp} is to mimic the array formulation that was used in
\texttt{ranOpenMP.cpp}. This reduces the amount of communication
at the expense of more memory usage for the individual processes. We
illustrate this approach in Section \ref{sec:ltm}.
\section{RngStreams in R}
\label{sec:r}
The \texttt{RngStreams} backbone generator is now one of the random
number generator options in \texttt{R}. For example,
\lstset{language=R, basicstyle={\ttfamily\footnotesize}}
\begin{lstlisting}
> RNGkind("L'Ecuyer-CMRG")
> set.seed(123)
> runif(1)
[1] 0.1663742
\end{lstlisting}
sets the \texttt{R} uniform random number generator to MRG32k3a, sets
the session's seed and reproduces one of the random deviates from the
previous two sections.
At this point it is perhaps worthwhile to mention how we have aligned
the seeds that are being used in our \texttt{R} and \texttt{C++}
code. The \texttt{set.seed} function in \texttt{R} uses a single
integer to set the seed for the current uniform random number
generator. The result can then be accessed through the integer vector
\texttt{.Random.seed}. In terms of our example with the MRG32k3a
generator, the vector has seven elements: the first one indicates the
type of generator and the remaining six are the seeds. Thus,
\lstset{language=R, basicstyle={\ttfamily\footnotesize}}
\begin{lstlisting}
> set.seed(123)
> cat(.Random.seed[2:7], "\n")
1806547166 -983674937 643431772 1162448557 -959247990 -133913213
\end{lstlisting}
reveals \texttt{R}'s representation for the six seeds that are
required for MRG32k3a. At first the presence of negative integers
seems puzzling, since the \texttt{RngStreams} seeds are stored as
\texttt{unsigned long} values in the code for the package. In
\texttt{R}, however, the seeds are treated internally as 32-bit
unsigned integers but displayed as signed integers under a
two's-complement representation. Thus, one must add $2^{32}$ to any negative integers
that appear in \texttt{.Random.seed} to see the actual seeds that were
used with the generator. For example,
\lstset{language=R, basicstyle={\ttfamily\footnotesize}}
\begin{lstlisting}
> .Random.seed[3] + 2^32
[1] 3311292359
\end{lstlisting}
gives us the second seed that appeared in our \texttt{C++}
code. Although this may appear to be an esoteric detail, it is
essential for cross-language output comparison.
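As a quick check, the same correction can be applied to the whole vector at once; with the \texttt{set.seed(123)} call above still in effect, this should recover exactly the six seeds used in our \texttt{C++} programs.
\lstset{language=R, basicstyle={\ttfamily\footnotesize}}
\begin{lstlisting}
> seeds <- .Random.seed[2:7]
> seeds <- ifelse(seeds < 0, seeds + 2^32, seeds)
> cat(seeds, "\n")
1806547166 3311292359 643431772 1162448557 3335719306 4161054083
\end{lstlisting}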
Another way to access the MRG32k3a generator is through the
\texttt{rstream} package that allows us to create instances of a
derived class \texttt{rstream.mrg32k3a} for the backbone generator.
By default, subsequent \texttt{rstream.mrg32k3a} objects are
initialized to the next stream. In fact, \texttt{rstream} will return
an error if more than one generator requests a specific seed in the
same \texttt{R} session. To override this behavior one must specify the option
\texttt{force.seed = TRUE}. The following code initializes four
generators and uses them with the \texttt{rstream.sample} function to
each produce a uniform random deviate.
\lstset{language=R, basicstyle={\ttfamily\footnotesize}}
\begin{lstlisting}
> library(rstream)
> rngList <- c(new("rstream.mrg32k3a", seed = c(1806547166,
+                  3311292359, 643431772, 1162448557, 3335719306,
+                  4161054083), force.seed = TRUE),
+              replicate(3, new("rstream.mrg32k3a")))
> sapply(rngList, rstream.sample)
[1] 0.1663742 0.3411064 0.3123993 0.1494334
\end{lstlisting}
Note that we have again produced the four independent streams that
arose in the previous two sections.
The MRG32k3a generator is a built-in fixture of the \texttt{parallel}
package that offers essentially the same parallel computing
functionality as can be obtained separately from the \texttt{snow} \citep{snow} and
\texttt{multicore} \citep{multicore} packages. In addition, \texttt{parallel} is now
bundled within the \texttt{R} base distribution which suggests it will
be one of the key ingredients in the evolution of \texttt{R}'s future
parallel processing capabilities.
The function \texttt{makeCluster} from the \texttt{parallel} package
will create a (default, socket) cluster whose seed can then be set
with \texttt{clusterSetRNGStream}. The code below illustrates this process
using a cluster of size four.
\lstset{language=R, basicstyle={\ttfamily\footnotesize}}
\begin{lstlisting}
> library(parallel)
> cl <- makeCluster(4)
> clusterSetRNGStream(cl, 123)
> parSapply(cl, rep(1, 4), runif)
[1] 0.1663742 0.3411064 0.3123993 0.1494334
> stopCluster(cl)
\end{lstlisting}
The random numbers were generated using \texttt{parSapply}, which provides
a parallel version of \texttt{sapply}; there are parallel versions of
\texttt{lapply} and \texttt{apply} as well as several other related
options. Upon completion of our task, the cluster is terminated with
\texttt{stopCluster}. The output confirms that we are using the
MRG32k3a generator and that the seeds are being advanced correctly for
each node in the cluster.
The other way that \texttt{parallel} provides multithreading ability
(for non-Windows machines) is through \texttt{mclapply}, which
furnishes a parallelized version of \texttt{lapply} suited for a
shared memory environment. An attempt to replicate our cluster results
using this approach produces
\lstset{language=R, basicstyle={\ttfamily\footnotesize}}
\begin{lstlisting}
> library(parallel)
> RNGkind("L'Ecuyer-CMRG")
> set.seed(123)
> unlist(mclapply(rep(1,3),runif))
[1] 0.3411064 0.3123993 0.9712727
> runif(1)
[1] 0.1663742
\end{lstlisting}
which illustrates that the seeds for new streams are not being
advanced as expected. Note that the call \texttt{mclapply(rep(1, 3),
runif)} produces three calls to \texttt{runif} with \texttt{n = 1} and
collects the results in a list. This issue is easy enough to fix with
\lstset{language=R, basicstyle={\ttfamily\footnotesize}}
\begin{lstlisting}[frame=lines,caption=mclapply with built-in L'Ecuyer CMRG,label=q:parallel]
> library(parallel)
> RNGkind("L'Ecuyer-CMRG")
> set.seed(123)
> unlist(mclapply(rep(1, 3), runif, mc.cores = 3))
[1] 0.3411064 0.3123993 0.1494334
> runif(1)
[1] 0.1663742
\end{lstlisting}
It is important to specify the desired number of additional cores
(beyond the master process) via the \texttt{mc.cores} option. This
value is not automatically set to the length of the first argument of
\texttt{mclapply}. On Windows platforms, \texttt{mc.cores} must be set to 1 because \texttt{mclapply} relies on the Unix \texttt{fork} system call, which is not available on Windows.
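One way to avoid hard coding this value is to query the machine with the \texttt{detectCores} function from \texttt{parallel}; the snippet below is only a sketch, since \texttt{detectCores} reports the number of logical cores, which may exceed what a particular job should claim.
\lstset{language=R, basicstyle={\ttfamily\footnotesize}}
\begin{lstlisting}
> library(parallel)
> RNGkind("L'Ecuyer-CMRG")
> set.seed(123)
> nC <- detectCores() # logical core count; machine dependent
> u <- unlist(mclapply(rep(1, 3), runif,
+                      mc.cores = min(3, nC)))
\end{lstlisting}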
It is possible to gain more direct control over the behavior of the
random streams on each core using the \texttt{rstream} package. This
provides a valuable option when reproducibility across multiple
languages and platforms is desired. For example,
\lstset{language=R, basicstyle={\ttfamily\footnotesize}}
\begin{lstlisting}[frame=lines,caption=mclapply with rstream,label=q:rstream]
> library(rstream)
> library(parallel)
> rngList <- c(new("rstream.mrg32k3a",
+                  seed = c(1806547166, 3311292359, 643431772,
+                           1162448557, 3335719306, 4161054083),
+                  force.seed = TRUE),
+              replicate(3, new("rstream.mrg32k3a")))
> unlist(mclapply(rngList, rstream.sample, mc.cores = 4))
[1] 0.1663742 0.3411064 0.3123993 0.1494334
\end{lstlisting}
returns the results we had anticipated. On Windows platforms, where
\texttt{mc.cores} must be 1, the program in Listing
\ref{q:parallel} will produce numbers from a single stream from
\texttt{RngStreams}. By contrast, the code from Listing
\ref{q:rstream} will faithfully reproduce the results from a multicore
non-Windows platform during serial execution on a Windows machine.
A similar approach of explicitly creating the streams with
\texttt{rstream} may also be used with the cluster approach in order
to maximize transparency.
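One possible arrangement is sketched below; the only wrinkle is that \texttt{rstream} objects must be packed with the package's \texttt{rstream.packed} function before they are serialized to the workers and then unpacked on arrival.
\lstset{language=R, basicstyle={\ttfamily\footnotesize}}
\begin{lstlisting}
> library(parallel)
> library(rstream)
> rngList <- c(new("rstream.mrg32k3a",
+                  seed = c(1806547166, 3311292359, 643431772,
+                           1162448557, 3335719306, 4161054083),
+                  force.seed = TRUE),
+              replicate(3, new("rstream.mrg32k3a")))
> # pack the streams so that they survive serialization
> rngList <- lapply(rngList,
+                   function(r) {rstream.packed(r) <- TRUE; r})
> cl <- makeCluster(4)
> invisible(clusterEvalQ(cl, library(rstream)))
> res <- parLapply(cl, rngList, function(r) {
+     rstream.packed(r) <- FALSE # unpack on the worker
+     rstream.sample(r, 1)
+ })
> stopCluster(cl)
> unlist(res)
\end{lstlisting}
If the packing step preserves the stream states as intended, this should once more return the four deviates displayed earlier.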
\section{An application}
\label{sec:ltm}
In this section the basic ideas in Sections
\ref{sec:openmp}--\ref{sec:r} are applied to a real problem involving
Monte Carlo integration. The resulting code listings now become rather
lengthy with the consequence that we can only discuss a few key
aspects here.
The setting is that of \citet{bandl} where a latent trait model (LTM)
is being fit to data obtained from subject responses on the
LSAT. There are five dichotomous variables $Y = (Y_1, \ldots, Y_5)$
whose observed values $y = (y_1, \ldots, y_5)$ produce $2^5 = 32$
possible response patterns. If $y$ is one such pattern, it is assumed
that the conditional probability of seeing $y$ given the value $z$ of
a latent variable $Z\sim N(0,1)$ takes the form
\begin{equation}
\label{eq:model1}
p(y|z) = \prod_{i = 1}^5 p_i(z)^{y_i}(1 - p_i(z))^{1 - y_i},
\end{equation}
where
\begin{equation}
\label{eq:model2}
p_i(z) = 1/\left(1 + \exp\left\{\alpha_i + \beta z\right\}\right)
\end{equation}
with $\alpha = (\alpha_1, \ldots, \alpha_5)$ a vector of variable
specific intercepts and $\beta$ a common slope parameter. This leads
to
\begin{eqnarray}
\label{eq:model3}
p(y) &:=& \text{E}_z[p(y|z)] \nonumber\\
&=& \int_{-\infty}^{\infty} p(y|z)\phi(z)dz
\end{eqnarray}
with $\phi$ the standard normal density. We will focus on
approximation of this integral via Monte Carlo methods. Other
integrals such as those obtained by differentiation of
(\ref{eq:model3}) with respect to $\alpha$ and $\beta$ that arise in
the computation of marginal maximum likelihood estimators can be
handled analogously.
The choice of $\alpha, \beta$ is arbitrary for our purposes and,
accordingly, we take them to be the marginal likelihood estimates
returned from the \texttt{grm} function in the \texttt{R}
\texttt{ltm} package \citep{ltm}; it is these values that will be seen in our
code. What requires somewhat more justification is our focus on an
example with a single latent variable. In general, there may be many
latent variables in an LTM that make the integral in (\ref{eq:model3})
multidimensional. As detailed in Section 1.1 of \citet{lemieux}, the
number of points in product-rule quadrature must grow exponentially
with the dimension of the integrand to maintain a constant rate of
decay for the approximation error. In contrast, the expected error
from a Monte Carlo integral estimator is independent of dimension and
is always of order $1/\sqrt{n}$ with $n$ the number of sampling
points. Thus, one concludes that Monte Carlo integration is more
appropriate in higher dimensions and, for example, univariate
integrals are better and more precisely evaluated by deterministic
quadrature. On the other hand, the basic premise that underlies Monte
Carlo integration remains the same regardless of the dimension:
repeatedly generate values from a probability distribution, evaluate
the integrand at these values and average the results in some general
sense. Consequently, there is little conceptual downside to our use of
a single latent variable in this instance while the payoff in terms of
code length and mathematical simplification is rather substantial.
For any specified pattern $y$, a Monte Carlo estimator of
(\ref{eq:model3}) will typically have the form
\begin{equation}
\label{eq:probfunc}
\mathrm{probFunc}(y) = \sum_{i = 1}^nW(z_i)p(y|z_i)
\end{equation}
for $z_1, \ldots, z_n$ deviates drawn from a suitable proposal
distribution and $W$ a corresponding weight function. The serial
aspect of our computation problem is primarily the evaluation of
(\ref{eq:probfunc}). We will therefore
deal with that issue first before turning to the parallelization step.
In both cases the \texttt{R} and \texttt{C++} code will be developed
in tandem to better illustrate the similarities and differences that
arise when attempting to solve this type of problem in the two
different languages.
The \texttt{C++} code (included in \texttt{funclass.cpp} in the
supplementary material) that evaluates (\ref{eq:probfunc}) is
\lstset{language=C++,basicstyle={\ttfamily\footnotesize},showstringspaces=false}
\begin{lstlisting}[frame=lines,caption={probFunc in C++}, label=probfuncc]
TNT::Array1D<double>
funClass::probFunc(const TNT::Array1D<double>& Z,
                   const TNT::Array2D<int>& patn){
  //pattern probability array used in intermediate calculations
  TNT::Array2D<double> probMat(patn.dim1(), patn.dim2(), 0.);
  //work array
  TNT::Array1D<double> temp(patn.dim1(), 0.);
  //array that will hold the cell probabilities
  TNT::Array1D<double> prob(patn.dim1(), 0.);
  //sum across normal deviates
  for(int k = 0; k < Z.dim(); k++){
    probMat = probMatFunc(Z[k], patn);
    //multiply to get the probability for each pattern
    for(int i = 0; i < patn.dim1(); i++){
      temp[i] = 1.;
      for(int j = 0; j < patn.dim2(); j++)
        temp[i] *= probMat[i][j];
      //importance sampling weight: z ~ N(0, 4)
      temp[i] *= 2.*dnorm(Z[k])
                 /(dnorm(Z[k]/2.)*(double) Z.dim());
    }
    prob += temp;
  }
  return prob;
}
\end{lstlisting}
This function takes and returns objects from the Template Numerical
Toolkit (TNT), which provides implementations of vector (as
\texttt{Array1D} objects) and matrix (as \texttt{Array2D} objects)
classes with overloaded addition, multiplication, and similar
operators. These array classes are implemented as templates, with
the consequence that all of the code is placed in header files that can be
downloaded directly from \url{http://math.nist.gov/tnt}.
The call to \texttt{probMatFunc} that appears here corresponds to a
function that evaluates $p_i(z)^{y_i}(1 - p_i(z))^{1 - y_i}$ across
all (32 in this instance) possible response patterns for a given value
of $z$ and returns the result as an array with columns representing
the different variables in the model. The values that are used for the
$z$ argument are passed into the function in the \texttt{Array1D}
object \texttt{Z}. These are chosen to be pseudo-random normal
deviates with 0 mean and standard deviation 2. This feature in
conjunction with the weights that are applied at the end of each pass
through the \texttt{i} loop produces a simple importance sampler
estimate of the target integral. The code for \texttt{probFunc},
\texttt{probMatFunc} and several utility functions (e.g., the standard
normal density and quantile function designated as \texttt{dnorm} and
\texttt{qnorm}, respectively, in our code) have been collected into a
\texttt{C++} class \texttt{funClass} that contains a class member to
hold the values of the coefficients.
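For completeness, the weight function that this choice implies follows from the usual importance sampling identity. If $g(z) = \phi(z/2)/2$ denotes the density of the $N(0,4)$ distribution from which the $z_k$ are drawn, then
\[
p(y) = \int_{-\infty}^{\infty} p(y|z)\,\frac{\phi(z)}{g(z)}\,g(z)\,dz
\approx \frac{1}{n}\sum_{k = 1}^{n} p(y|z_k)\,\frac{2\phi(z_k)}{\phi(z_k/2)},
\]
which is precisely the factor \texttt{2.*dnorm(Z[k])/(dnorm(Z[k]/2.)*Z.dim())} that is applied inside the \texttt{i} loop.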
The \texttt{R} version of Listing \ref{probfuncc} (see
\texttt{LTM\_parallel.R} in the supplement) is embodied in three
functions. First, there is
\lstset{language=R, basicstyle={\ttfamily\footnotesize}}
\begin{lstlisting}
> PFunc <- function(z, coefVec, patn) {
+   ncells <- nrow(patn)
+   probMat <- ProbMatFunc(z, coefVec, patn)
+   # multiply the probabilities across variables
+   p <- matrix(dnorm(z) * apply(probMat, 1, prod), ncells, 1)
+   p
+ }
\end{lstlisting}
that evaluates (\ref{eq:model1}) across all response patterns (i.e.,
$y$ values) for a given value of the latent variable (i.e., $z$). The
call to \texttt{ProbMatFunc} in this instance is for the \texttt{R}
analog of its \texttt{C++} namesake in Listing \ref{probfuncc}. The
\texttt{patn} argument is an input array that holds the response
patterns as in the \texttt{C++} case. However, in contrast to the
\texttt{C++} development, in this instance the coefficient vector that
is needed for evaluating (\ref{eq:model2}) is treated as an input
variable, named \texttt{coefVec}, that is passed to this and other
functions that use it directly from a driver program.
Next, the function \texttt{PFunc} is vectorized via
\lstset{language=R, basicstyle={\ttfamily\footnotesize}}
\begin{lstlisting}
> VecPFunc <- Vectorize(PFunc, vectorize.args = "z")
\end{lstlisting}
to handle the array case and thereby allow us to use it with
\texttt{apply} for ``averaging'' across normal deviates as in
\lstset{language=R,basicstyle={\ttfamily\footnotesize},showstringspaces=false}
\begin{lstlisting}[frame=lines,caption={ProbFunc in R}, label=probfuncr]
> ProbFuncparLapply <- function(nSim, nP, coefVec, patn) {
+   Z <- qnorm(runif(round(nSim/nP, 0)), 0, 2)
+   prob <- colMeans(apply(VecPFunc(Z, coefVec = coefVec,
+                                   patn = patn),
+                          1, function(y) y/dnorm(Z, 0, 2)))
+ }
\end{lstlisting}
The \texttt{Z} array that appears in this listing is, again, a
vector of pseudo-random normal deviates with 0 mean and standard
deviation 2 that is used in the same weighting scheme as for the
\texttt{C++} code.
The \texttt{C++} \texttt{MPI} code (available from
\texttt{driverLTM\_mpi.cpp} in the supplement) that calculates
(\ref{eq:probfunc}) in parallel looks like
\lstset{language=C++,basicstyle={\ttfamily\footnotesize},
showstringspaces=false}
\begin{lstlisting}
RngStream::SetPackageSeed(seed);
RngStream RngArray[nP];
//array for random numbers
TNT::Array1D<double> Z(nSim[myRank], 0.);
//funClass object
funClass f(pAlpha, beta, patn.dim2());
//array that will hold approximate probabilities
TNT::Array1D<double> prob(patn.dim1(), 0.);
//approximate the cell probabilities on each process
for(int i = 0; i < nSim[myRank]; i++)
  Z[i] = 2.*f.qnorm(RngArray[myRank].RandU01());
prob = f.probFunc(Z, patn);
//now gather up the results on the master process
double* temp;
temp = new double[nP*patn.dim1()];
double* pData_ = new double[patn.dim1()];
for(int i = 0; i < patn.dim1(); i++)
  pData_[i] = prob[i];
MPI::COMM_WORLD.Gather(pData_, patn.dim1(), MPI::DOUBLE,
                       temp, patn.dim1(), MPI::DOUBLE, 0);
if(myRank == 0){
  std::cout << "Integration results:" << std::endl;
  for(int i = 0; i < prob.dim(); i++){
    for(int j = 1; j < nP; j++)
      prob[i] += temp[j*prob.dim() + i];
    prob[i] /= (double)nP;
    std::cout << prob[i] << std::endl;
  }
  std::cout << "time with " << nP << " processes was " <<
    MPI::Wtime() - wtime << std::endl;
}
\end{lstlisting}
Here we have employed the scheme from Section \ref{sec:openmp} for
obtaining independent streams: each process creates the same array
of \texttt{RngStream} objects and then uses the array element that
corresponds to its rank. The number of normal deviates that must be
generated by the different processes is portioned out in an array
\texttt{nSim} that is used in the same manner. The \texttt{funClass}
object \texttt{f} then provides the means for each process to
transform the uniform deviates obtained from the \texttt{RngStream}
object into standard normals and call \texttt{probFunc} to approximate
the integral.
The use of a common \texttt{RngStream} object array has allowed us to
circumvent the need for sending information to the processes. It is
only necessary to collect up the results they produce. One means to
accomplish this directly, without looping, is by use of \texttt{MPI}'s
\texttt{Gather} function, which is the path we have chosen in this
instance. All processes call this function with a pointer argument for
the quantity that is to be passed to the master (or 0) process. These
data are then received (by the master) in a buffer of length
sufficient to hold the collective result (i.e., the product of the
number of processes \texttt{nP} and the length \texttt{prob.dim()} of
the \texttt{prob} array). The pointer that is passed from the worker
processes needs to be one that points to the actual data that is held
in the \texttt{prob} array. This resides in a pointer named
\texttt{data\_} that is a private member of the \texttt{Array1D}
class. However, its content is accessible through an overloaded
\texttt{[]} operator and we used this feature to manually transfer the
information in \texttt{data\_} to the pointer \texttt{pData\_} that
was passed to \texttt{Gather}. Once all the integral approximations
are collected, the master process averages them across processes to
obtain the final approximation.
The \texttt{OpenMP} analog of our \texttt{MPI} code is shown in the
next listing, and is available from \texttt{driverLTM\_openmp.cpp} in
the supplement.
\lstset{language=C++,basicstyle={\ttfamily\footnotesize},
showstringspaces=false}
\begin{lstlisting}
//pointer to Array1D objects that will hold output
//from each process
TNT::Array1D<double>* tempVec = new TNT::Array1D<double>[nP];
//pointer for random numbers and funClass object
double* pZ;
funClass* pf;
//initialize array for cell probabilities
TNT::Array1D<double> prob(patn.dim1(), 0.);
//approximate the cell probabilities
#pragma omp parallel private(myRank, pZ, pf)
{
  myRank = omp_get_thread_num();
  pZ = new double[nSim[myRank]];
  pf = new funClass(pAlpha, beta, patn.dim2());
  for(int i = 0; i < nSim[myRank]; i++)
    pZ[i] = 2.*pf->qnorm(RngArray[myRank].RandU01());
  tempVec[myRank]
    = pf->probFunc(TNT::Array1D<double>(nSim[myRank],
                                        pZ), patn);
  delete[] pZ;
  delete pf;
}
for(int i = 0; i < nP; i++){
  prob += tempVec[i];
}
std::cout << "Integration results:" << std::endl;
for(int i = 0; i < patn.dim1(); i++){
  prob[i] /= (double)nP;
  std::cout << prob[i] << std::endl;
}
std::cout << "time with " << nP
          << " processes was " << omp_get_wtime() - wtime
          << std::endl;
\end{lstlisting}
An array of \texttt{RngStream} objects is used to generate the random
uniforms for each process exactly as in Section \ref{sec:openmp}. What
differs from the \texttt{MPI} treatment is the way we have used
pointers for the \texttt{funClass} object, the array of standard normals,
and storage of the output from \texttt{probFunc}. By designating the
first two of these pointers as \texttt{private} in the
\texttt{private} clause at the start of the parallel section, we avoid
any race conditions that might arise in calling constructors. Each
thread can safely create the object and array they will need to
perform their task. The now familiar mapping of a process's
rank to an array element index is used to store the \texttt{Array1D}
objects that each thread produces with \texttt{probFunc} in a safe
location provided by the \texttt{Array1D} pointer. After all threads
collapse back to the master at the end of the parallel region, their
output is accumulated with the overloaded addition assignment operator
of the \texttt{Array1D} class and then averaged.
In the \texttt{R} setting our parallel implementation takes the form
\lstset{language=R, basicstyle={\ttfamily\footnotesize}}
\begin{lstlisting}
> library(parallel)
> DriverLTMparLapply <- function(patn, coefVec, nSim, nP) {
+   cl <- makeCluster(nP)
+   clusterSetRNGStream(cl, 123)
+   clusterExport(cl, c("patn", "coefVec", "nSim", "nP"))
+   clusterExport(cl, c("ProbMatFunc", "PFunc", "VecPFunc",
+                       "ProbFuncparLapply"))
+   result <- parLapply(cl, seq_len(nP),
+                       function(...) ProbFuncparLapply(nSim,
+                                       nP, coefVec, patn))
+   stopCluster(cl)
+   # now average across threads
+   prob <- vector(mode = "double", length = nrow(patn))
+   for (i in 1:nP) {
+     prob <- prob + result[[i]]
+   }
+   prob/nP
+ }
\end{lstlisting}
in a cluster context and
\lstset{language=R, basicstyle={\ttfamily\footnotesize}}
\begin{lstlisting}
> library(parallel)
> library(rstream)
> DriverLTMmclapply <- function(patn, coefVec, nSim, nP) {
+   rngList <- c(new("rstream.mrg32k3a",
+                    seed = c(1806547166, 3311292359, 643431772,
+                             1162448557, 3335719306, 4161054083),
+                    force.seed = TRUE),
+                replicate(nP - 1, new("rstream.mrg32k3a")))
+   result <- mclapply(rngList, ProbFuncmclapply, nSim, nP,
+                      coefVec, patn, mc.cores = nP)
+   # now average across processes
+   prob <- vector(mode = "double", length = nrow(patn))
+   for (i in 1:nP) {
+     prob <- prob + result[[i]]
+   }
+   prob/nP
+ }
\end{lstlisting}
with
\lstset{language=R, basicstyle={\ttfamily\footnotesize}}
\begin{lstlisting}
> ProbFuncmclapply <- function(rngObj, nSim, nP, coefVec, patn) {
+   Z <- qnorm(rstream.sample(rngObj, round(nSim/nP, 0)), 0, 2)
+   prob <- colMeans(apply(VecPFunc(Z, coefVec = coefVec,
+                                   patn = patn),
+                          1, function(y) y/dnorm(Z, 0, 2)))
+ }
\end{lstlisting}
for shared memory purposes. Note that the various constants and
functions must be communicated to the workers in a cluster with
\texttt{clusterExport} and that we have used the \texttt{rstream}
package in the same manner as Section \ref{sec:r} to ensure proper
stream initialization when using \texttt{mclapply}. Both
\texttt{parLapply} and \texttt{mclapply} return lists that hold the
output from each of the workers. The list elements are then averaged to
obtain the final approximation.
The programs print matching output for the integral approximations when the number of threads is held constant. For example, using four threads with the \texttt{R} \texttt{mclapply} function produces
\lstset{language=R, basicstyle={\ttfamily\footnotesize}}
\begin{lstlisting}
> DriverLTMmclapply(patn,coefVec,nSim,4)
[1] 0.010591032 0.014465339 0.003321068 0.007459930 0.008497401
[6] 0.019087239 0.004382200 0.016301170 0.004348307 0.009767360
[11] 0.002242468 0.008341667 0.005737658 0.021343283 0.004900161
[16] 0.031011491 0.012769320 0.028683011 0.006585274 0.024496295
[21] 0.016849315 0.062677083 0.014389904 0.091068923 0.008622165
[26] 0.032073241 0.007363630 0.046601970 0.018840844 0.119237442
[31] 0.027375482 0.310245431
\end{lstlisting}
The \texttt{C++} \texttt{OpenMP} program with four threads produces identical output, as expected:
\lstset{language=csh,basicstyle={\ttfamily\footnotesize}}
\begin{lstlisting}
$ g++ -fopenmp RngStream.cpp funclass.cpp RngStreamSupp.cpp
driverLTM_openmp.cpp -o openmp_ltm
$ ./openmp_ltm 4
Integration results:
0.010591032 0.014465339 0.003321068 0.007459930 0.008497401
0.019087239 0.004382200 0.016301170 0.004348307 0.009767360
0.002242468 0.008341667 0.005737658 0.021343283 0.004900161
0.031011491 0.012769320 0.028683011 0.006585274 0.024496295
0.016849315 0.062677083 0.014389904 0.091068923 0.008622165
0.032073241 0.007363630 0.046601970 0.018840844 0.119237442
0.027375482 0.310245431
\end{lstlisting}
Though not shown, the \texttt{C++} \texttt{MPI} and \texttt{R} \texttt{parLapply} programs also produce the same output. When the number of threads is changed, the approximations change as well, since the random numbers are then generated from different streams.
We carried out a few performance comparisons using the \texttt{MPI}
and \texttt{OpenMP} \texttt{C++} code and the \texttt{mclapply} and \texttt{parLapply}
\texttt{R} code for the Monte Carlo integration application. The average speedups
(i.e., the ratios of serial to parallel run times) and run times (in
parentheses) are given in Table \ref{tab:t1}. In all cases we used
five runs with $10^5$ sampling points (i.e., \texttt{nSim = $10^5$}) and
1, 2, 4 and 8 processors. While all of the programs perform well, near-linear speedup is achieved with \texttt{mclapply} in \texttt{R} (using the \texttt{rstream} package) and with \texttt{MPI} in \texttt{C++}. As expected, the \texttt{C++} programs execute at least an order of magnitude faster than the \texttt{R} programs.
\begin{table}[h!]
\begin{center}
\caption{Speedups and (run times, in seconds) for \texttt{C++} and \texttt{R} code}
\label{tab:t1}
\begin{tabular}{ccccc} \hline
Processors & \texttt{mclapply} & \texttt{parLapply} & \texttt{MPI} & \texttt{OpenMP}
\\
\hline
1 &1.000 &1.000 &1.000 &1.000 \\
&(77.941)& (82.541) & (2.254) & (2.199) \\
2&2.070 & 1.950 &2.014 & 1.430 \\
& (37.651)&(42.319) & (1.119) & (1.538)\\
4 &4.117& 3.597 & 4.002 & 3.043 \\
& (18.930)&(22.944) & (0.563) & (0.723) \\
8 &7.884& 5.549 &7.545& 6.595\\
& (9.886)&(14.874) & (0.299) & (0.333) \\
\hline
\end{tabular}
\end{center}
\end{table}
As a closing thought, we mention that the accuracy obtained with $10^5$
samples can be roughly compared to that of an 18 (uniformly spaced)
point trapezoidal quadrature rule in one dimension. However, a grid of
$10^{10}$ points for a product trapezoidal rule would be required to
remain competitive with the same Monte Carlo scheme in 8 dimensions,
for example.
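The rough calculation behind these figures equates the $O(n^{-1/2})$ Monte Carlo error with the $O(N^{-2/d})$ error of an $N$-point product trapezoidal rule in $d$ dimensions:
\[
n^{-1/2} = N^{-2/d} \quad \Longrightarrow \quad N = n^{d/4},
\]
so that $n = 10^5$ corresponds to $N = 10^{5/4} \approx 18$ points when $d = 1$ and to $N = 10^{10}$ points when $d = 8$.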
\section{Summary}
\label{sec:summary}
The \texttt{RngStreams} software package provides one freely available
solution for creating independent random number streams for simulation
experiments that are conducted in a parallel computing
environment. The goal of this paper has been to provide an
introduction to its use in both distributed and shared memory
settings. We have provided minimal working examples along with a more
detailed application to illustrate the potential of
\texttt{RngStreams}.
While \texttt{RngStreams} is easy to use, some care is necessary to
ensure that streams are distributed correctly to different
processes. We have accomplished this by using an array of
\texttt{RngStream} objects where the array index is employed to
uniquely determine which process may access each object. This scheme
is guaranteed to produce independent streams in both \texttt{OpenMP}
and \texttt{MPI}. In the latter context, we have described another
option that creates seeds on the master process and then distributes
them to all the worker processes. There are memory savings from this
latter approach that come with the cost of additional inter-processor
communication.
In \texttt{R} we demonstrated the use of \texttt{RngStreams} through
both the \texttt{parallel} and \texttt{rstream} packages. We explored
the built-in functionality for \texttt{RngStreams} in
\texttt{parallel} and demonstrated how \texttt{rstream} may be used
within \texttt{parallel} to improve the portability of the software.
\section*{Acknowledgements} Some of the computations
were carried out on the Saguaro cluster at Arizona State
University. The authors are grateful for comments from the anonymous referees
that led to improvements in the paper.