Restrict Multithreaded Programs When Running in Parallel

A basic and safe rule of thumb for parallel processing in data analysis and scientific computing is: don't try to do more things at once than you have compute cores. This can be subtly hard to do, and to understand, with modern software and computer architectures, however.

Your personal computer can run many, many things (processes, threads) simultaneously because "typical" workloads are mainly idle at the timescale of the CPU clock. For example, my 4-core i7 MacBook Pro has 411 things running right now. Analytics work, however, is often dominated by arithmetic operations that can saturate a computer's instruction stream. You basically want to think of cores/processes/threads in a 1-1 fashion: making the OS "switch context" between one thread and another is very expensive relative to the speed of arithmetic, and will really slow down your work.
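Before loading up a machine, it helps to know how many cores it actually has. A quick sketch for Linux machines like the yens (nproc counts logical CPUs; lscpu shows the physical/logical breakdown):

        # count logical CPUs, then show the physical-vs-logical layout
        nproc
        lscpu | grep -E '^(CPU\(s\)|Thread|Core|Socket)'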

This can become a subtle problem when we use programs that automatically exploit multicore parallelism. That is, if we simultaneously run a script that executes a program that itself asks for multiple threads to do work, we can end up creating contention for resources and cannibalizing our own work. There are many such programs in use at GSB, most prominently perhaps Stata/MP and MATLAB. 

Let's review this concept using a basic example in MATLAB (2016b and 2018b). Most of our results come from running MATLAB in AWS EC2 on a 244GB, 16 physical / 32 logical core machine much like the yen servers, but without shared computing, so we know the only thing running is our example. We also show results from one of the yen servers, yen2, when it was quiet and nothing else was running.

We'll show that if you want to run multiple computational workloads in parallel, i.e. "concurrently", you should explicitly restrict the number of threads your code uses. Using the default multithreading settings is often close to the worst thing you could ask for. Running single-threaded is safe and easy, but not necessarily optimal. If you want to load up a machine with as many concurrent tasks as cores, you're probably best off running each in a single thread only. Unfortunately, the best possible setting for the number of threads is hard to determine in general, and probably entirely case-specific, depending on both the code and the hardware. The TL;DR is that default threading settings are probably really, really bad for you when running many things at once.

While we test this in MATLAB, consider these comments about Stata/MP from Stata itself:

In general, you will get the best performance by using all processors available, leaving set to the default. If you are running a large Stata job in the background, however, you may want to reduce the maximum number that Stata/MP will use to have better performance in your foreground tasks. If you are running two large Stata jobs in the background, you may get slightly better performance if you restrict each to using half the number of processors.

Notice the suggestion that you halve the number of processors for two jobs; that is, run so that together the jobs use only the number of cores available, presuming Stata/MP is licensed to use all the cores on the machine (i.e., "using all processors available").

None of this applies to large-scale work that requires multithreading to undertake successfully at all. But you shouldn't expect to be able to run such work concurrently on a single machine. 

Our Example Problem

Matrix inversion or, rather, solving linear systems is a fundamental problem in computing. This is, for example, what happens when we compute OLS-style linear regressions (one way or another). MATLAB is perfect for, and in fact built for, this type of problem. It probably couldn't be better at doing it. So we'll use, time, and parallelize the following simple function: 

        function [] = invert( N )
            % solve A x = b for random A (N x N) and b (N x 1)
            x = rand(N,N) \ rand(N,1);
        end

This function creates a random NxN matrix "A" and Nx1 right-hand side "b" (named only in the comment) and solves for "x" satisfying "Ax = b". We can time calls to this function using the built-in timeit. Specifically, the simple MATLAB program

        fprintf('time: %0.6f\n',timeit(@()(invert(5000))));       

will run invert(5000) multiple times and report back the median runtime, with something like

        time: 0.830151

This is a real outcome on our 32-core, 244GB EC2 instance, albeit just a sample. 

We will run this function from a bash shell or script essentially as follows: 

        matlab -r "fprintf('time: %0.6f\n',timeit(@()(invert(${N})))); exit" \
            | sed -n '/time: /p' >> ${LOGFLE}

This command starts MATLAB and runs our timing call, using sed to print only the time line when done, all as a single command line call. We presume the bash variables N and LOGFLE exist. For now, say we put this command in a file script__.sh.
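For concreteness, here is a minimal sketch of what script__.sh might contain (the defaults here are our own illustrative assumptions):

        #!/bin/bash
        # minimal sketch of script__.sh; N and LOGFLE are presumed set,
        # with illustrative defaults supplied here
        N=${N:-5000}
        LOGFLE=${LOGFLE:-out__.txt}
        matlab -r "fprintf('time: %0.6f\n',timeit(@()(invert(${N})))); exit" \
            | sed -n '/time: /p' >> ${LOGFLE}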

Running Multithreaded

By default, MATLAB obtains this result by spreading the calculation over as many of the (physical) cores on the machine as possible. You can see what MATLAB thinks it can use with the maxNumCompThreads variable/function:

        >> maxNumCompThreads
        
        ans = 
            
            16

The answer is 16, not 32, because we haven't disabled hyperthreading on the EC2 instance (which we really should; we examine this later).

A simple "watcher" script helps us see this outside of MATLAB. Open a text editor and create the following file: 

        #!/bin/bash
        # run the given script in the background and watch its MATLAB processes
        bash ${1} > out.log & 
        PID=$!
        while [[ 1 ]] ; do 
                T=$( date +%FT%T )   # timestamp for this sample
                top -b -n 1 | sed -n "/${USER}.*MATLAB/p" \
                    | awk -v T="${T}" '{ print T"\t"$1"\t"$9"\t"$10"\t"$11 }'
                E=$( ps ${PID} | wc -l ); E=$(( E - 1 ));
                if [[ ${E} -le 0 ]] ; then echo "done"; break ; fi
                sleep 1
        done

What this script does is run another script, passed as an argument ("${1}"), and, every second ("sleep 1") until the process stops (the "E=$..." and "if [[ ${E}" lines), print the PID, CPU%, MEM%, and CPU-time columns of top for any MATLAB processes running as the same user, prepending a timestamp to each row (the complicated top|sed|awk business). Executing

        ./watch.sh script__.sh

prints this sequence of reads out to the console until the process completes. For example, 

        2018-11-01T20:46:10 60764 27.8 0.0 0:00.66
        2018-11-01T20:46:11 60764 188.2 0.1 0:01.93
        2018-11-01T20:46:12 60764 72.2 0.1 0:03.18
        2018-11-01T20:46:14 60764 5.6 0.1 0:03.22
        2018-11-01T20:46:15 60764 105.6 0.1 0:03.87
        2018-11-01T20:46:16 60764 300.0 0.2 0:09.95
        2018-11-01T20:46:17 60764 566.7 0.4 0:15.81
        2018-11-01T20:46:19 60764 164.7 0.3 0:24.15
        2018-11-01T20:46:20 60764 1500 0.4 0:32.11
        2018-11-01T20:46:21 60764 164.7 0.3 0:40.63
        2018-11-01T20:46:23 60764 1056 0.4 0:48.77
        2018-11-01T20:46:24 60764 923.5 0.4 0:58.26
        2018-11-01T20:46:25 60764 105.9 0.4 1:04.02
        2018-11-01T20:46:26 60764 168.8 0.2 1:13.44
        2018-11-01T20:46:28 60764 100.0 0.3 1:14.71
        2018-11-01T20:46:29 60764 5.6 0.3 1:14.77
        done.


You can see that this run uses more than one core's worth of CPU by looking at the third column, which often shows a CPU% higher than 100%. top reads each (logical) CPU core as 100%, so using all the CPUs on this machine all the way would read as 3200%. This run used at least 15 cores' worth of CPU around 20:46:20 (UTC), and likely spread over more cores, since it probably didn't drive every core it touched all the way to 100%. Fifteen cores is pretty close to 16, though, suggesting that MATLAB is indeed trying to maximize parallelism with its default settings. The actual linear system solve time output from MATLAB is, as above,

        time: 0.827973

This is the median solve time over multiple runs of invert, although the actual number is irrelevant. We'll be interested in relative figures, not absolute ones. 

Running Single-Threaded

Let's tell MATLAB not to run on multiple cores and see what happens. We can easily restrict MATLAB to a single computational thread with the extra command line flag -singleCompThread:
 
        matlab -r "fprintf('time: %0.6f\n',timeit(@()(invert(${N})))); exit" \
            -singleCompThread | sed -n '/time: /p' >> ${LOGFLE}

Let's say we put this in script_1.sh, and run

    
    ./watch.sh script_1.sh
        
A sample run prints 

        2018-11-01T20:53:40 71966 0.0 0.0 0:00.68
        2018-11-01T20:53:42 71966 162.5 0.1 0:01.86
        2018-11-01T20:53:43 71966 100.0 0.1 0:03.13
        2018-11-01T20:53:44 71966 5.9 0.1 0:03.26
        2018-11-01T20:53:45 71966 100.0 0.1 0:03.62
        2018-11-01T20:53:47 71966 94.1 0.3 0:05.09
        2018-11-01T20:53:48 71966 100.0 0.3 0:06.33
        2018-11-01T20:53:49 71966 100.0 0.2 0:07.59
        2018-11-01T20:53:50 71966 94.4 0.3 0:08.84
        2018-11-01T20:53:52 71966 100.0 0.3 0:10.09
        2018-11-01T20:53:53 71966 94.4 0.3 0:11.34
        2018-11-01T20:53:54 71966 100.0 0.3 0:12.60
        2018-11-01T20:53:55 71966 100.0 0.2 0:13.88
        2018-11-01T20:53:57 71966 100.0 0.3 0:15.12
        2018-11-01T20:53:58 71966 105.9 0.3 0:16.37
        2018-11-01T20:53:59 71966 100.0 0.3 0:17.63
        2018-11-01T20:54:00 71966 100.0 0.3 0:18.89
        2018-11-01T20:54:02 71966 100.0 0.3 0:20.14
        2018-11-01T20:54:03 71966 94.4 0.3 0:21.40
        2018-11-01T20:54:04 71966 100.0 0.3 0:22.67
        2018-11-01T20:54:06 71966 111.8 0.2 0:23.95
        2018-11-01T20:54:07 71966 0.0 0.2 0:24.02
        done.


and the associated solve time is 

        time: 3.012891

This makes decent sense, of course: running in a single thread, on a single core, should take longer. Outside of some small bumps, the third column in our top printout suggests that this run uses more or less one core (i.e., numbers approximately at or below 100). MATLAB is probably restricting only its computational thread, and still using some extra resources to manage the computation. The memory usage (fourth column) is also about the same, actually a bit less; running on multiple cores could only increase memory usage, not decrease it.

From this view, it kind of looks like running single-threaded would be a big waste: using the default multithreaded MATLAB we could solve almost 4 systems to every one we can solve with single-threaded MATLAB. (If you want an early hint that this is wrong, compare the fifth columns -- actual CPU time used -- in the single- vs multi-threaded examples above.) 

Parallelization and Contention

The big fallacy in this line of thinking is as follows: multithreaded (default) MATLAB tried to use the whole machine, or at least as much of it as it thought it could: 16 (physical) cores. The single-threaded MATLAB used only a single core (for computing, at least). We got that "4x" faster performance by using the whole machine, and we don't have another three machines to exploit.

As an aside, notice that it wasn't "16x", or anything even close to it. We threw 16x the resources at the problem, and it only got 4x faster. This is about as good as it gets, at least on EC2: later we examine MATLAB with a variety of computational thread counts and reliably see solve time reductions of about 20% of the increase in resources.

We can run multiple solves concurrently, i.e. in parallel, using GNU parallel, as follows. Make a text file script__.sh with

        #!/bin/bash
        dorun( ) {
            matlab -r "fprintf('time: %0.6f\n',timeit(@()(invert(${N})))); exit" \
                | sed -n '/time: /p' >> ${LOGFLE}
        }
        export N=5000
        export -f dorun
        T=10
        export LOGFLE=out__.txt
        > ${LOGFLE}
        parallel -j${T} -N0 --no-notice dorun ::: $( seq 1 ${T} )
        cat ${LOGFLE}

This script uses parallel to run T concurrent jobs, each executing the linear system solve timer in MATLAB. Running this script prints

    time: 5.956086
    time: 7.928478
    time: 7.784545
    time: 6.637543
    time: 6.751798
    time: 7.037996
    time: 6.936073
    time: 7.391983
    time: 6.760437
    time: 8.270718

the results of 10 different, concurrent runs. The first thing to notice is that the runtime has gotten much, much worse: a single representative solve now takes 6-8 seconds, instead of less than a second -- about 8 times slower. If we instead run script__.sh with T=1 10 times in a row, we get

    time: 0.899382
    time: 0.840206
    time: 0.798558
    time: 0.788697
    time: 0.822649
    time: 0.785605
    time: 0.807331
    time: 0.814396
    time: 0.834981
    time: 0.781169

So the computations appear not only to take longer but also to have more variable runtimes. Specifically, the 10 concurrent runs have twice the mean-normalized range (max-min) of the serialized runs.
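You can check figures like this directly from the log; here is a small sketch using awk to compute the mean-normalized range (max-min, divided by the mean) of the reported times:

    # compute (max - min) / mean over the solve times in ${LOGFLE}
    sed -n 's/^time: //p' ${LOGFLE} | awk '
        NR == 1 { min = $1; max = $1 }
        { s += $1; if ($1 < min) min = $1; if ($1 > max) max = $1 }
        END { printf "mean-normalized range: %0.3f\n", (max - min) * NR / s }'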

We can time these runs using a call like 

    time ./script__.sh

This particular example takes about 40.5 "real" or "wall" seconds to run (including "overhead" operations to set up and run the experiment).

Now let's try this with the single threaded version. Put

        #!/bin/bash
        dorun( ) {
            matlab -r "fprintf('time: %0.6f\n',timeit(@()(invert(${N})))); exit" \
                -singleCompThread | sed -n '/time: /p' >> ${LOGFLE}
        }
        export N=5000
        export -f dorun
        T=10
        export LOGFLE=out__.txt
        > ${LOGFLE}
        parallel -j${T} -N0 --no-notice dorun ::: $( seq 1 ${T} )
        cat ${LOGFLE}

in a text file script_1.sh. Running this script prints

    time: 4.042069
    time: 4.481772
    time: 4.403858
    time: 4.763622
    time: 4.748626
    time: 4.527987
    time: 4.256381
    time: 4.215290
    time: 4.316044
    time: 4.196000

in about 34 "real" seconds. We undertook the same number of solves in about 16% less real time (34 vs. 40.5 seconds), without greatly impacting the estimated time required for any particular, single operation. Specifically, run 10 at a time, our solves each took roughly 45% longer when single-threaded, but roughly 750% longer when multithreaded.

What is happening here is simple to explain: without restricting MATLAB's computation to a single thread/core, MATLAB is trying to run in parallel, but so are we. For every computationally intensive "process" or "thread" we wish to create, we are implicitly creating many: here, 10 concurrent MATLAB processes each defaulting to 16 computational threads ask for 160 compute-bound threads on a machine with only 32 logical cores. That is asking for much more than the computer can actually offer, and the OS will blindly listen to us. What ends up happening is that the OS spends more and more time managing these competing threads, switching control and memory between them, at great expense relative to the math we're telling it to do.

Scaling Behavior

This can get even worse: running 24 multithreaded MATLAB jobs concurrently takes about 1.5 minutes, with solve time estimates between 14 and 20 seconds each. Running 24 single-threaded jobs concurrently takes only about 48 seconds, with solve time estimates between 7 and 9 seconds.

The plot below shows estimated solve time for a range of concurrent process counts, from one to the number of (logical) cores on the machine (32). If we want to run as few as 6 different jobs concurrently, running single-threaded is cheaper. Running 16 or more concurrent processes single-threaded -- half the number of (logical) cores or more -- roughly halves (or better) an individual task's required compute time relative to the default multithreaded MATLAB.



The Unix time command we use also prints the CPU time used by a process (both "user" and "system"), as succinctly described here. We plot this below as the total CPU time (user + system) used to run the same experiment, single-threaded and multithreaded (MATLAB's default). The multithreaded default takes much, much more actual CPU time in all cases (did you review the fifth column in our top prints above?), and thus more resources, to accomplish the same task. These are resources spent not on the scientific goal, but on managing a mess of competing threads, and resources not available for others to use during these runs. We are wasting not only our own time using the defaults, but everyone else's time too.
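For reference, bash's built-in time reports three figures shaped like the following (the values here are illustrative, not from our runs); the CPU time we plot is user plus sys:

    real    0m40.500s
    user    8m12.300s
    sys     0m45.100s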


Yet another way to look at this is plotted below. We take the total CPU time for each case (with more than one process) and divide it by the number of concurrent processes times the CPU time used by a single process. This gives a relative picture of scaling: how well single- vs. multithreaded MATLAB utilizes resources as we run more than one copy of the task concurrently. If the plotted line stays at or around 1, then we have nearly perfectly parallelized compute (in terms of CPU time used, not real or "wall" time). Note that "perfect" here doesn't connote "best"; it means that we are getting results seemingly equivalent to parallelizing across completely independent resources.

The single-threaded MATLAB initially scales very well in this sense, nearly perfectly parallelizing compute up to about 10 concurrent processes. After 10 there is some contention between the single-threaded concurrent processes that increases the work we put in per unit by up to a factor of two (the line rises to 2). The multithreaded default is more stable, hovering around a scaling factor of 1.2-1.3 in all but one case. But what is a bit hidden here is that there is always a penalty for running more: as soon as we execute 2 multithreaded runs concurrently, we incur a 20-30% penalty. There is no range where we scale well in terms of "just" parallelizing the cycles required to do the computational task at hand.



Of Course, It Isn't That Simple

We can generalize this by having our MATLAB invocation specify the number of computational threads to use. Here is the full text of a script, script_n.sh, to do this:

    #!/bin/bash
    dorun( ) {
        if [[ ! -z "${THREADS}" ]] ; then
            ${MATLAB} -r "cd ${EXPDIR}; maxNumCompThreads(${THREADS}); \
                fprintf('time: %0.6f\n',timeit(@()(invert(${N})))); exit" \
                | sed -n '/time: /p' >> ${LOGFLE}
        else
            ${MATLAB} -r "cd ${EXPDIR}; \
                fprintf('time: %0.6f\n',timeit(@()(invert(${N})))); exit" \
                | sed -n '/time: /p' >> ${LOGFLE}
        fi
    }

    export MATLAB=/home/.software/MATLAB-R2016a/bin/matlab
    export EXPDIR=/home/morrowwr/smartpar/matlab
    export N=5000
    export -f dorun
    
    if [[ $# -ge 2 ]] ; then
        P=$2
        export THREADS=$1
    elif [[ $# -eq 1 ]] ; then
        P=10
        export THREADS=$1
    else
        P=10
    fi

    export LOGFLE=/home/morrowwr/smartpar/matlab/out_${P}_${THREADS}.txt
    > ${LOGFLE}
    parallel -j${P} -N0 --no-notice dorun ::: $( seq 1 ${P} )
    cat ${LOGFLE}

We write another script to loop over 1, 2, 4, 8, 16, and 32 threads with 1, 2, 4, 8, 16, 24, and 32 concurrent processes, running script_n.sh for each combination (and format output).
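A driver for this can be sketched as follows (an illustration; the actual script also collects and formats the results):

    #!/bin/bash
    # Run script_n.sh for every combination of thread count and
    # concurrent process count studied here.
    for THREADS in 1 2 4 8 16 32 ; do
        for P in 1 2 4 8 16 24 32 ; do
            bash script_n.sh ${THREADS} ${P}
        done
    done

The easiest way to view solve time results in this set of experiments is with a column chart: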


Running in a single thread is acceptable relative to other numbers of threads, but not best. In a resource allocation sense, single-threaded is dominated: for every number of concurrent processes, there is another choice equally good (up to error) or better. Using 4 threads is always better, and using 2 or 8 is always better or the same. Using 16 or 32 threads is a bad idea when running 8 or more concurrent processes.

There is, however, also overhead in setting up the parallelization of tasks, so our picture is not complete looking just at task time. The "real" or "wall" time the experiment took for each setting is shown below. From this perspective, using 1, 2, or 4 threads looks pretty similar; maybe 8 threads too, although not with 24 or 32 concurrent processes. Again, using 16 or 32 threads is a bad idea when running more than 8 concurrent processes. If what we're really interested in is the time we experience, running single-threaded looks only barely suboptimal, at least when running invert(5000).


Our scaling factor calculation is pretty interesting as well. We see that using 16 or 32 threads scales pretty poorly, again always incurring overhead to run concurrently. Using a single thread nearly perfectly parallelizes with 1, 2, 4, and 8 concurrent jobs, but incurs overhead with 16, 24, and 32 concurrent jobs. Using 2, 4, or 8 threads is where things get interesting: these settings scale very well (except 8 threads with 4 or 8 concurrent processes), even saving CPU time in some cases (scaling factor < 1). This efficiency is probably not something "we", as the "user" of the invert code, would naturally care about, though; it means that other users of the system would benefit from these settings (possibly at some cost to us).


Without Hyperthreading

Intel processors and Linux often include another layer of concurrency: hyperthreading, in which each physical core presents itself as two logical processors and the OS schedules work accordingly. Because "typical" workloads are mostly idle, this is a great idea in general. But when doing analytics it is often a terrible idea. We can turn off this setting in AWS EC2 by following these instructions.
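One way to do this (an assumption on our part, separate from the linked instructions) is to launch the instance with one thread per physical core using EC2's CPU options; the AMI and instance type below are placeholders:

    # Launch with hyperthreading effectively disabled: 16 physical
    # cores, one thread per core (AMI and type are placeholders).
    aws ec2 run-instances \
        --image-id ami-0123456789abcdef0 \
        --instance-type r4.8xlarge \
        --cpu-options "CoreCount=16,ThreadsPerCore=1"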

Perhaps not surprisingly at this point, the picture of performance we get is different yet again when working only with the 16 physical cores on the machine. The three plots described above -- solve time, real/wall time, and scaling factors -- are shown below. The default, 16 threads, is basically never best. Only with a single process -- not running in parallel -- is the default the fastest. Any concurrency incurs a penalty large enough to make the default slower than some other thread setting. Unless we are packing the whole machine, we should run with 2 or 4 threads; if we are, we should run with a single thread.


Real time is plotted below. If we're just concerned about real time, we can probably gain when running only a few concurrent processes by using 4 or 8 threads. But acknowledging some variance in these figures suggests that if we're going to use half the machine or more, we might as well run single-threaded to get results as fast as we could expect.


Everything except the default scales pretty well, too. Single-threaded MATLAB essentially perfectly parallelizes, while 2, 4, and 8 threads actually gain some efficiency (scaling < 1) when running 8 or 16 jobs concurrently. Note the difference between this picture and the one above, with hyperthreading: in particular, running single-threaded scales well up to using all the cores, unlike with hyperthreading.



Running on the yen Servers

We have had an opportunity to run our experiment on yen2 with nothing else running. yen2 has MATLAB (2018b), and the default setting there is 32 threads for the 32 physical (non-hyperthreaded) cores.

The resulting solve times paint a familiar picture: for a small number of concurrent runs, the default reduces solve time relative to single-threaded runs, but with many concurrent runs the threads just interfere with each other. On yen2, however, the default isn't even the fastest for a single process -- something we saw on our EC2 machine only with concurrent processes. The default is actually the worst for 8 or more concurrent processes. And when we try to fill up the machine with 32 concurrent processes, the default is 4x slower than the best option, which is (wait for it...) running single-threaded.




Overall, the optimal number of threads for the fastest solve time is patterned, as detailed in the table below:

    Concurrent Processes        1    2    4    8   16   32
    Optimal Number of Threads  16    8    8    4    2    1
    Total Number of Threads    16   16   32   32   32   32

The best setting isn't perfectly correlated with the number of cores: the (probable) total number of compute threads -- the last row of the table, the product of the top two rows -- isn't always 32. But if we are going to do as much as we can on the machine, threads times processes equals cores is probably not a bad heuristic to use.
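In shell terms the heuristic is just integer division; a minimal sketch, where P is the number of concurrent jobs you plan to run:

    # Divide the machine's cores evenly among P concurrent jobs,
    # with at least one computational thread each.
    CORES=$( nproc )
    THREADS=$(( CORES / P ))
    [[ ${THREADS} -lt 1 ]] && THREADS=1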

We plot real ("wall") time again, below. If what we are worried about is real time, and we expect to run 10 or more concurrent processes, we might as well run single-threaded. If we are running fewer than 10 at a time, we could gain a bit -- but not a tremendous amount -- by running multithreaded, including with the default, which is real-time fastest for 1, 2, and 4 concurrent processes. Parallelizing the processes themselves appears to have a decent cost, large enough that gains we might see on a task-by-task basis can be hidden from the real time we experience.


Virtually everything scales well on the yens. Running single-threaded again nearly perfectly parallelizes. The trend actually appears to gain efficiency with concurrency, possibly a feature of the very large RAM installed in yen2 (1.5 TB). On our EC2 machine, with less memory (244 GB), the OS might have had to work harder to load and store the data needed by all the concurrently running threads. Keep in mind, though, that scaling isn't the whole game: the default 32-thread setting scales actual CPU time best in the end, but is still the slowest in terms of both task (solve) time and real (wall) time.



Conclusions

Unless you are running one or two jobs -- on an otherwise empty machine -- you should probably be very skeptical of the efficiency of default multithreading settings in environments like MATLAB and Stata/MP. You are likely to get more work done, in less real time, using fewer CPU cycles, by restricting your environment to a single thread, or at least just a few. Ultimately, though, this is a very complicated topic. If you have to run many copies of a job -- hundreds, say, or any reasonable multiple of the number of cores -- it's probably worth your time to run some tests to see how the multithreading settings in your environment determine speed and resource consumption.

Our analysis isn't too specific to the particular case of linear system solving, or to the size of the linear systems. We see qualitatively equivalent results -- left out of this article -- running invert(1000) and invert(10000) instead of invert(5000) in our scripts. We also see similar results running MATLAB's fft example with various problem sizes. Most importantly, however, real analytics work is likely to be much more complicated than the simple cases that can support articles like this. If we can say anything general, it is probably that multithreading will perform even worse with more complicated programs, absent significant effort to organize parallelized computing very carefully.

As mentioned in the introduction, none of this applies to large-scale work that requires multithreading to undertake successfully at all. But you shouldn't expect to be able to run such work concurrently on a single machine. 

Addendum: How Much Do We Get From Multithreading?

Below we plot a scaling factor for solve time with a single process, that is, no concurrency, over different multithreading settings. This examines the ideal gains we get from parallelization, in MATLAB for linear system solves of course, and how they relate to the resources we deploy to parallelize. Our scaling factor is as follows: for N computational threads, let S(N) denote the median solve time reported by timeit. Define a scaling factor F(N) = S(N)/(S(1)/N) = N S(N)/S(1). This factor compares the N-thread solve time with a perfect N-fold division of the single-threaded solve time (S(1)/N): F(N) = 1 is ideal, and larger values mean we convert proportionally less of the added resources into speedup. We plot F(N) versus N for each machine/setting we tried. This includes other matrix sizes in invert and our fft experiments. We also plot a dashed line illustrating what the ideal, resources added equalling improvements (S(N) = S(1)/N), would look like on this plot.
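For example, taking the sample solve times reported above for the EC2 machine (S(1) ≈ 3.01 seconds single-threaded, S(16) ≈ 0.83 seconds with the default 16 threads),

    F(16) = 16 * S(16) / S(1) ≈ 16 * 0.83 / 3.01 ≈ 4.4

so the default 16-thread solve takes about 4.4x the ideal time, converting roughly a quarter of the added resources into speedup.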




These scaling factors are consistently below "perfect" conversion of resources to improvements. For invert(5000) on EC2, adding resources gives us improvements equal to about 20% of the resources we add. We don't show the analysis, but this number comes from fitting a trendline through the points and observing a scale factor of about 0.2 (in both cases, with R-squared above 0.99). MATLAB scales better, in this sense at least, on yen2, getting speed increases equal to about 27% of the resources we add, except with 32 threads, where we actually get a bit more than 30% of what we put in. Smaller problems like invert(1000) or fft(1500) yield a higher conversion of added resources to improvement, but still only about 50%. A larger problem, invert(10000), is worse, with about 12% conversion. We don't see any significantly different behavior with fft instead of invert.

All of these results are probably subject to a fair bit of variability, though, and shouldn't be taken too literally.
