Optimizing Matrix Operations
This investigation was inspired by my efforts to optimize computation time for a plant-canopy light-and-energy-balance model called SCOPE. The canopy is broken down into 60 layers, each containing a distribution of leaves at different angles, represented as a matrix of 60 layers by 13 angles of inclination by 36 angles of azimuth. The model is written in MATLAB, and since MATLAB doesn't support 3D matrix operations, the original code used a FOR loop to iterate over each layer. If you're familiar with vector-oriented "scripting" languages, you will not be surprised that this method slowed things down considerably. My solution was to reshape the array into a 2D matrix of 60 side-by-side 13x36 submatrices (so the new matrix is 13 x 2160), which increased the speed 10-fold.
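The reshaping trick can be sketched in numpy. The array dimensions below match the SCOPE canopy (60 layers x 13 inclinations x 36 azimuths), but the weights vector and the operation applied to each layer are made-up stand-ins for illustration:

```python
import numpy as np

# Hypothetical stand-in for the SCOPE canopy array:
# 60 layers x 13 inclination angles x 36 azimuth angles.
rng = np.random.default_rng(0)
canopy = rng.random((60, 13, 36))
weights = rng.random(13)  # illustrative weights over inclination angles

# Looped version: one small matrix operation per layer.
looped = np.stack([weights @ canopy[k] for k in range(60)])

# Reshaped version: lay the 60 layers side by side as a single
# 13 x 2160 matrix and do one large matrix operation instead.
flat = canopy.transpose(1, 0, 2).reshape(13, 60 * 36)
reshaped = (weights @ flat).reshape(60, 36)

assert np.allclose(looped, reshaped)
```

The single large multiplication gives the BLAS one big job instead of 60 small ones, which is exactly where an optimized library earns its keep.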
Having done this in MATLAB, I set up a simple benchmark to see how other languages dealt with this data-size-vs.-speed issue. The answer turned out to be more complicated than I had expected. While I had anticipated the first of the issues below, the remaining factors affecting speed were a bit of a surprise:
The first item is well-known: to the extent that you can take advantage of "consolidated" (vectorized) operations, your interpreted language (MATLAB, R, Python, etc.) will benefit.
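As a generic illustration of the gap (not the benchmark itself), compare an interpreted element-by-element loop with a single consolidated call doing the same arithmetic:

```python
import numpy as np

x = np.arange(100_000, dtype=float)
y = np.arange(100_000, dtype=float)

# Interpreted loop: every iteration pays interpreter overhead.
total = 0.0
for i in range(len(x)):
    total += x[i] * y[i]

# One consolidated call: the same work done inside compiled code.
total_vec = float(x @ y)

# The two agree up to floating-point summation order.
assert abs(total - total_vec) < 1e-9 * total_vec
```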
The next three are interrelated: even though the benchmark is single-threaded, MKL will create its own threads to distribute matrix operations across all available cores if the matrix is larger than a certain size. In my benchmark, multithreading kicked in somewhere between 4,500 and 45,000 elements (I didn't probe this threshold explicitly). So which languages use MKL? Once again, the answer isn't simple. MATLAB uses it transparently. Python's numpy will use it if you're doing matrix operations. R by default does not use MKL, but you can either replace the BLAS in standard R or use Revolution R (now Microsoft R Open), in which case it is always used. The other languages won't use MKL by default unless you install particular libraries.

The last two issues depend on the language. As a simple example, MATLAB for-loops are notoriously slow if you append to a matrix on each iteration: in the tables below, the "best" coding method preallocates the result matrix before the for-loop; the "worst" method does not. In Python, it turns out that the row- and column-means functions do not use the BLAS; a hand-coded replacement that computes the means as a matrix operation is 4-7x faster. In R, on the other hand, the native colMeans function did better than the matrix-algebra method even when using an MKL-enhanced BLAS; in fact, R benefitted less from MKL than Python did. Java 8 and C++ do not have built-in matrix libraries, so I used a native-Java matrix class that I had written for a different project (and translated to C++ for this one). The "best" speed in Java is probably not the best possible, which makes its "native" speed all the more impressive.

The benchmark ran a simple set of functions that compute the column (or row) means of a matrix with a fixed number of columns (or rows). The functions were called using different-sized matrices, but as matrix size increased, the number of for-loop iterations decreased to keep the actual number of math operations constant; see the third table, below.
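The "best" vs. "worst" coding-method distinction described for MATLAB has a direct numpy analogue. This sketch (with made-up sizes) contrasts growing the result array on every iteration, which reallocates and copies it each time, with preallocating it once:

```python
import numpy as np

n, cols = 1000, 36  # illustrative sizes

# "Worst": grow the result by appending a row per iteration,
# reallocating and copying the whole array each time.
grown = np.empty((0, cols))
for i in range(n):
    grown = np.vstack([grown, np.full(cols, float(i))])

# "Best": preallocate the result once, then fill it in place.
pre = np.empty((n, cols))
for i in range(n):
    pre[i, :] = float(i)

assert np.array_equal(grown, pre)
```

The appending version does O(n^2) copying work for n iterations; the preallocated version does O(n).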
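The hand-coded means-as-a-matrix-operation replacement mentioned for Python can be sketched as follows; the matrix here is an arbitrary stand-in using the 13 x 2160 shape from the SCOPE example:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((13, 2160))  # illustrative data

# Built-in reduction: a plain loop over the data inside numpy,
# not routed through the BLAS.
col_means = A.mean(axis=0)

# Hand-coded alternative: express the column means as a
# vector-matrix product, which an MKL-backed BLAS can
# multithread once the matrix is large enough.
ones = np.ones(A.shape[0]) / A.shape[0]
col_means_blas = ones @ A

assert np.allclose(col_means, col_means_blas)
```

Row means work the same way with the ones vector on the right-hand side of the product.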
Speed (seconds) for best coding method
Speed (seconds) for "worst" coding method
Size vs. iterations in the previous tables; the two are balanced so that the number of math operations is constant
