Over the course of my tenure as a student of Performance Modeling, I've become someone who is very interested in the littlest of details. We talk about SIMD instructions, inline assembly and so on, but the reasoning behind when to use these "optimizations" is fundamental.
Modern computing devices are almost always pipelined, which means that instructions can execute concurrently. However, this concurrency only works when the overlapping instructions don't have any true dependencies (read-after-write, or RAW, dependences).
Consider a 5-stage pipeline: each instruction takes 5 units of time (cycles) to execute, and there can be 1 instruction in each stage at the same time. Instructions can therefore complete at a rate of 1 instruction/cycle, which means that a program of N instructions can ideally execute in N+4 cycles: the first instruction needs 5 cycles to make it through the pipeline, and every instruction after it completes one cycle later. For example, 10 instructions would finish in 14 cycles instead of the 50 an unpipelined machine would need.
HOWEVER, if any instruction is dependent on some other instruction that is still in the pipeline, the dependent instruction must be delayed, thereby increasing execution time.
So we have a fundamental lesson: programs with more independent instructions will execute faster.
Does this mean that programs that have inherent dependencies will always execute slowly? The long answer is the field of HPC, but the short answer is: it depends. What is key here is the definition of "independence": two instructions are as good as independent as long as they don't occupy the pipeline at the same time.
So how does algorithm choice relate to this discussion? The algorithm decides which operations have to wait on which others, and therefore how much independence is available. Consider the following two algorithms to add up 10 numbers, stored in an array A.
sum = 0
for i in range(10):
    sum += A[i]
Program 1
sum_1 = A[0]+A[1]
sum_2 = A[2]+A[3]
sum_3 = A[4]+A[5]
sum_4 = A[6]+A[7]
sum_5 = A[8]+A[9]
sum_1 += sum_2
sum_3 += sum_4
sum_1 += sum_3
sum_1 += sum_5
Program 2
Note how Program 2 looks a lot more complicated. Both options perform essentially the same number of add operations, but in Program 1 every add writes to the same output (sum), so each one must wait for the previous one and they cannot be executed at the same time. The adds in Program 2, on the other hand, write to different outputs as much as possible.
Try this with a pipeline diagram to see the difference.
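If drawing the diagram by hand feels tedious, here is a minimal sketch of the same idea in Python. It is not a real pipeline model: it simply assumes each add takes a fixed number of cycles (3 here, an arbitrary choice), that an add can only start once its inputs are ready, and that any number of adds can be in flight at once. The finish time is then just the length of the longest dependency chain.

# Toy latency model: an add takes ADD_LATENCY cycles and may only start
# once its inputs are ready. With no limit on adds in flight, the finish
# time is simply the length of the longest dependency chain.
ADD_LATENCY = 3

def finish_time(deps):
    # deps maps each operation to the list of operations it must wait for
    done = {}
    def ready(op):
        if op not in done:
            done[op] = max((ready(d) for d in deps[op]), default=0) + ADD_LATENCY
        return done[op]
    return max(ready(op) for op in deps)

# Program 1: every add waits on the previous one, a single chain of 10 adds.
program_1 = {i: [i - 1] if i > 0 else [] for i in range(10)}

# Program 2: five independent adds, then a short tree of combining adds
# whose dependencies mirror the code above.
program_2 = {
    "sum_1": [], "sum_2": [], "sum_3": [], "sum_4": [], "sum_5": [],
    "sum_1 += sum_2": ["sum_1", "sum_2"],
    "sum_3 += sum_4": ["sum_3", "sum_4"],
    "sum_1 += sum_3": ["sum_1 += sum_2", "sum_3 += sum_4"],
    "sum_1 += sum_5": ["sum_1 += sum_3", "sum_5"],
}

print(finish_time(program_1))  # 30 cycles: a chain of 10 dependent adds
print(finish_time(program_2))  # 12 cycles: the longest chain is only 4 adds deep

The absolute numbers depend entirely on the assumed latency; the point is that Program 1's finish time grows with the number of elements, while Program 2's grows roughly with the depth of its tree of adds.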
You could try to execute this but I'd be willing to bet that the difference isn't as pronounced, due to the overhead of executing the program.
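If you do want to try it, a crude harness along the lines below (purely illustrative; the function names are made up for this sketch) is enough to make the point. In an interpreted language the loop and interpreter overhead will almost certainly swamp the dependency effect, which is exactly the overhead caveat above.

import random
import timeit

A = [random.random() for _ in range(10)]

def program_1():
    s = 0.0
    for i in range(10):
        s += A[i]
    return s

def program_2():
    s1 = A[0] + A[1]
    s2 = A[2] + A[3]
    s3 = A[4] + A[5]
    s4 = A[6] + A[7]
    s5 = A[8] + A[9]
    s1 += s2
    s3 += s4
    s1 += s3
    s1 += s5
    return s1

# Expect the gap, if any, to be small: overhead dominates at this scale.
print(timeit.timeit(program_1, number=1_000_000))
print(timeit.timeit(program_2, number=1_000_000))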
Below are the execution times of 3 different variants of matrix multiplication.
All three algorithms come from the same specification. Consider 2 matrices A and B with dimensions MxK and KxN respectively.
Their product, C, is an MxN matrix computed by multiplying each element in a row of A with the corresponding element in a column of B and adding up the results: C[i][j] = A[i][0]*B[0][j] + A[i][1]*B[1][j] + ... + A[i][K-1]*B[K-1][j].
This results in the familiar loop structure below.
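Sketched in the same Python-like style as the earlier programs (assuming A, B, a zero-initialized C, and the dimensions M, N, K from above):

# Textbook matrix multiply: C = A x B, with A of size MxK and B of size KxN.
for i in range(M):
    for j in range(N):
        for k in range(K):
            C[i][j] += A[i][k] * B[k][j]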
What do you think about this algorithm? Are the iterations of the loop independent from each other?
Try to work it out for yourself before reading the next section.
It is easy to see that changing the ordering of the loops does not change the end result. The 3 variants we present come from re-ordering the above loops as follows; the update statement within the loop stays the same in all cases:
Full update - Textbook algorithm
Loop Ordering: i, j, k
Each element of C is fully computed before moving on to a different element. Assuming a row-major layout for A and B, the accesses to A in the innermost loop are sequential while those to B are strided. The iterations of the innermost loop are not independent, since they all accumulate into the same element of C.
Full update - Loops switched
Loop Ordering: j, i, k
Each element of C is still fully computed before moving on to a different element. The innermost loop is still over k, so (again assuming a row-major layout) the accesses to A remain sequential and those to B remain strided; what changes is that C is now filled one column at a time. The iterations of the innermost loop are still not independent.
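For reference, a sketch of this variant under the same assumptions as before:

# Loops switched: j is outermost, so C is computed one column at a time.
for j in range(N):
    for i in range(M):
        for k in range(K):
            C[i][j] += A[i][k] * B[k][j]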
Partial update
Loop Ordering: k, i, j
Each element of C is only partially computed before moving on to the next element. Assuming a row-major layout, the accesses to B (and to C) in the innermost loop are sequential, while the single element of A in use is reused as a scalar. Most importantly, each iteration of the innermost loop writes to a different element of C, so the iterations are independent!
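A sketch of the partial-update variant, again under the same assumptions:

# Partial update: for each k, every element of C receives one more partial product.
for k in range(K):
    for i in range(M):
        for j in range(N):
            C[i][j] += A[i][k] * B[k][j]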
And so we see that an algorithm that exposes independent operations will run faster than one that does not, even before considering things such as register allocation, vectorization, tiling, etc. Those techniques will further improve the execution time of the independent algorithm, but the independence is fundamental, and it is the easiest thing to get started with. One caveat/warning: while it is the most accessible optimization, thinking in terms of independent operations takes some getting used to. The partial-update formulation of matrix multiply looks completely convoluted from a math/logic perspective.