Simple (Add, Subtract, Absolute Value, Negation): At least 5x faster than the naïve implementation.
Matrix Multiplication: At least 95x faster than the naïve implementation.
Matrix Power: At least 1700x faster than the naïve implementation.
Removing abstraction barriers to avoid the overhead of setting up stack frames (sketched below).
Manual loop unrolling (sketched below).
Vectorization with SIMD instructions, assuming the CPU supports Intel AVX (sketched below).
Cache blocking and amortized timing for matrix transpose (blocking sketched below).
Using cache-block-friendly indices for matrix multiplication (access order sketched below).
Storing data in a 1D array instead of a 2D array for matrices (sketched below).
Using OpenMP to parallelize loops (sketched below).
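Removing abstraction barriers: a minimal sketch, assuming a hypothetical `matrix` struct with a row-major `double *data` field (the real project's layout may differ). Replacing a per-element accessor call with direct indexing removes a function call, and its stack-frame setup, from the innermost loop.

```c
#include <stddef.h>

/* Hypothetical matrix struct; field names are illustrative. */
typedef struct {
    size_t rows, cols;
    double *data;   /* row-major: element (i, j) lives at data[i * cols + j] */
} matrix;

/* Abstraction barrier: every element access pays for a call and stack frame
 * unless the compiler happens to inline it. */
double get(const matrix *m, size_t i, size_t j) {
    return m->data[i * m->cols + j];
}

double sum_with_barrier(const matrix *m) {
    double s = 0.0;
    for (size_t i = 0; i < m->rows; i++)
        for (size_t j = 0; j < m->cols; j++)
            s += get(m, i, j);
    return s;
}

/* Barrier removed: index the backing array directly in the hot loop. */
double sum_direct(const matrix *m) {
    double s = 0.0;
    size_t n = m->rows * m->cols;
    for (size_t i = 0; i < n; i++)
        s += m->data[i];
    return s;
}
```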
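Manual loop unrolling, sketched on an element-wise add over plain arrays (the `result`, `a`, `b`, `n` names are illustrative, not the project's actual signature). Handling four elements per iteration cuts loop overhead and exposes independent operations to the compiler.

```c
#include <stddef.h>

/* Element-wise add with the loop unrolled by a factor of 4.
 * A scalar tail loop handles lengths that are not multiples of 4. */
void add_unrolled(double *result, const double *a, const double *b, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        result[i]     = a[i]     + b[i];
        result[i + 1] = a[i + 1] + b[i + 1];
        result[i + 2] = a[i + 2] + b[i + 2];
        result[i + 3] = a[i + 3] + b[i + 3];
    }
    for (; i < n; i++)          /* tail elements */
        result[i] = a[i] + b[i];
}
```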
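SIMD vectorization with AVX intrinsics, again sketched on the element-wise add: each 256-bit operation processes four doubles at once. This sketch assumes compilation with AVX enabled (e.g. `-mavx`) and uses unaligned loads and stores for simplicity.

```c
#include <immintrin.h>
#include <stddef.h>

/* Element-wise add using 256-bit AVX registers (4 doubles per operation). */
void add_avx(double *result, const double *a, const double *b, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m256d va = _mm256_loadu_pd(a + i);   /* load 4 doubles from a */
        __m256d vb = _mm256_loadu_pd(b + i);   /* load 4 doubles from b */
        _mm256_storeu_pd(result + i, _mm256_add_pd(va, vb));
    }
    for (; i < n; i++)                         /* scalar tail */
        result[i] = a[i] + b[i];
}
```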
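Cache blocking for the transpose, sketched under the same row-major assumption: working on small square tiles keeps both the source rows and the destination columns resident in cache instead of evicting a line on every strided write. The tile size of 32 is an illustrative choice, and the amortized-timing part of that item is a benchmarking detail not shown here.

```c
#include <stddef.h>

#define BLOCK 32   /* illustrative tile size; tune to the cache in use */

/* Transpose src (rows x cols, row-major) into dst (cols x rows), tile by tile. */
void transpose_blocked(double *dst, const double *src, size_t rows, size_t cols) {
    for (size_t ib = 0; ib < rows; ib += BLOCK) {
        for (size_t jb = 0; jb < cols; jb += BLOCK) {
            size_t imax = ib + BLOCK < rows ? ib + BLOCK : rows;
            size_t jmax = jb + BLOCK < cols ? jb + BLOCK : cols;
            for (size_t i = ib; i < imax; i++)
                for (size_t j = jb; j < jmax; j++)
                    dst[j * rows + i] = src[i * cols + j];
        }
    }
}
```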
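The 1D layout and the cache-friendly multiplication indices, sketched together: each matrix is one contiguous row-major array (element (i, j) at `i * cols + j`), and the i-k-j loop order walks both `b` and `result` row by row, so the innermost loop touches consecutive memory instead of striding down a column. Cache blocking can be layered on top of this ordering.

```c
#include <stddef.h>

/* result = a * b, where a is n x m, b is m x p, all row-major 1D arrays.
 * The i-k-j order keeps the innermost accesses sequential in memory. */
void matmul_ikj(double *result, const double *a, const double *b,
                size_t n, size_t m, size_t p) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < p; j++)
            result[i * p + j] = 0.0;

    for (size_t i = 0; i < n; i++) {
        for (size_t k = 0; k < m; k++) {
            double aik = a[i * m + k];           /* reused across the j loop */
            for (size_t j = 0; j < p; j++)
                result[i * p + j] += aik * b[k * p + j];
        }
    }
}
```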
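OpenMP parallelization, sketched on the same element-wise add: a single pragma divides the loop's iterations across the available threads. This assumes compilation with OpenMP enabled (e.g. `-fopenmp`) and an OpenMP 3.0+ compiler for the unsigned loop variable; combining the pragma with the unrolled or AVX loop bodies above is the usual next step.

```c
#include <stddef.h>

/* Element-wise add with iterations split among the available threads. */
void add_omp(double *result, const double *a, const double *b, size_t n) {
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++)
        result[i] = a[i] + b[i];
}
```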