Code Optimization

Note: Please refer to the Appendix if you are unable to understand the terms and want to get basic insight about the topic.

Code Optimization

Code optimization informally refers to transforming a program into a better form, i.e., improving the run-time performance of the code without altering its functionality.

The memory footprint of the code itself is not a concern for most practical purposes; however, pressure on the instruction cache can affect speed. Improving run-time performance may involve reducing the number of instructions, replacing expensive instructions with cheaper ones, and applying vectorization, multi-threading, and parallelism.

After an efficient algorithm (in the Big-O sense) has been chosen, the code can be optimized further via execution profiling, i.e., measuring the hotspots that remain after the automatic optimization performed by modern compilers when appropriate compiler optimization flags are used.

Algorithmic Efficiency & Big-O Notation

In terms of algorithmic efficiency, a compiler can only improve constant factors, so the first step in optimizing your code is to select an efficient algorithm. Big-O notation is used to classify algorithms by their performance (processing time and/or memory requirements) as the input size n grows, i.e., it characterizes an algorithm by its growth rate. The sorting algorithm quicksort, with average complexity O(n log n), will outperform selection sort, with complexity O(n^2), for large inputs irrespective of how much the selection-sort code is tuned.
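As an illustration (a minimal sketch, not from the original text), replacing a hand-written O(n^2) selection sort with the C library qsort, which typically runs in O(n log n), matters far more than micro-optimizing the selection-sort loop:

#include <stdio.h>
#include <stdlib.h>

/* comparison callback required by qsort */
static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(void) {
    int data[] = {42, 7, 19, 3, 88, 1};
    size_t n = sizeof(data) / sizeof(data[0]);
    qsort(data, n, sizeof(int), cmp_int);   /* O(n log n) on average */
    for (size_t i = 0; i < n; i++)
        printf("%d ", data[i]);
    printf("\n");
    return 0;
}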

Compiler Optimization

The –O<level> flags and other optimization flags (including processor-specific flags) on a compiler's command line enable different levels of optimization. The documentation for the GNU, Intel, PGI, and other compilers installed on the HPC systems specifies the available flags. For example, for the Intel compiler, -O2 enables optimization for speed and is the recommended optimization level. At this level the compiler enables vectorization along with basic loop optimizations, inlining of intrinsics, and intra-file interprocedural optimization (IPO). The Profile Guided Optimization (PGO) flags of the Intel compiler use profiling to recognize hotspots and improve performance accordingly.
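For illustration, typical command lines look like the following (the source file name is hypothetical; consult each compiler's documentation for the exact flags available on your system):

# GNU compiler: optimize for speed and for the host processor
gcc -O2 -march=native myprog.c -o myprog

# Intel compiler: recommended optimization level plus host-specific vectorization
icc -O2 -xHost myprog.c -o myprog

# Intel Profile Guided Optimization: instrument, run, then recompile with the profile
icc -prof-gen myprog.c -o myprog && ./myprog
icc -prof-use myprog.c -o myprog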

The improvement in performance and/or code size, however, comes at the expense of compilation time and of the compiler's ability to produce debuggable code. Compilers incorporate most of the optimization techniques described in the section “Optimization Methods”. To understand these optimizations, you may also want to refer to the sections in the "Appendix".

Keep in mind that the compiler will not replace a bubble-sort algorithm with quicksort; algorithm selection remains the programmer's job.

Code Profiling

Code profiling helps identify algorithmic bottlenecks (hotspots) and, in the case of distributed code, communication overheads. Most of the execution time may be spent in a function that repeatedly calls other, slower functions, or there may be slow regions within the code itself. Loops are the most common source of hotspots because of their repetitive nature. Consider the loop below:

for (int i=0; i<4; i++){

    sum+=array[i];

}

It consists of 1 assignment, 4 compares, 4 increments, 3 branches, and 1 branch misprediction. A loop can also add extra dependencies (data and memory), even when the processor is able to unroll the loop internally for instruction-level parallelism.

The GNU toolchain uses gprof (compile with the -pg flag) for profiling, Intel offers Intel VTune Amplifier XE, and similar profiling tools exist for other compilers as well. The valgrind utility is a popular open-source profiling tool. TotalView is a parallel debugging and profiling tool. MATLAB provides profile and mpiprofile for profiling serial and parallel code respectively.
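A minimal profiling session with the GNU tools and valgrind might look like this (the program name is hypothetical):

# gprof: compile with instrumentation, run, then read the flat profile and call graph
gcc -O2 -pg myprog.c -o myprog
./myprog                      # produces gmon.out
gprof myprog gmon.out > profile.txt

# valgrind/callgrind: instruction-level profile without recompiling
valgrind --tool=callgrind ./myprog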

Parallelism

Bit-Level Parallelism

With the advancement of VLSI technology, processor word size has been increasing, which has improved performance. For example, a 16-bit processor can add two 16-bit integers faster than an 8-bit processor, which must split each 16-bit integer into an upper and a lower half and so incurs extra overhead.

Instruction Level Parallelism

It is achieved through instruction pipelining. Compilers and processor designs help overlap the execution of multiple instructions, or change the order of execution, so as to minimize the dependencies among different operations. The techniques of instruction-level parallelism are as follows:

Explicitly Parallel Instruction Computing (EPIC) uses multiple execution units (circuitry) so that multiple instructions can be executed in parallel. A superscalar architecture executes more than one instruction during a single clock cycle by dispatching multiple instructions, whereas a VLIW (Very Long Instruction Word) architecture allows programs to explicitly specify instructions that are to be executed simultaneously.

Out-of-Order Execution / Register Renaming allows multiple instructions to be executed out of order, with data dependencies taken care of by register renaming. Register renaming prevents the unnecessary serialization of program operations imposed by the reuse of registers by those operations.

Speculative Execution allows complete or partial execution of instructions before it is known whether that execution is actually needed. Branch prediction relies on speculative execution.

Vectorization

It is a special case of SIMD (Single Instruction, Multiple Data) as defined in Flynn's taxonomy, where a single vector or SIMD instruction operates on multiple data elements in parallel. Modern x86 chips have the Streaming SIMD Extensions (SSE) instruction set architectures. Auto-vectorization is performed automatically by compilers after enabling the appropriate compiler flags (-O2 or higher in the Intel compiler). In explicit vectorization, SIMD code is written manually for a specific application. The vectorizing compiler transforms the loop shown below:

for (int i=0; i<n; i++)

     c[i] = a[i] + b[i];

into a sequence of vector operations, as shown:

C1 = a1 + b1

C2 = a2 + b2

...

Cn = an + bn
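As an illustration of explicit vectorization (a sketch only, assuming single-precision float arrays whose length is a multiple of 4), the same loop can be written with SSE intrinsics, processing four elements per instruction:

#include <xmmintrin.h>   /* SSE intrinsics */

/* c[i] = a[i] + b[i], four floats at a time; n assumed to be a multiple of 4 */
void vec_add(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);            /* load 4 floats from a */
        __m128 vb = _mm_loadu_ps(&b[i]);            /* load 4 floats from b */
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));   /* add and store 4 results */
    }
}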

Threading/Multi-threading

This is thread-level parallelism (TLP), where threads are instantiated and executed independently. Multi-threading keeps the CPU busy by engaging multiple threads.

Simultaneous Multi-threading or Intel Hyper-Threading: It is based on superscalar CPU technology. Instructions from more than one thread can be executing in any pipeline stage at a time. Practical restrictions on chip complexity have limited the number of concurrent threads to two.

Temporal Multi-threading: In contrast to simultaneous multi-threading, temporal multi-threading can have only one thread in any given pipeline stage in a given cycle. In coarse-grained multi-threading, the main processor pipeline contains only one thread at a time and the processor must perform a rapid context switch (thread switch) before executing a different thread; e.g., Intel Itanium 2. In fine-grained multi-threading, the processor pipeline may contain multiple threads, with context switching occurring between pipeline stages; e.g., Sun UltraSPARC.

Multi-Core Parallelization: Process-level parallelism, a true parallelism, extends well beyond multi-threading and hyper-threading. Threads can be executed concurrently and independently on separate processors or cores. In Flynn's taxonomy, multiprocessors are categorized as MIMD (Multiple Instruction, Multiple Data).

In tightly coupled systems, the processors are connected at the bus level and either have access to a central shared memory (Symmetric Multi-Processor (SMP) or Uniform Memory Access (UMA)) or participate in a memory hierarchy with both local and shared memory (Non-Uniform Memory Access (NUMA)). In the UMA model, all processors share the physical memory uniformly; each processor may, however, use a private cache. In the NUMA model, memory access is spatial, i.e., it depends on the memory location relative to the processor. Loosely coupled systems use multiple standalone computers, possibly running different operating systems, interconnected via a high-speed communication infrastructure (e.g., clusters).

Loop Scheduling: Parallelizing compilers can perform loop scheduling, dividing a loop into multiple parts that run concurrently on multiple processors. In MATLAB, the task is divided among different MATLAB workers (see the HPC MATLAB GUIDE).
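A common way to express this kind of loop-level threading in C is an OpenMP directive; the sketch below (illustrative only) divides the iterations of a loop among the available threads:

#include <omp.h>

void scale(double *a, const double *b, int n) {
    /* iterations are split among threads; each iteration must be independent */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        a[i] = 2.0 * b[i];
}

Such code is compiled with an OpenMP flag, e.g., gcc -fopenmp or icc -qopenmp.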

Accelerated Computing

It involves hardware accelerators that outperform general-purpose CPUs on certain software functions. It is intended for computationally intensive code whose hotspots require a large number of algebraic operations, such as medical imaging (ultrasound, CAT scanning, MRI), real-time modeling and simulation in financial decision making, and complex algorithms (CODECs, data compression, warehousing, and encryption).

Intel Xeon Phi and GPUs (CUDA) are popular hardware acceleration options that are revolutionizing HPC, providing a large number of processing cores and on-board memory working in conjunction with the CPUs. Both Xeon Phi and GPU devices are cards that interface with the CPU through the PCI-Express bus. NVIDIA GPUs are designed to solve problems expressed in the Single-Instruction, Multiple-Threads (SIMT) model. The GPU is the more specialized architecture, with a steeper learning curve, whereas the Intel Xeon Phi is a less specialized architecture accessible to parallel programmers familiar with Pthreads, OpenMP, MPI, CUDA, and OpenCL.

Not all codes are a good fit for hardware accelerators. A code is a candidate for significant performance enhancement when its hotspots are highly data-parallel and the computation is large enough to amortize the cost of moving data to and from the device.

In some cases, accelerator-specific programs or directives included in a code can be interpreted by compilers (OpenACC, OpenMP, PGI). There are also accelerator-enabled libraries, such as CULA. Moreover, some applications, such as MATLAB, Jacket, and GROMACS, are accelerator-aware.
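As an illustrative sketch (assuming an OpenACC-capable compiler such as PGI), a directive can offload a loop to the accelerator without rewriting it in CUDA:

/* compile with an OpenACC-capable compiler, e.g. pgcc -acc */
void saxpy(int n, float a, const float *x, float *y) {
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];   /* each iteration runs independently on the device */
}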

Optimization Methods

Most of the optimization methods described below are taken care of by compilers, but they are still worth reading about in order to understand the techniques.

Dead Code Elimination or Dead Code Stripping: Dead code is unreachable code that can never be executed, or code that only affects dead variables. Consider the code snippet below:

int dead_code() {
    int alive_var = 32;
    int dead_var = 10;          /* dead variable: never used */
    int shift_val;
    shift_val = alive_var << 2;
    return shift_val;
    dead_var = 32;              /* unreachable: follows the return */
    return 0;
}

Here, the local variable "dead_var" is a dead variable: it is never used inside the function "dead_code()". Also, the re-assignment of "dead_var" after the unconditional return can never be executed, making it dead code. Compilers can recognize dead code, reclaim its storage space, and eliminate its initialization.

Common Sub-Expression Elimination: It reuses the value of an expression in another expression so as to avoid recalculating it, provided the value does not change (and need not be re-loaded from memory) between the two uses. The expressions:

X = sqrt(A) * cos(a);

Y = sqrt(A) * sin(b);

can be re-written as:

B = sqrt(A);

X = B*cos(a);

Y = B*sin(b);

Loop-Invariant Code Motion, or hoisting of invariants out of loops: Loop-invariant code consists of expressions that are independent of the loop variables and can therefore be moved outside the loop to be executed only once. Consider the code snippet below:

for (int i = 0; i < n; i++) {

    x = y + z;

    a[i] = k * i + x * x;

}

Here, the instruction "x = y + z" is executed in every iteration even though it is not necessary. It can safely be moved out of the loop as shown below:

x = y + z;

x_2 = x * x; //Expression Elimination

for (int i = 0; i < n; i++) {
    a[i] = k * i + x_2;
}

Loop Unwinding or Unrolling: It is a loop transformation that reduces the number of iterations by replicating the loop body several times, at the expense of the binary (executable) size. The following loop is transformed from:

for (int i = 0; i < 100; i++) {

    add(i);

}

to:

for (int i = 0; i < 100; i+=5) {

    add(i);

    add(i+1);

    add(i+2);

    add(i+3);

    add(i+4);

}

Here, the number of iterations has been reduced from 100 to 20, which reduces the number of jumps and conditional branches and so decreases the loop overhead.

Jump Removal/Branch Elimination: Jumps and branches are expensive, so eliminating or minimizing them can improve performance. There are two branches, to L1 and to L2, in the first code fragment shown below:

goto L1

f()

L1:

goto L2

It is replaced with only one branch to L2; the statement "goto L1" is rewritten to jump directly to L2, as shown below:

goto L2

f()

L1:

goto L2

Constant Folding: The compiler can evaluate expressions with constant operands at compile time and replace them with the resulting constant, which improves run-time performance and reduces code size. The expression "15 * 4" in the return statement below:

F (){

    return 15 * 4;

}

is replaced by the constant 60 (= 15 * 4), as shown:

F (){

    return 60;

}

Inlining: This is another place to look for optimization: the function call is replaced by the body of the function definition. Inlining is done automatically by many compilers but can also be requested manually via compiler directives using a keyword such as inline. An inlined call is faster due to (i) elimination of branches, which improves the cache's locality of reference, (ii) removal of the call and return instructions along with the prologue and epilogue code introduced into every function by the compiler, and (iii) the opportunity for additional intra-procedural optimization. However, performance can also degrade: multiple copies of a function increase the code size, which may no longer fit in the cache, resulting in cache misses. The prologue is the code at the beginning of a function that prepares the stack and registers for use within the function, whereas the epilogue restores the stack and registers to their state before the function was called.
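A small illustration (a sketch only): with the inline hint below, the compiler may substitute the body of square() at each call site instead of emitting a call:

static inline int square(int x) {
    return x * x;                       /* body substituted at the call site when inlined */
}

int sum_of_squares(int a, int b) {
    return square(a) + square(b);       /* no call/return overhead after inlining */
}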

Trace Scheduling: This is a compiler optimization in which the compiler rearranges its generated machine instructions for faster execution, improving program performance. A trace is a likely sequence of instructions, including branches but excluding loops, that is executed for a given set of input data.

Arithmetic Simplification: It is about representing algebraic expressions in their simplest form. For example, a[0] = i + 0 can be simplified to a[0] = i. The compiler attempts to find the form with the smallest set of instructions.

Tail Call Optimization: It avoids allocating a new stack frame for a function call when the calling function simply returns the value it gets from the called function. It is commonly applied to tail recursion (a tail call to the same subroutine), where a recursive function can then run in constant stack space. The same amount of stack is used for larger inputs as well, which avoids stack overflow. Consider the recursive factorial function, in which the procedure calls itself last in the control flow:

fact (x){
    if (x == 1) return x;
    else return x * fact(x - 1);
}

Here, the compiler will evaluate fact(3) as:

3 * fact(2) -> 3 * (2 * fact(1)) -> 3 * (2 * 1) = 6

Now, let's implement tail recursion as showed:

tail_fact (x, total = 1){
    if (x == 0) return total;
    else return tail_fact(x - 1, total * x);
}

The evaluation of tail_fact(3) proceeds as:

tail_fact(3, 1) -> tail_fact(2, 3) -> tail_fact(1, 6) -> tail_fact(0, 6) -> 6

Here, the calling function (tail_fact) directly returns the value it gets from the called function, so the compiler can reuse the same stack frame for each call.

Iteration Order Reversal or Loop Termination: Counting up from a low to a high number requires a comparison against the limit at the end of each iteration. Branch-on-zero instructions, used when a loop counts down to zero, are computationally cheaper because the explicit comparison is eliminated. Consider the factorial function written with an incrementing loop:

int fact_incr(int n){
    int fact = 1;
    for (int i = 1; i <= n; i++)
        fact *= i;
    return (fact);
}

It can be implemented with a decrementing loop as shown:

int fact_decr(int n){
    unsigned int i, fact = 1;
    for (i = n; i != 0; i--)
        fact *= i;
    return (fact);
}

When you compare the compiled code, the assembly for the incrementing loop contains ADD and CMP instructions, which are replaced by a single SUBS in the decrementing loop because the comparison with zero is folded into the subtraction. Also, the variable "n" does not need to be kept live across the loop (there is no need to compare against it) when decrementing. It is also better to use a counter of type unsigned int to avoid the overhead of handling signed numbers.

Induction Variable Elimination: Induction variables are variables that are increased or decreased by a fixed amount on every iteration of a loop, or that are a linear function of another induction variable. The compiler can optimize the code by replacing them with simpler computations. The multiplication involving the induction variables i and k in the code below:

for (i=0; i < 10; ++i) {

    k = 501 * i;

}

can be replaced by a much simpler addition, as shown:

k = -501;

for (i = 0; i < 10; ++i) {

    k = k + 501;

}

Loop Fission/Distribution: It is a compiler optimization in which a loop is broken into multiple loops over the same index range so that multiple processors can be engaged on the separate tasks. It also makes better use of locality of reference. The computations assigning arrays a and b are distributed from:

for (int i = 0; i < 100; i++) {

    a[i] = 0;

    b[i] = 0;

}

to:

for (int i = 0; i < 100; i++) {

    a[i] = 0;

}

for (int i = 0; i < 100; i++) {

    b[i] = 0;

}

Loop Fusion/Combining: It is the opposite of loop fission and is applicable on architectures where the data locality within each loop is not significant and no multiprocessing is available.

Prefetching: Prefetching brings data that the processor is likely to need in the near future into the cache ahead of time. Hardware prefetching requires no compiler support; the Pentium 4 processor, for example, has an automated hardware prefetcher, so software prefetching is often unnecessary where the hardware prefetcher is more efficient. Software prefetching works in tandem with the hardware: when bus bandwidth is available, the processor can start loading memory into the cache before it is needed, and different prefetch instructions can be used depending on the algorithm. To prefetch the data needed 16 loop iterations into the future on a Pentium 4, the code looks as below:

for (i = 0; i < 1000; i++) {
    X = fn(array[i]);
    _mm_prefetch((char *)&array[i + 16], _MM_HINT_T0); /* hint: prefetch into all cache levels */
}
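On the GNU toolchain, a portable equivalent is the __builtin_prefetch intrinsic; the sketch below assumes the same loop and array as above:

for (i = 0; i < 1000; i++) {
    X = fn(array[i]);
    __builtin_prefetch(&array[i + 16], 0, 3);  /* read prefetch (0), high temporal locality (3) */
}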

Memory Bank Optimization: Memory contention is caused by successive accesses to the same memory bank, because each access to a bank makes that bank unavailable for some number of clock periods. Each bank also has a limited bandwidth, which causes a large slowdown if the data elements being accessed all reside in the same bank. The number of banks is almost always a power of two. Refer to Appendix C "Data Storage & Memory Bank" for details.

Slowdowns caused by memory-bank conflicts can be prevented by increasing the array dimension beyond the power of two (array padding) and by loop interchange. There is also a hardware technique that solves the problem by distributing data across the banks using the Chinese Remainder Theorem.

Consider a code snippet below:

float a[100][128], b[100][128]; //Array dimension: 128 = 2^7, i.e. a power of 2

for (i=0;i<128;i++) /* Avoid this order */

    for (j=0;j<100;j++)

        a[j][i] = b[j][i] * 10;

"for loops" for i and j are interchanged below:

float a[100][128], b[100][128];

    for (j=0;j<100;j++) //loops i and j are interchanged

        for (i=0;i<128;i++)

            a[j][i] = b[j][i] * 10;

Alternatively, the array dimension is increased by 1 (i.e., 129 = 2^7 + 1) so that it is no longer a power of 2, as shown below:

float a[100][129], b[100][129];//2^7 is increased by 1

for (i=0;i<128;i++)

    for (j=0;j<100;j++)

        a[j][i] = b[j][i] * 10;

Instruction Combination: It is an optimization that combines two or more instructions into one. Consider the following code snippet, in which there are two x++ statements:

int x;

increment(){

    x++;

    x++;

}

They can be combined into one as shown below:

int x;

increment(){

    x+=2;

}

If Optimization: The fragmented IF statements shown below:

F(int value){

    if (value) f1();

    if (value) f2();

}

can be consolidated into one IF statement (note the braces, so that both calls remain conditional):

F(int value){
    if (value) {
        f1();
        f2();
    }
}

Register Allocation: It is the optimization process in which the compiler assigns a large number of target program variables to a small number of CPU registers. Since not all variables are in use at any particular time, some registers can be reused for more than one variable, which avoids accessing the significantly slower RAM. Partial registers, as in the Intel Pentium 4, effectively provide more registers, but the resulting register stalls have to be handled by the compiler. A register stall occurs when a write to a partial register is followed by a use of the full register, as shown:

mov ax, 20        ; writes only the lower 16 bits of EAX
mov variable, eax ; stalls: reads the full 32-bit EAX after the partial write

Instruction Scheduling: It optimizes for modern pipelined processors by avoiding stalls or bubbles in the pipeline; the compiler reorders instructions so that dependent instructions are separated by independent work, while making sure that the original semantics are preserved.

A bubble or pipeline stall is a delay introduced into the execution of an instruction because of a hazard; for example, in the classic RISC pipeline, data that only becomes available after the Memory stage of one instruction may be required as input by the Execute stage of the next instruction.

Data Alignment and Padding: It is an important issue for all programmers who use memory directly. Programmers are conditioned to think of memory as a simple array of bytes, but the processor does not read and write in byte-sized chunks; it accesses memory in 2-, 4-, 16-, or even 32-byte chunks. This chunk size is the memory access granularity, and it minimizes the number of accesses and helps performance. However, reading from an unaligned address increases the access count. Consider the 2-byte memory access granularity in Fig. 1. Reading from address 0 takes half the number of memory accesses with 2-byte granularity compared to a processor with 1-byte granularity, i.e., performance improves with larger memory access granularity. However, if there is a need to read from address 1, which does not fall evenly on the processor's memory access boundary, the processor has to do extra work to handle the unaligned access.

Fig. 1: Data Alignment Issue

To align the data, it may be necessary to insert some meaningless bytes between the end of one data structure member and the start of the next; this is called data structure padding. Recent compilers handle it automatically. Consider the C code snippet below on a 32-bit platform.

struct A{

char a;

int b;

char c;

} B;

Before padding, the size of the structure is 1 + 4 + 1 = 6 bytes. After the compiler inserts the gaps as shown below, the size of the data structure is 12 bytes (a multiple of 4).

struct A{

char a;

char gap_0[3]; //inserted by compiler for alignment of b

int b;

char c;

char gap_1[3]; //for alignment of the whole structure

} B;

However, the structure can be packed more tightly by declaring the fields in the order int, char, char, giving a total size of 4 + 1 + 1 + 2 (padding) = 8 bytes.
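A sketch of the re-ordered declaration (sizes assume a 32-bit platform with a 4-byte int, as above):

struct A_reordered {
    int  b;        /* 4 bytes, naturally aligned at offset 0 */
    char a;        /* 1 byte */
    char c;        /* 1 byte */
                   /* 2 bytes of tail padding -> sizeof(struct A_reordered) == 8 */
} B;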

Structure of Arrays over Array of Structures: To take full advantage of SIMD capabilities, replace an array of structures:

typedef struct{

float a, b;

int x,y;

} my_struct;

my_struct my_structures[100];

with Structure of Arrays:

typedef struct{

float a[100], b[100];

int x[100],y[100];

} my_struct;

my_struct my_structures;

Loop-Invariant Branches: A loop can be optimized if branches inside the loop are avoided, which also helps the compiler make use of SIMD. The if branch in the code below:

for (int i=0; i<size; i++){

    if (a == value1)

    …

    else if ..

    else ..

}

can safely be taken out of the loop (the loop is duplicated in each branch) as shown:

if (a == value1)

    for (int i=0; i<size; i++)

    … 

else if …

    for (int i=0; i<size; i++)

else …

    for (int i=0; i<size; i++)

Operations with Powers of 2: Operations can be optimized if they are implemented as bit shifts or bit masking. Consider a mod operation where 2 and 4 are powers of 2, i.e., 2^1 and 2^2. N mod 2 can then be carried out via a bitwise AND instead of an expensive division, as shown:

101 mod 2 -> 101 & (2 - 1) -> 01100101 & 00000001 = 1

101 mod 4 -> 101 & (4 - 1) -> 01100101 & 00000011 = 1 
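A small sketch of the same idea in C, including shifts for multiplication and division by powers of 2 (variable names are illustrative):

unsigned int n  = 101;
unsigned int r2 = n & (2 - 1);   /* n % 2  -> 1   */
unsigned int r4 = n & (4 - 1);   /* n % 4  -> 1   */
unsigned int m8 = n << 3;        /* n * 8  -> 808 */
unsigned int d8 = n >> 3;        /* n / 8  -> 12  */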

Appendix

A. Computer Architecture

Computer architecture describes the computer system by specifying its parts and their relations.

Von Neumann Architecture

The von Neumann architecture, or Princeton architecture, derives from a 1945 description by the mathematician and physicist John von Neumann and others in the First Draft of a Report on the EDVAC. It describes a digital computer with a processing unit consisting of an ALU (Arithmetic Logic Unit) and processor registers, a memory to store both data and instructions, a control unit containing an instruction register and a program counter, external mass storage, and I/O devices, as shown in Fig. A1:

Fig. A1: Von Neumann Architecture

The von Neumann architecture, as a stored-program computer model, stores program instructions and data in the same read-write, random-access memory (RAM). This differs from program-controlled computers, which were programmed by setting switches and inserting patch leads to route data and control signals to the various functional units.

Harvard Architecture

It is also a stored-program architecture. Unlike the von Neumann architecture, it has separate memories for program and data, as shown in Fig. A2.

Fig. A2: Harvard Architecture

Processors such as the C55x DSP and the AVR and PIC microcontrollers implement such separate signal pathways for performance reasons, i.e., the CPU can read an instruction and perform a data memory access at the same time. In a modified Harvard architecture, a program can be loaded from disk storage as data and then executed.

Example: Basic Pentium Processor Architecture (von Neumann architecture with a CISC CPU design)

Let's examine the basic building blocks of the Pentium processor (Fig. A3) so as to locate the areas relevant to optimization.

Fig. A3: Basic Pentium Architecture

Buses: They are the communication paths. The Control Bus is used by the CPU to communicate with other devices within the computer; it carries commands from the CPU and returns status signals from the devices. The Address Bus specifies the physical address of the device or memory location the CPU or a DMA-enabled device is communicating with; a 32-bit address bus can address 2^32 memory locations. The Data Bus carries the actual data to be processed.

An N-bit processor uses N bits for its datapath width, integer size, and memory address width. N is the word size that is handled as a unit by the instruction set or by the hardware of the processor. For example, N = 64 for Intel x86_64, so each register can store 2^64 (~1.8 x 10^19) different values and the processor can address 2^64 bytes of memory. A processor can have external data and address buses of different sizes from its registers.

Caches & Buffers: They provide smaller but faster temporary storage. The Branch Target Buffer (BTB) is a fixed-size hash table or cache that maps from the Program Counter (PC) to a branch target address; it reduces the performance penalty of branches by predicting the path of the branch and caching information about it, such as the PCs of prior branches, the PCs of the corresponding targets, and prediction information. The Translation Lookaside Buffer (TLB) is a cache of the page table that speeds up virtual address translation; the page table keeps track of where virtual pages are stored in physical memory. The Prefetch Buffer is a data buffer on DRAM chips that allows quick access to multiple data words: when a memory access occurs to a row, the buffer grabs a set of adjacent data words on that row, assuming the CPU will access them next. A 64-bit CPU accessing a 16-bit DRAM chip (n = 16, the I/O width of the memory chip) uses a 4n (e.g., 4x16) prefetch buffer.

The processor cache - data cache and instruction cache - stores copies of recently and frequently used data from main memory so as to reduce the average time to access memory. The data and instruction caches differ somewhat in circuitry: the data cache must handle both reads and writes, whereas the instruction cache deals only with reads and can carry annotations for the decoders, such as where the next instruction starts.

The Pentium 4 has L1, L2, and L3 caches (the L3 cache is typically for servers and optional). The level 1 (L1) cache is used for data, with a trace buffer for instructions; the trace buffer stores traces of instructions that have already been fetched and decoded. The second-level (L2) unified cache is used for both instructions and data, and the L3 cache is shared by all cores. On a cache miss, the processor fetches a 64-byte chunk into the cache, expecting the extra memory to be used shortly based on spatial and temporal locality; one transaction of 64 bytes is much faster than 64 transactions of 1 byte each.

Cache misses occur as (i) compulsory loads, (ii) conflicts, and (iii) capacity misses. A compulsory load happens when data is loaded into the cache for the first time; this can be reduced but is inevitable. It is difficult to account for all the memory a program uses due to function calls, stack usage, OS tasks, and so on; still, changing the application to access less memory is the way to reduce the number of compulsory cache misses. If the cache is smaller than the working set of algorithmic data, data that was initially in the cache has to be re-loaded (a capacity miss); it is therefore better to operate on smaller chunks of data that do fit in the cache, a technique called strip-mining. Cache conflicts occur because of the organization of the cache: a cache row can hold only specific memory addresses, which is a frequent problem in image-processing algorithms. An organized data structure, such as a contiguous allocation for all elements (the zeros function in MATLAB), can result in better cache efficiency. If a variable (e.g., of type double) is split across two cache lines, there can be a data alignment issue; casting variables as shown below can cause such an issue.

double* array = (double*)malloc(48*sizeof(double)); //Casting variable
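As an illustration of strip-mining (a sketch only; the block size is a hypothetical cache-friendly value), a long loop can be processed in cache-sized blocks so that the second pass over each strip finds its data still cached:

#include <stddef.h>

#define BLOCK 4096              /* elements per strip, chosen to fit in the cache */

void process(double *array, size_t n) {
    for (size_t start = 0; start < n; start += BLOCK) {
        size_t end = (start + BLOCK < n) ? start + BLOCK : n;
        /* first pass over the strip: data is brought into the cache */
        for (size_t i = start; i < end; i++)
            array[i] *= 2.0;
        /* second pass over the same strip: data is likely still cached */
        for (size_t i = start; i < end; i++)
            array[i] += 1.0;
    }
}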

Processor Registers: They are a small amount of storage addressed differently from main memory and accessible much faster, sitting at the top of the memory hierarchy. They are like variables built into the processor, which makes processing faster and cleaner. Processor registers hold data loaded from memory for processing and hold processed data temporarily before it is stored back to main memory (RAM), which is implicitly accessed via one or more cache levels.

There are different categories of registers. Data, address, general-purpose, integer, floating-point, vector, control, and special-purpose registers are user-accessible, whereas instruction registers (registers used in fetching instructions from RAM) and internal storage registers are not accessible to programs.

In Pentium 4 processor, there are 8-bit, 16-bit, and 32-bit size processor registers organized as follows:

32-bit General Purpose Registers (GPR) – EAX, EBX, ECX, EDX

Each GPR can be split into its least significant 16 bits (AX, BX, CX, DX), which can further be split into upper (denoted by H) and lower (denoted by L) halves of 8 bits each, as shown in Fig. A4 for EAX.

Fig. A4: General Purpose EAX registers; same holds for other register series

Segment Registers

They hold the segment address of various items when using multi-segment programming.

Indexes & Pointers

They cover indexes, pointers, and the offset part of an address. The registers with an “E” prefix can only be used in protected mode.

EFLAGS Register

It is a 32-bit register that holds the state of the processor, as shown below. It is modified by many instructions, and each bit holds a status flag.

Bit    Label  Description
---------------------------
0      CF     Carry flag
2      PF     Parity flag
4      AF     Auxiliary carry flag
6      ZF     Zero flag
7      SF     Sign flag
8      TF     Trap flag
9      IF     Interrupt enable flag
10     DF     Direction flag
11     OF     Overflow flag
12-13  IOPL   I/O privilege level
14     NT     Nested task flag
16     RF     Resume flag
17     VM     Virtual 8086 mode flag
18     AC     Alignment check flag (486+)
19     VIF    Virtual interrupt flag
20     VIP    Virtual interrupt pending flag
21     ID     ID flag

Instruction Decode: The decoder interprets the instruction held in the instruction register during the decode stage of the pipeline.

Microcode ROM: Microcode consists of hardware-level instructions that implement circuit-based operations. It is stored in ROM (Read-Only Memory) or EPROM (Erasable Programmable ROM) and hence cannot easily be modified by ordinary programmers.

Units: Different units have specific functions. The Central Processing Unit (CPU) executes instructions by performing the basic arithmetic, logic, and I/O operations. The typical components of a CPU are the ALU (Arithmetic Logic Unit), the FPU (Floating Point Unit), and the Control Unit (CU). The ALU is a digital circuit that performs integer arithmetic and logical operations. The FPU, also called the math co-processor, carries out floating-point operations. Both the ALU and the FPU have their own register files containing arrays of processor registers; the FPU also keeps status information (e.g., iov: integer overflow) and control information (e.g., dyn: dynamic rounding) in its control registers. The CU provides the timing and control signals for the operation of the processor, controlling communication and co-ordination with I/O devices; it also reads and interprets instructions. The Bus Unit connects the major components of the processor.

Barrel Shifter: It is a digital circuit that can shift a data word by a specified number of bits in one clock cycle.

B. CPU Design or Instruction Set Architecture

It deals with the programmer-visible part of the digital computer, such as data types, instructions, registers, addressing modes, memory architecture, interrupt and exception handling, and external I/O. It specifies the op-codes (operation codes) for the various operations (add, move, store, load, jump, etc.) along with the format and specification of operands. Operands can be implicit, empty, register values, stack values, memory values, I/O ports, and others, depending on the CPU design.

Complex Instruction Set Computer (CISC): As the name suggests, each instruction can perform several operations and can take multiple clock cycles, i.e., the number of operands and cycles varies. Examples are Intel x86, Motorola 68k, etc.

Reduced Instruction Set Computer (RISC): It is designed with the insight that higher performance can be achieved by simplifying the instruction set so that an instruction can complete every clock cycle, i.e., by making good use of pipelining. Examples are ARM, PowerPC, SPARC, etc.

Instruction Encoding/Formatting

The operation to be performed by an instruction is represented in a bit format whose size depends on the variety of addressing modes and on the CPU design. These bits are converted into datapath signals by the control unit. A CISC processor provides a variety of addressing modes, allowing constants and other operands to reside either in memory or in registers, so the instruction size depends on the size of the operands. The general instruction formats for register-register and memory-memory addressing modes are shown in Fig. A5.

Fig. A5: (a)Register-Register Addressing (b)Memory-Memory Addressing

Except for the load and store instructions that access memory, all other operations in RISC are register-register operations; this feature enables instructions to complete every cycle. CISC is motivated by supporting efficient access to complex data structures and providing flexibility in accessing operands. As an example, the Pentium provides based-indexed addressing with a scale factor to access complex data structures such as multi-dimensional arrays. The general instruction format for the Intel Pentium is shown in Fig. A6.

Fig. A6: General Pentium Instruction Format; the Instruction Opcode (e.g., jmp, load, store) can be 1 or 2 bytes; Mod-R/M specifies the addressing “Mode (M) – Register (R)/Memory (M)” and the instruction operand size, if any; the optional Scale-Index-Base (SIB) byte is used for complex data structures such as multi-dimensional arrays; Displacement specifies a memory address displacement; Immediate represents the constant data if the instruction has an immediate operand.

Example: Multiplication Operation

Let's consider the execution of a multiplication instruction to differentiate the RISC and CISC designs. If you are multiplying data a and b that reside in memory and storing the result back to memory, as shown in Fig. A7, RISC uses multiple instructions to load, multiply, and store the data.

Fig. A7: Execution of Multiplication

RISC Instructions:

LOAD A, a ; load data “a” into the register A

LOAD B, b ; load data “b” into the register B

PROD A, B ; Multiply “a” and “b”

STORE 5:3, A; Store the product in the memory

Here, “a” and “b” are the data located at memory locations 2:2 and 5:3. The product is stored back at memory location 5:3.

In the CISC architecture, you need just one instruction; it implicitly performs the loads and stores, acting like a high-level-language statement translated directly into assembly.

MULT a, b

So, in CISC, the code can be significantly shorter, the compiler has little work to do to translate the code into assembly, and less RAM is required to store instructions, at the expense of more hardware transistors for the control logic. RISC, in contrast, needs fewer transistors, so as VLSI technology improves a single chip has room for more registers, on-chip cache, and additional processing units. The separate load and store instructions in RISC also reduce the amount of work, because the value of an operand held in a register can be re-used for another operation rather than re-loaded.

Let’s compare RISC and CISC based on the Performance equation:

Time/program = time/cycle x cycles/instruction x instructions/program

RISC tries to reduce the cycles per instruction at the cost of more instructions per program, whereas CISC does the opposite.

C. Data Storage & Memory Bank

Data is stored in memory according to endianness, a convention for ordering the bytes of a data word in computer memory. A big-endian system stores the most significant byte of a word at the smallest address, whereas a little-endian system stores the least significant byte there, as shown in Fig. A8. Here the word (e.g., a 32-bit integer) consists of 4 bytes; each memory location can hold only one byte, so the word is stored in 4 consecutive memory locations a, a+1, a+2, and a+3, in an order that depends on big- or little-endianness.

Fig. A8: Endian Representation of data storage in a memory

A memory bank (Fig. A9) is a logical unit of storage. It provides independent access to data via separate address and data lines, which is useful for multiprocessor systems, I/O, and CPU caches. In SDRAM (Synchronous Dynamic RAM) and DDR (Double Data Rate) SDRAM, a bank consists of multiple rows and columns of storage units, usually spread across several chips. A quantity of memory wide enough to match the bit width of the data bus is called a bank of memory, and in a single read or write operation only one bank is accessed; the number of bits in a row of a bank therefore matches the memory bus width of a single channel.

Fig. A9: Multi-bank Architecture – banks, rows and columns; the requested row is activated and is copied to the row buffer of the corresponding bank

Many chips are combined on a memory module, as shown in Fig. A10, to increase the word width.

Fig. A10: Memory Module

Let's consider how array data is stored in a memory bank. Row-major order (C, C++, etc.) and column-major order (FORTRAN, MATLAB, etc.) describe methods for storing multidimensional arrays in linear memory. The data in a 2x3 array (1 2 3; 4 5 6) are laid out contiguously as 1 2 3 4 5 6 in row-major storage and as 1 4 2 5 3 6 in column-major storage. Adjacent elements in a column of the row-major layout are 3 words apart (for the 2x3 array), i.e., the stride of the array is 3 words. Accessing array elements that are contiguous in memory is usually faster due to caching.
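A short sketch of why loop order matters in a row-major language like C (array sizes are illustrative): the first function walks memory contiguously, while the second jumps by a whole row on every access.

#define ROWS 1000
#define COLS 1000

void zero_rowmajor(float a[ROWS][COLS]) {
    /* cache-friendly in C (row-major): the inner loop walks contiguous memory */
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            a[i][j] = 0.0f;
}

void zero_colmajor(float a[ROWS][COLS]) {
    /* cache-unfriendly: consecutive accesses are COLS floats (one full row) apart */
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            a[i][j] = 0.0f;
}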

If a program declares an array float a[200] and element a[0] happens to be in bank n, then a[1] is in bank n+1, and so on. When adjacent addresses are accessed, each bank in turn services a request, resulting in good utilization. In a two-dimensional array, however, the column elements a[0][0], a[1][0], a[2][0], ... are a full row apart, and when the row length is a multiple of the number of banks they all fall in the same memory bank.

 D. Instruction Pipelining

Instruction pipelining allows the overlapping execution of multiple instructions on the same circuitry, in stages, optimizing instruction throughput. In the 5-stage pipeline shown in Fig. A11, five instructions are in different stages at different clock cycles.

Fig. A11: 5-stage Pipeline in RISC; IF: Instruction Fetch, ID: Instruction Decode, EX: Execute, MEM: Data Memory Access (read), WB: Register Write Back

Reducing data dependencies is an important part of optimization, besides keeping the pipeline stages as busy as possible. The stages in the original Pentium processor are Prefetch, Decode1, Decode2, Execute, and Write Back, as shown in Fig. A12.

Fig. A12: Pipeline stages of original Pentium Processor

Instruction Fetch/Prefetch: The instruction is fetched from the cache so that it is ready for the next clock cycle.

Decode: In this stage the instruction is decoded into opcodes and data. In the Pentium 4, the decoders are ordered Decoder0 -> Decoder1 -> Decoder2. On every clock cycle, the decoders try to decode three instructions, which succeeds only if the three instructions (one complex and two simple) match the capability of the decoders, i.e., three instructions that schedule as 4-1-1 can be decoded in one clock cycle. Here lies an opportunity for optimization by re-arranging instructions: the unpairable (NP) and pairable (UV) instructions are categorized first, and the instructions issued on the U-pipe are then paired with suitable instructions in the V-pipe. The Intel optimizing compiler can schedule code for both Pentium III and Pentium 4 processors, providing efficient instruction decoding.

Execute: In this stage, the instructions that are ready to execute are selected and executed in the fastest possible order, using the available dispatch ports connected to the execution units, as shown in Fig. A13.

Fig. A13: Dispatch ports and execution units in the Pentium 4; clock x2 means the unit can execute two micro-instructions per clock cycle; the CPU can execute up to 7 micro-instructions simultaneously (ports 0 and 1); the two double-speed ALUs indicate that the Pentium 4 is optimized for simple operations.

Port 0 is used during store operations to send the data to be stored to the address generated by either the ALU or the FPU, depending on the data. Port 1 can execute simple instructions even while the FPU is busy. Ports 2 and 3 are dedicated to memory operations so as to hide memory latency, which can be much longer depending on the state of the cache.

The execution stage can therefore be optimized with a good blend of instructions that keeps all the ports busy through low dependency. The processing speed of instructions is specified by instruction latency and instruction throughput. Instruction latency is the number of clock cycles required to complete one instruction once its inputs are ready and execution begins. Instruction throughput is the number of clock cycles the processor must wait before starting the execution of an identical instruction. Instruction performance for the Pentium 4 is tabulated in Table 1.

Table 1: Approximate Instruction Performance in Pentium 4

Consider the operation, 

a = w*x*y*z

Referring to Table 1, w*x and y*z are both integer multiplications taking 15 clock cycles each (instruction latency); however, the second multiplication can start after a 5-clock-cycle delay (instruction throughput). The third operation, wx*yz, has to wait for both previous operations to complete and can therefore start only after 20 clock cycles, giving a total of about 35 clock cycles.

Now, consider the C code fragment that has dependencies:

a= 0;

for (i = 0; i < 1000; i++)

    a += buffer[i];

Here, although the increment "i++" and "a += buffer[i]" can occur at the same time, each subsequent iteration depends on the previous one (the updated value of a). However, more operations can run in the same clock cycle, reducing dependencies, if the loop is split as shown.

a = b = c = d = 0;

for (i = 0; i < 250; i++) {

    a += buffer[i];

    b += buffer[i+250];

    c += buffer[i+500];

    d += buffer[i+750];

}

//collect

a = a + b + c + d;

Here, more instructions are finishing per clock. However, the processor may not be running at its capacity.

Write Back: The results are written back into the register file, and then to the memory if required. The finished instructions are removed from the instruction pool.

Branching & Branch Prediction

Branching requires the processor to handle instructions out of order depending on the branch outcome, i.e., on whether particular instructions should be fetched and decoded. To avoid wasting time, branch prediction is implemented. Branches without randomness can be predicted by the processor ahead of time. For example, in the code snippet below:

for (int a=1; a< 100; a++)

    if (a % 2 == 0) even(a); else odd(a);

the processor executes the even(a) and odd(a) calls alternately, and the CPU's branch prediction algorithm can predict even complicated patterns like this. However, if a coin flip (randomness) is introduced into the code as shown:

for (int a=1; a< 100; a++)

    if (flip_coin() == 0) even(a); else odd(a); // 0 -> head (even); 1 -> tail (odd)

the branch becomes unpredictable and branch prediction itself becomes an overhead. So, for better performance, try to avoid random branches. A profiling tool can be used to identify the mis-predicted branches.

References: