Compiler - A compiler is a program that translates an entire high-level language program (the source code) into machine code before execution. The output of a compiler is a complete machine code program that can then be run.
Interpreter - An interpreter translates and executes a program one statement at a time. An interpreter produces no separate machine code program; its output is the effect of the program being executed line by line.
Assembler - Translates a program written in assembly language into machine code. The input to an assembler is assembly language instructions.
The compilation process is a set of stages that turns source code into executable object code.
The principal stages involved in translating a high-level language program into machine code are as follows:
Lexical analysis is the first stage of the compilation process. The source code created by the programmer is tokenised for translation into executable code.
Non-program elements, e.g. comments and unneeded spaces, are removed
It is good programming practice to indent code blocks and use white space to improve a program's readability. Comments should be used to help explain complex parts of the code. While white space and comments are helpful for the human reader, they are not necessary for the executable code, so the compiler removes them during lexical analysis.
Keywords, constants and identifiers are replaced by 'tokens'
Once the non-program elements have been removed, the characters are read and each string is analysed. A line of code such as user_name = "gwen" is analysed as follows:
Assign the token identifier to user_name
Assign the token operator to =
Assign the token literal to "gwen"
The process can be represented in a table (the individual items are referred to as lexemes):
Lexeme       Token        Pattern
user_name    Identifier   Letter followed by digits or letters
=            Operator     =
"gwen"       Literal      Any string between a pair of single or double quotes
Identifiers are checked against sets of rules. For example, they may not be allowed to start with a number or contain certain characters. Reserved words, such as print, can only be used as keyword tokens.
A symbol table is created which holds the addresses of variables, labels and subroutines
As part of the lexical analysis stage, a symbol table is produced. A symbol table is used by the compiler to keep track of all of the identifiers that have been declared in the program.
Here is some example code, a short fragment invented for illustration, in which two variables are declared:
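    int age = 52;
    string user_name = "gwen";

For this fragment, the symbol table might record an entry for each identifier (the addresses shown are invented for illustration):

Identifier    Data type    Address
age           int          1000
user_name     string       1004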
The exact information stored in the symbol table varies by implementation. Other data that might be stored about each identifier could include its size, its location in memory, and whether or not it is static.
Some compilers implement a global symbol table and separate individual symbol tables for storing identifiers within a particular scope, such as within a subroutine. In this case, the symbol table may take a hierarchical structure.
Syntax analysis is the compilation stage that immediately follows lexical analysis. The stream of tokens produced during lexical analysis is passed to the syntax analyser, which checks that the tokens are in the correct order and that they follow the rules of the language.
Tokens are checked to see if they match the spelling and grammar expected, using standard language definitions e.g. BNF-type rules
This is done by parsing each token to determine if it uses the correct syntax for the programming language.
If syntax errors are found, error messages are produced; if no errors are found, the compilation process continues
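As a sketch of this checking, the assignment statement tokenised earlier could be described by a simplified BNF-type rule:

    <assignment> ::= <identifier> "=" <literal>

A parser can then confirm that the token stream fits this rule (the rule and the code below are simplified for illustration):

    using System;

    class ParserSketch
    {
        static void Main()
        {
            // Token stream produced by lexical analysis for: user_name = "gwen"
            string[] tokens = { "Identifier", "Operator", "Literal" };

            // Check the stream against <assignment> ::= <identifier> "=" <literal>
            bool valid = tokens.Length == 3
                      && tokens[0] == "Identifier"
                      && tokens[1] == "Operator"
                      && tokens[2] == "Literal";

            Console.WriteLine(valid
                ? "No syntax errors: compilation continues"
                : "Syntax error: error message produced");
        }
    }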
Variables are checked to ensure that they have been properly declared and used
Variables are checked to ensure they are of the correct data type, e.g. real values are not being assigned to integers
Operations are checked to ensure that they are legal for the type of variable being used, e.g. attempting to store the result of a division operation, which may produce a real value, in an integer variable would be reported
During semantic analysis, expressions may be converted into Reverse Polish Notation (postfix form), an intermediate representation in which the order of operations is explicit and no brackets are needed; for example, a + b * c becomes a b c * +
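A minimal sketch of this conversion, using the shunting-yard algorithm and assuming single-letter operands and only the operators + - * / (a simplification for illustration):

    using System;
    using System.Collections.Generic;

    class RpnSketch
    {
        static int Precedence(char op) => (op == '*' || op == '/') ? 2 : 1;

        // Converts an infix expression such as "a+b*c" into Reverse Polish Notation.
        static string ToRpn(string infix)
        {
            var output = new List<char>();
            var operators = new Stack<char>();

            foreach (char c in infix)
            {
                if (char.IsLetter(c))
                {
                    output.Add(c); // operands pass straight to the output
                }
                else if (c == '+' || c == '-' || c == '*' || c == '/')
                {
                    // Pop operators of higher or equal precedence first
                    while (operators.Count > 0 && Precedence(operators.Peek()) >= Precedence(c))
                        output.Add(operators.Pop());
                    operators.Push(c);
                }
            }
            while (operators.Count > 0)
                output.Add(operators.Pop());

            return string.Join(" ", output);
        }

        static void Main()
        {
            Console.WriteLine(ToRpn("a+b*c")); // prints: a b c * +
        }
    }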
So far in the stages of compilation, lexical analysis has converted the source code into tokens, and syntax analysis has checked that those tokens have been combined according to the rules of the language. However, there is another stage, semantic analysis, which determines whether what has been written actually has a meaning within the language.
Let's say you are defining rules for constructing grammatically correct sentences in English. One rule for a possible valid sentence construct is:
<pronoun> <verb> <plural noun>
This means that "I eat crackers" and "They play video games" are valid sentences in English. However, if you are a computer and only use syntax analysis to check validity, "You swim cats" and "We bite movies" are also valid, even though within the English language these sentences make no sense.
Semantic analysis can be used to determine whether the code is valid within a given context.
Here is an example in C# where two variables have been declared and a semantic error occurs: the assignment is syntactically valid, but the data types do not match.
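This illustrative program will not compile; the final assignment is rejected during semantic analysis:

    using System;

    class SemanticErrorSketch
    {
        static void Main()
        {
            int total;             // declared as an integer
            string name = "gwen";  // declared as a string

            // Syntactically this is a valid assignment, but semantically it is
            // invalid: a string value cannot be stored in an integer variable.
            total = name; // error: cannot implicitly convert type 'string' to 'int'
        }
    }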
Machine code is generated
Code optimisation may be employed to make the code more efficient / faster / less resource-intensive
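As a sketch of one common optimisation, constant folding, the compiler can evaluate a constant expression at compile time so that the calculation is not repeated every time the program runs:

    using System;

    class OptimisationSketch
    {
        static void Main()
        {
            // The programmer writes a readable constant expression:
            int secondsPerDay = 60 * 60 * 24;

            // During optimisation the compiler folds the constants, so the
            // object code is equivalent to: int secondsPerDay = 86400;
            Console.WriteLine(secondsPerDay);
        }
    }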
Code generation follows the stages of lexical analysis, syntax analysis, and semantic analysis. A separate program is created that is distinct from the original source code. The code generated is the object code, which is the binary equivalent of the source code. This is the executable version of the code, before linked libraries are included.
Here is a 'Hello World' program in C#:
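    using System;

    class Program
    {
        static void Main()
        {
            Console.WriteLine("Hello World");
        }
    }

Compiling this source code produces object code: a separate executable file containing the machine code equivalent of the program.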
Code generation is a major distinguishing feature between compilation and interpretation; interpreters do not produce a separate executable file.
Video Reference - Stages of Compilation