16.2 Translation Software

Files and Resources

Specification

  • show understanding of how an interpreter can execute programs without producing a translated version

  • show understanding of the various stages in the compilation of a program: lexical analysis, syntax analysis, code generation and optimisation

  • show understanding of how the grammar of a language can be expressed using syntax diagrams or Backus-Naur Form (BNF) notation

  • show understanding of how Reverse Polish Notation (RPN) can be used to carry out the evaluation of expressions

Contents

Interpreters Vs Compilers

Interpreters and compilers were looked at last year. Before we jump into the detail, a quick summary:

    • Interpreters go through code line by line, but do not produce any 'object code'. This means that to run the same program again, it must be interpreted once more. This is ideal for debugging code.

    • A compiler will pass the code through a series of steps to produce object code. This means that once a program has been compiled, it can be executed multiple times without the need of a translator.

There is a handy summary table of the different types of code produced by translators within section 1.5.4 here.

The key content for this section can be found on pages 294-295 of the textbook or in the 'L1 Translators' PowerPoint. Note that the parse tree diagram on page 295 is actually a syntax tree.

Some facts relating to a basic understanding are:

    • A compiler has a front-end analysis part and a back-end synthesis part (see textbook page 294).

    • An interpreter has the front-end part but not the back-end part.

    • The analysis is broken down into lexical, syntax and semantic stages.

    • In lexical analysis the code is tokenised and identifiers are recorded in the symbol table (a small sketch follows this list).

    • The results of parsing (syntax analysis) are recorded in a syntax or parse tree.

    • The product of the analysis stage is intermediate code.

    • In the back-end synthesis part of compilation, evaluation of expressions is done using conversion to Reverse Polish Notation (RPN).

    • Optimisation might take place as part of the synthesis.
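
To make the lexical-analysis bullet concrete, here is a minimal sketch in Python (the token categories and the statement being tokenised are my own invention for illustration, not the textbook's):

# Minimal lexical-analysis sketch: tokenise "total = price * 3" and
# record each identifier in a symbol table (illustrative only).
import re

source = "total = price * 3"

token_spec = [
    ("NUMBER",     r"\d+"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("OPERATOR",   r"[=*+/-]"),
    ("SKIP",       r"\s+"),
]
pattern = "|".join(f"(?P<{name}>{regex})" for name, regex in token_spec)

tokens = []
symbol_table = {}
for match in re.finditer(pattern, source):
    kind, value = match.lastgroup, match.group()
    if kind == "SKIP":
        continue                                  # ignore whitespace
    tokens.append((kind, value))
    if kind == "IDENTIFIER" and value not in symbol_table:
        symbol_table[value] = len(symbol_table)   # give each identifier an entry

print(tokens)         # [('IDENTIFIER', 'total'), ('OPERATOR', '='), ...]
print(symbol_table)   # {'total': 0, 'price': 1}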

Programs written in a high level language are either directly executed by some kind of interpreter or converted into machine code by a compiler (and assembler and linker) for the CPU to execute.

While compilers (and assemblers) generally produce machine code directly executable by computer hardware, they can often (optionally) produce an intermediate form called object code. This is basically the same machine specific code but augmented with a symbol table with names and tags to make executable blocks (or modules) identifiable and relocatable. Compiled programs will typically use building blocks (functions) kept in a library of such object code modules. A linker is used to combine (pre-made) library files with the object file(s) of the application to form a single executable file. The object files that are used to generate an executable file are thus often produced at different times, and sometimes even by different languages (capable of generating the same object format).

Object code is the code produced by a compiler from a program's source code. It could be machine code that can be executed directly, or it could be code that still requires a linker and/or a loader to produce that machine code; which you get depends on the compiler and/or the output the programmer asks the compiler to produce.

A simple interpreter written in a low-level language (e.g. assembly) may have machine code blocks implementing the functions of the high-level language stored in advance, each executed when a function's entry in a look-up table points to that code. However, an interpreter written in a high-level language typically uses another approach, such as generating and then walking a parse tree, or generating and executing intermediate software-defined instructions, or both.

Thus, both compilers and interpreters generally turn source code (text files) into tokens, both may (or may not) generate a parse tree, and both may generate intermediate instructions (for a stack machine, as quadruple code, or by other means). The basic difference is that a compiler system, including a (built-in or separate) linker, generates a stand-alone machine code program, while an interpreter system instead performs the actions described by the high-level program.

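As a rough sketch of the 'intermediate instructions for a stack machine' idea (the instruction names and layout below are invented for illustration; they are not any particular compiler's or interpreter's real output), the expression 2 * (5 + 3) could be lowered to stack-machine instructions and then executed like this:

# Toy intermediate code for 2 * (5 + 3), then executed on a simple stack machine.
program = [
    ("PUSH", 2),
    ("PUSH", 5),
    ("PUSH", 3),
    ("ADD", None),    # pop 5 and 3, push 8
    ("MUL", None),    # pop 2 and 8, push 16
]

stack = []
for op, arg in program:
    if op == "PUSH":
        stack.append(arg)
    elif op == "ADD":
        b, a = stack.pop(), stack.pop()
        stack.append(a + b)
    elif op == "MUL":
        b, a = stack.pop(), stack.pop()
        stack.append(a * b)

print(stack.pop())    # 16

A compiler would write such instructions out (or translate them further into machine code), whereas an interpreter simply performs them on the spot, as the loop above does.
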
A compiler can thus make almost all the conversions from source code semantics to the machine level once and for all (i.e. until the program has to be changed), while an interpreter has to do some of this conversion work every time a statement or function is executed. In an efficient interpreter, much of the translation work (including analysis of types, and similar) is factored out and done only the first time a program, module, function, or even statement is run, which is quite akin to how a compiler works. Even so, a compiled program still runs much faster under most circumstances, in part because compilers are designed to optimise code and may be given ample time for this. This is especially true for simpler high-level languages without (many) dynamic data structures, checks or dynamic typing.

Why Compile?

In order to reduce the complexity of designing and building computers, nearly all of these are made to execute relatively simple commands (but do so very quickly). A program for a computer must be built by combining these very simple commands into a program in what is called machine language. Since this is a tedious and error-prone process, most programming is, instead, done using a high-level programming language. This language can be very different from the machine language that the computer can execute, so some means of bridging the gap is required. This is where the compiler comes in.

A compiler translates (or compiles) a program written in a high-level programming language that is suitable for human programmers into the low-level machine language that is required by computers. During this process, the compiler will also attempt to spot and report obvious programmer mistakes. Using a high-level language for programming has a large impact on how fast programs can be developed. The main reasons for this are:

    • Compared to machine language, the notation used by programming languages is closer to the way humans think about problems.

    • The compiler can spot some obvious programming mistakes.

    • Programs written in a high-level language tend to be shorter than equivalent programs written in machine language.

Another advantage of using a high-level language is that the same program can be compiled to many different machine languages and, hence, be brought to run on many different machines.

On the other hand, programs that are written in a high-level language and automatically translated to machine language may run somewhat slower than programs that are hand-coded in machine language. Hence, some time-critical programs are still written partly in machine language. A good compiler will, however, be able to get very close to the speed of hand-written machine code when translating well structured programs.

The lesson PowerPoint and links at the bottom of this page give more information on the stages of compilation and how an interpreter can execute without creating object code.

The following four videos describe this process in sufficient detail for the exam:

Final stage: object code generation

Overview video by Professor Brailsford

Sidenote: Compiling, linking, JIT Compilers, OSs and ByteCode

Are pre-run-time compilation and linking absolutely different from run-time compilation and linking?

Runtime compilation

The best (most well known) example is the just-in-time compilation used by Java. As covered in a previous lesson, Java code (similar to .NET's CIL (common intermediate language)) is compiled into bytecode which can be interpreted by the Java Virtual Machine (JVM). It is therefore different from, say, C++, which is first fully compiled (after preprocessing) and linked into an executable that can be run directly by the OS without any virtual machine.

The Java bytecode is instead interpreted by the VM, which maps it to processor-specific instructions. That being said, the JVM does JIT (just-in-time) compilation, which takes that bytecode and compiles it (during runtime) into machine code. Even in Java it can depend on which JVM you are using, but basically there are pieces of code called hotspots (the pieces of code that are run frequently) which may be compiled so that the application's performance improves. This is done during runtime because the normal compiler does not have (or might not have) all the necessary data to make a proper judgement about which pieces of code are in fact run frequently. Therefore JIT requires some kind of runtime statistics gathering, which is done in parallel with the program's execution by the JVM. What kind of statistics are gathered, and what can be optimised (compiled at runtime), depends on the implementation: you obviously cannot do everything a normal compiler would do, due to memory and time constraints, so you do not compile everything and usually only a limited set of optimisations is supported in runtime compilation. This is far beyond the requirements of the specification.

How are code sections that need to be compiled (linked) during run-time marked and where is that information kept?

Runtime linking

Linking is a different matter. We cannot use the Java example any more, since Java does not really have a linker like C or C++ (instead it has a classloader, which takes care of loading class files and putting them all together).

Usually linking is performed by a linker after the compilation step (static linking). This has pros (no dependencies) and cons (a higher memory footprint, as we cannot use a shared library, and when the library version changes you need to recompile your sources).

Runtime linking (dynamic/late linking) is actually performed by the OS: it is the OS linker's job to first load shared libraries and then attach them to a running process. Furthermore, there are also different types of dynamic linking: explicit and implicit. This has the benefit of not having to recompile the source when the library version changes, and of allowing library sharing, but it also has drawbacks: what if you have different programs that use the same library but require different versions? (Look up 'DLL hell'.) So yes, those two concepts are also quite different.

Again how it's all done, how it's decided what and how should be linked, is OS specific, for instance Microsoft has the dynamic-link library concept.

If executed code is processor specific, why does it matter which OS it is run on? (taken from StackExchange)

The question asks why, if the code is specific to a CPU, it must also be specific to an OS. This is actually a more interesting question than many of the answers here assume.

CPU Security Model

The first program run on most CPU architectures runs inside what is called the inner ring, or ring 0. How a specific CPU architecture implements rings varies, but nearly every modern CPU has at least two modes of operation: one is privileged and runs 'bare metal' code which can perform any legal operation the CPU supports, while the other is untrusted and runs protected code which can only perform a defined, safe set of operations. Some CPUs have far finer granularity, however, and in order to use virtual machines securely at least one or two extra rings are needed (often labelled with negative numbers), but this is beyond the scope of this answer.

Where the OS comes in: Early single tasking OSs

In very early DOS and other early single-tasking systems, all code was run in the inner ring. Every program you ever ran had full power over the whole computer and could do literally anything if it misbehaved, including erasing all your data or, in a few extreme cases, even doing hardware damage (such as setting invalid display modes on very old display screens). Worse, this could be caused by simply buggy code with no malice whatsoever.

This code was in fact largely OS agnostic: as long as you had a loader capable of loading the program into memory (pretty simple for early binary formats) and the code did not rely on any drivers, implementing all hardware access itself, it would run under any OS, provided it was run in ring 0. Note that a very simple OS like this is usually called a monitor if it is simply used to run other programs and offers no additional functionality.

Modern multi tasking OSs

More modern operating systems, including UNIX, versions of Windows starting with NT, and various other now obscure OSes, decided to improve on this situation. Users wanted additional features such as multitasking, so they could run more than one application at once, and protection, so that a bug (or malicious code) in an application could no longer cause unlimited damage to the machine and data.

This was done using the rings mentioned above. The OS would take the sole place running in ring 0, and applications would run in the outer, untrusted rings, only able to perform a restricted set of operations which the OS allowed.

However, this increased utility and protection came at a cost. Programs now had to work with the OS to perform tasks they were not allowed to do themselves; they could no longer, for example, take direct control of the hard disk by accessing its memory and change arbitrary data. Instead, they had to ask the OS to perform these tasks for them, so that it could check that they were allowed to perform the operation and were not changing files that did not belong to them; it would also check that the operation was valid and would not leave the hardware in an undefined state.

Each OS decided on a different implementation for these protections, partially based on the architecture the OS was designed for and partially based on the design and principles of the OS in question. UNIX, for example, focused on machines being good for multi-user use, while Windows was designed to be simpler, to run on slower hardware with a single user. The way user-space programs talk to the OS is also completely different on x86 from how it is on ARM or MIPS, for example, forcing a multi-platform OS to make decisions based on the need to work on the hardware it targets.

These OS-specific interactions are usually called "system calls" and completely encompass how a user-space program interacts with the hardware through the OS. They fundamentally differ based on the function of the OS, and thus a program that does its work through system calls needs to be OS specific.

The Program Loader

In addition to system calls, each OS provides a different method to load a program from the secondary storage medium into memory. In order to be loadable by a specific OS, the program must contain a special header which describes to the OS how it may be loaded and run.

This header used to be simple enough that writing a loader for a different format was almost trivial; however, with modern formats such as ELF, which support advanced features such as dynamic linking and weak declarations, it is now nearly impossible for an OS to load binaries which were not designed for it. This means that, even if there were no system call incompatibilities, it is immensely difficult even to place a program in RAM in a way in which it can be run.

Libraries

Programs rarely use system calls directly, however; they almost exclusively gain their functionality through libraries which wrap the system calls in a slightly friendlier format for the programming language. For example, C has the C standard library, provided by glibc under Linux and similar systems, and the win32 libraries under Windows NT and above; most other programming languages also have similar libraries which wrap system functionality in an appropriate way.

These libraries can, to some degree, even overcome the cross-platform issues described above. There are a range of libraries, such as SDL, which are designed to provide a uniform platform to applications while internally managing calls to a wide range of OSes. This means that, though programs cannot be binary compatible, programs which use these libraries can share common source between platforms, making porting as simple as recompiling.

Exceptions to the Above

Despite all I have said here, there have been attempts to overcome the limitation of not being able to run programs on more than one operating system. Some good examples are the Wine project, which has successfully emulated the win32 program loader, binary format and system libraries, allowing Windows programs to run on various UNIXes. There is also a compatibility layer allowing several BSD UNIX operating systems to run Linux software, and of course Apple's own shim allowing one to run old MacOS software under MacOS X.

However, these projects work through enormous levels of manual development effort. Depending on how different the two OSes are, the difficulty ranges from a fairly small shim to near-complete emulation of the other OS, which is often more complex than writing an entire operating system, so this is the exception and not the rule.

Formal Languages

Regular expressions, Backus-Naur Form (BNF) and Reverse Polish Notation (RPN) are associated with string patterns and, like the study of Hieroglyphics, they can take some getting used to. The contents below will assist you in getting the information you need.

Natural & Formal Languages

Although formal and natural languages have many features in common - tokens, structure, syntax and semantics - there are many differences.

Natural Language

Natural language comprises words and rules, called grammar, that define the order in which words may be combined. The outputs these rules allow are called sentences. In addition, the rules that define a relationship between a sentence and real-world concepts are called the rules of semantics (semantics means the study of meaning). However, as can be seen in the table above, natural languages can be prone to ambiguity.

Formal Language

A formal language is, by design, unambiguous and written to solve some purpose. That could be stating the different possible hardnesses of a pencil or the combinations of car registration plates.

Formal languages have two components: Symbols and Syntax.

The alphabet of a formal language is its set of symbols or, more correctly, its letters. Remember, these symbols/letters can be any allowable word, number, symbol, etc. The alphabet is finite, and we call a sequence of these letters/symbols a string; strings are the equivalent of sentences in a natural language. For example, with the alphabet {a, b}, the sequences ab, ba and aab are all strings.

Regular expressions do not form part of the specification, so have a separate page. They are useful to study because they give a good grounding to later investigating BNF and the use of formal languages. Equally, they are useful for complex validation. Follow the link for more information: Regular expressions (regex)

Backus-Naur Form (BNF)

BNF is a tool for describing the syntax of a context-free grammar. It is used here to describe Regular Expression Syntax.

BNF defines the set of all possible strings of symbols that constitute legal programs (i.e. strings) in a language. Each production rule in the grammar has an analogous BNF rule.

BNF Production Rules

BNF describes the syntax of a grammar with a set of production rules written in terms of terminal and non-terminal symbols.

terminal

Terminal symbols (characters or character sequences) are bracketed by the meta-symbol "'". For example: the symbol or character a

'a'

non-terminal

Non-terminal symbols are bracketed by the meta-symbols "<" and ">". For example: the non-terminal symbol set

<set>

production rule

Each production rule has a left hand side (LHS) and a right hand side (RHS) separated by the meta-symbol "::=" (read as "consists of" or "defined as"). The LHS is defined by the RHS.

The LHS is a non-terminal symbol. The RHS is some sequence of terminal and non-terminal symbols that define the rule. For example: set is defined as a subset and another subset.

<set> ::= <subset> <subset>

repetition

A symbol or symbols enclosed in curly brackets ( { and } ) denote possible repetition of the enclosed symbols zero or more times. For example: set is defined as 0 or more subsets

<set> ::= { <subset> }

alternation

The meta-symbol "|" (read as "or") is used to define alternate RHS definitions. For example: set is defined as a subset or a set and a subset

<set> ::= <subset> | <set> <subset>
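
Putting these rules together, here is a small worked grammar (my own illustrative example, not taken from the specification) defining an unsigned integer:

<digit>   ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
<integer> ::= <digit> | <digit> <integer>

Reading the second rule: an integer is either a single digit, or a digit followed by another integer. So 7, 42 and 2024 are all valid integers, while 4.2 is not, because no rule produces the '.' character.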

Syntax Diagram

Part of the process of formalising a language and representing the constraints can be confusing to some. A syntax diagram is a pictorial way of representing the same definitions. These are covered in the lesson PowerPoint but their aim is to show how the various elements are split up. They are also called railroad diagrams.

The syntax diagram above shows that a factor can comprise a number or an expression. The ellipse shapes show terminals and the square shapes non-terminals, as defined above. A terminal is non-recursive, whereas a non-terminal is recursive and can comprise any of the parts which define it. In the case above, expressions can comprise numbers, further expressions, etc. Think of a terminal character as a letter, whereas a word is made up of one or more letters; a sentence is made up of words and a full stop (a terminal). The notes and lesson slides explain this further.

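One reason syntax diagrams (and BNF) matter to a compiler writer is that each diagram can be turned directly into a function, giving a recursive-descent parser. The sketch below is a hedged Python illustration: the grammar it assumes (an expression is factors separated by + or *, and a factor is a number or a bracketed expression) is my own stand-in, since the actual diagram is in the lesson slides.

# Recursive-descent recogniser mirroring a factor/expression style syntax diagram.
# Assumed grammar (illustrative only; the real diagram is in the lesson slides):
#   <expression> ::= <factor> { ( '+' | '*' ) <factor> }
#   <factor>     ::= <number> | '(' <expression> ')'
def accepts(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def factor():
        nonlocal pos
        tok = peek()
        if tok == "(":
            pos += 1
            expression()                  # a factor may contain a whole expression (recursion)
            if peek() != ")":
                raise SyntaxError("missing closing bracket")
            pos += 1
        elif tok is not None and tok.isdigit():
            pos += 1                      # a number is a terminal, so just consume it
        else:
            raise SyntaxError(f"unexpected token: {tok}")

    def expression():
        nonlocal pos
        factor()
        while peek() in ("+", "*"):
            pos += 1
            factor()

    expression()
    return pos == len(tokens)             # True only if the whole input was matched

print(accepts(["3", "+", "(", "4", "*", "5", ")"]))   # True

Each non-terminal has become a function, and the recursion in the grammar (an expression appearing inside a factor) becomes a recursive call.
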
Parse Tree / Syntax Tree

Parse trees are not specifically mentioned in the specification, but they are important for understanding translators and precedence ordering. The textbook on page 297 actually shows a syntax tree rather than a parse tree. A syntax tree is a simpler version of a parse tree. Both are linked to syntax diagrams and BNF notation. Do not confuse syntax diagrams with syntax trees.

NB. The third RPN video also gives more explanation on trees as part of the RPN discussion.

The diagram above shows a parse tree (left) and a syntax tree (right) - also called an abstract syntax tree (AST). Parse trees are often converted into a simplified form known as a syntax tree that eliminates wasteful information from the parse tree. Both of these would be read: 3 + 4 * 5.

The syntax tree treats all nodes as terminals, which means it does not preserve the distinction between terminal and non-terminal categories.

The syntax tree is used intensively during semantic analysis, where the compiler checks for correct usage of the elements of the program and the language. The compiler also generates symbol tables based on the AST during semantic analysis. A complete traversal of the tree allows verification of the correctness of the program.
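
As a small, hedged sketch of walking such a tree (the nested-tuple representation is my own simplification rather than the textbook's), the syntax tree for 3 + 4 * 5 could be stored and evaluated like this:

# Syntax tree for 3 + 4 * 5: '+' at the root, with '*' as its right child.
# It is stored as nested tuples and evaluated by a post-order walk.
def evaluate(node):
    if isinstance(node, int):               # a leaf holds a number
        return node
    op, left, right = node                  # an internal node holds an operator and two subtrees
    a, b = evaluate(left), evaluate(right)
    return a + b if op == "+" else a * b

tree = ("+", 3, ("*", 4, 5))
print(evaluate(tree))                        # 23: the tree's shape encodes that * binds tighter than +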

This example, taken from Wikipedia, shows a syntax tree created for a block of code.

This tree is based on code for the Euclidean algorithm:

while b ≠ 0
    if a > b
        a := a − b
    else
        b := b − a
return a

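For reference, here is the same algorithm as runnable Python, a direct transliteration of the pseudocode above (it computes the greatest common divisor of a and b):

# Euclidean algorithm (greatest common divisor), matching the pseudocode above.
def gcd(a, b):
    while b != 0:
        if a > b:
            a = a - b
        else:
            b = b - a
    return a

print(gcd(48, 18))   # 6
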
Technical Note (feel free to ignore): I have made a distinction between parse trees and syntax trees, as these are commonly used. Technically, however, both are parse trees. A true parse tree (as used above) is known as a constituency-based parse tree, and a syntax tree as a dependency-based parse tree.

Reverse Polish Notation (RPN)

Reverse Polish Notation (RPN), sometimes referred to as postfix notation, is a way of writing mathematical expressions in which each operator follows the two operands it applies to, so an expression looks something like 2 5 3 + * instead of (5 + 3) * 2. Its purpose is to eliminate the need for brackets and to make processing calculations straightforward for a computer compared with the traditional 'infix' notation.

The video below will explain infix and postfix in simple terms. Only watch this video if you a) are struggling to understand postfix, or b) want a very rudimentary introduction to the topic, as it starts from the very beginnings of BIDMAS infix notation. Ignore the parts on prefix; this is not covered in the course.

Take 4 + 5 (infix notation), which becomes 4 5 + in postfix (RPN) notation.

The rule is that the operator acts upon the preceding two numbers. Therefore adding 4 and 5 together still gives 9. But why use this curious method? Well, it happens that reverse Polish notation can be used in conjunction with the CPU stack to evaluate expressions.

Let's work on a less trivial example to show that infix and reverse polish are equivalent. Consider the expression below

6*(4+5) - 25/(2+3)

The answer is 49 (6 * 9 = 54 and 25 / 5 = 5, so 54 - 5 = 49).

The same expression laid out in reverse Polish looks like

6 4 5 + * 25 2 3 + / -

To work it out, start from the left and go to the first operator, then carry out that operation on the preceding two numbers. So... (the full expression is repeated at each step below so that you can see where you are in the working)

6 4 5 + * 25 2 3 + / -

becomes

6 9 * 25 2 3 + / -

the next operator is a multiply * so the expression is now

6 9 * 25 2 3 + / -

which is 54, so the expression up to the next operator looks like

54 25 2 3 + / -

so add the 2 and 3 together to make 5, the expression is now

54 25 5 / -

The next operator is a divide, so divide 25 by 5 to get the expression so far as

54 5 -

So finally subtract 5 from 54 and the answer is 49. The same as the infix expression.

By laying out the expression in reverse polish, you avoid the use of brackets to determine precedence. We will use the same example on the next page to show how the stack can be used to work out the answer.

Indices

Indices (powers) are the exception to the left-to-right order used in this method of converting to postfix (different methods construct precedence in slightly different ways): exponentiation associates from right to left.

For example, 4 ^ 3 ^ 2 is constructed from right to left:

i.e. 4 ^ (3 ^ 2), not (4 ^ 3) ^ 2 as you might think. (Numerically, 4 ^ (3 ^ 2) = 4 ^ 9 = 262144, whereas (4 ^ 3) ^ 2 = 64 ^ 2 = 4096.)

Giving a postfix answer of 4 3 2 ^ ^

Here are some further examples using indices.

Brackets have been added to help show which parts are being changed. In addition, operators of equal precedence could be dealt with in the same step, but the steps have been broken down for educational clarity rather than brevity.

1) A * B ^ C + D becomes

A * (B ^ C) + D --> A * (B C ^) + D

(A * (B C ^)) + D --> (A (B C ^) * ) + D

(A (B C ^) * ) + D --> (A (B C ^) * ) D +

Remove the brackets to leave: A B C ^ * D +

2) A + B + C ^ D ^ E * (F - G) / H becomes

A + B + C ^ D ^ E * (F G -) / H

A + B + C ^ (D E ^) * (F G -) / H

A + B + (C (D E ^) ^ ) * (F G -) / H

A + B + ((C (D E ^) ^ ) (F G -) *) / H

A + B + (((C (D E ^) ^ ) (F G -) *) H / )

(A B +) + (((C (D E ^) ^ ) (F G -) *) H / )

(A B +) (((C (D E ^) ^ ) (F G -) *) H / ) +

Remove the brackets to leave: A B + C D E ^ ^ F G - * H / +

EXAM NOTE:

Recent questions do not use brackets, but still expect you to apply the rules of BIDMAS. Therefore the expression A + B - C / D would be evaluated as (A + B) - (C / D) and translates into the RPN expression: AB+CD/-

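For completeness, here is a compact Python sketch of the shunting-yard conversion from infix tokens to postfix. The precedence table is an assumption chosen to match BIDMAS, and ^ is treated as right-associative to match the indices rule above; the sketch reproduces the exam-note example and the worked example from earlier on this page.

# Shunting-yard sketch: convert a list of infix tokens to postfix (RPN).
# The precedence values and right-associative ^ are assumptions matching these notes.
PRECEDENCE = {"^": 3, "*": 2, "/": 2, "+": 1, "-": 1}
RIGHT_ASSOC = {"^"}

def to_postfix(tokens):
    output, operators = [], []
    for tok in tokens:
        if tok in PRECEDENCE:                     # an operator
            while (operators and operators[-1] != "(" and
                   (PRECEDENCE[operators[-1]] > PRECEDENCE[tok] or
                    (PRECEDENCE[operators[-1]] == PRECEDENCE[tok] and tok not in RIGHT_ASSOC))):
                output.append(operators.pop())
            operators.append(tok)
        elif tok == "(":
            operators.append(tok)
        elif tok == ")":
            while operators[-1] != "(":
                output.append(operators.pop())
            operators.pop()                       # discard the "("
        else:                                     # an operand (number or letter)
            output.append(tok)
    while operators:
        output.append(operators.pop())
    return " ".join(output)

print(to_postfix(list("A+B-C/D")))                            # A B + C D / -
print(to_postfix("6 * ( 4 + 5 ) - 25 / ( 2 + 3 )".split()))   # 6 4 5 + * 25 2 3 + / -
print(to_postfix(list("4^3^2")))                              # 4 3 2 ^ ^
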
RPN and the Stack

The following example comes from teach-ICT. The site includes an animation of the process.

Stack algorithm:

    1. Software reads the expression from the left

    2. If it encounters a number, it pushes that number onto the stack

    3. If an operator is encountered, it pops two numbers from the stack and carries out the operation

    4. Push the calculated result onto the stack

    5. End evaluation once the last item in the expression has been processed; the result is the value left on the stack

In the above process, remember that the two values you pop need to be used in reverse order for the operation (this is irrelevant for + and *). E.g. if the stack contained 3, 4, 7, 2 (with 3 at the bottom of the stack), you would pop 2 and then 7, but the operation would be 7 - 2, NOT 2 - 7. This makes sense when you consider the order in which the values were pushed onto the stack, with the first value added before the second.

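Here is a minimal Python sketch of the stack algorithm above (only the four basic operators are handled, and tokens are assumed to be separated by spaces):

# Evaluate an RPN expression with a stack, following the five steps above.
def evaluate_rpn(expression):
    stack = []
    for token in expression.split():              # 1. read the expression from the left
        if token in ("+", "-", "*", "/"):
            b = stack.pop()                       # 3. pop two numbers (note the order:
            a = stack.pop()                       #    the result is a <operator> b)
            if token == "+":
                result = a + b
            elif token == "-":
                result = a - b
            elif token == "*":
                result = a * b
            else:
                result = a / b
            stack.append(result)                  # 4. push the result back onto the stack
        else:
            stack.append(float(token))            # 2. push numbers onto the stack
    return stack.pop()                            # 5. the value left on the stack is the answer

print(evaluate_rpn("6 4 5 + * 25 2 3 + / -"))     # 49.0, matching the worked example above
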
Examples

These examples have been taken from the meta-calculator site.

As the last example illustrates, RPN or postfix notation removes the need for parentheses! To give any operation greater precedence in RPN, just place the associated operator (- in our example) immediately to the right of the two operands (12 and 3)!

More examples:

The following Computerphile video explains RPN very well, along with the use of the stack. There are some additional videos below that explain the process in another way, including the second video which is a follow-on video that explains the history of RPN. With the third video, try to get past his pronunciation as the explanation is good and clear.

Finally, here is a detailed video showing use of the stack in converting infix to postfix using the shunting yard method.

There is a very good website which has lots of practice questions with step-by-step solutions.

Additional Content

Links

The following websites provide further reading and are a source of some of the content for this page: