5.2 Language Translators

Specification

Show understanding of the need for:
- assembler software for the translation of an assembly language program
- a compiler for the translation of a high-level language program
- an interpreter for translation and execution of a high-level language program
Explain the benefits and drawbacks of using either a compiler or interpreter and justify the use of each
Show awareness that high-level language programs may be partially compiled and partially interpreted, such as Java
Describe features found in a typical Integrated Development Environment (IDE), including
- for coding, including context-sensitive prompts
- for initial error detection, including dynamic syntax checks
- for presentation, including prettyprint, expand and collapse code blocks
- for debugging, including single stepping, breakpoints, i.e. variables, expressions, report window

High Level code vs machine code

Introduction

Computers use digital signals to work but humans don't think and work in 'digital'. We use words and sentences to express meaning. The question here is, 'how can a computer, which is digital, be programmed to do useful things by a human, who is not digital'?

Bits and bytes

Computers use digital signals. What this actually means is that the electronic chips inside any computer work by using just two different voltage levels. We humans can represent these voltage levels simply by using a 'one' for a high voltage and a 'zero' for a low voltage. A single BInary digiT (or bit) is either a one or a zero. By grouping together these bits (usually in groups of eight, known as a 'byte') we can create patterns. These patterns can then be used to represent instructions, which tell a computer's processor to do something (like 'add', or 'subtract', or 'store some data' and so on), or data, which can be worked on by the processor. Here is a stream of bytes:

11010011 01110000 11010101 10000111 11111101 01010001 11110001 01111100 11000000

The problem here for humans is that it is very difficult for us to read these patterns of ones and zeros. We can't easily tell what each of the above patterns represents. What instruction does the first byte represent, or is it a piece of data, or is it in fact partly an instruction and partly data?

Machine code

Every computer has a processor. That processor has a fixed set of instructions which it can carry out. The complete set of instructions that a particular processor can carry out is known as its 'instruction set', and a program expressed directly in those instructions is in 'machine code' for that processor. Any particular instruction can be represented as a binary pattern.
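As a rough illustration of how one bit pattern can be partly an instruction and partly data, the sketch below splits a byte into two halves. The 4-bit/4-bit split and the instruction names in OPCODES are invented for this example; every real processor defines its own formats.

```python
# Hypothetical 8-bit instruction format: the top 4 bits select the
# instruction (the opcode) and the bottom 4 bits hold a small piece of data.
OPCODES = {
    0b0001: "LOAD",
    0b0010: "ADD",
    0b0011: "SUBTRACT",
    0b0100: "STORE",
}


def decode(byte_text):
    """Split a string of eight 0s and 1s into (instruction name, data value)."""
    value = int(byte_text, 2)      # e.g. "00100101" -> 37
    opcode = value >> 4            # keep the top four bits
    operand = value & 0b1111       # keep the bottom four bits
    return OPCODES.get(opcode, "UNKNOWN"), operand


print(decode("00100101"))  # ('ADD', 5)
```

The same byte could mean something completely different on another processor, which is why a stream of bytes cannot be read without knowing which machine it was written for.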

Low level languages and assembly programs

Early programmers wrote programs in binary, which the processor of a computer could work with directly. As we have seen, though, humans don't find working with patterns of ones and zeros easy at all. For this reason, programming 'languages' were developed.

The first programming languages were known as 'assembly languages'. These are also known as 'low level languages' because the instructions used in assembly languages are very close to the machine code instructions CPUs use. Instead of writing in binary patterns, which corresponded to machine code instructions, programmers wrote in 'mnemonics'. Here is an example of an assembly program:

ADD #344A

DEC IY

CALL Page

LD B, #1195

A 'mnemonic' is a code for a processor's instruction, which is easy to remember. 01110111 isn't easy to remember but ADD is. It is easier to remember a few hundred mnemonics that make up the machine code for a particular processor than it is to remember a few hundred different bit patterns! Whilst this was an improvement, it was still quite hard to learn how to write programs, to read them and modify them.

High level languages

The next type of language that was developed used actual English words rather than mnemonics. These were known as 'high level languages' and aimed to overcome the problems of learning to read, write and modify programs written in machine code or assembly languages by using English keywords. These types of languages could now be used to express problems in much the same way as humans might logically solve a problem, which was a major improvement on what went before. A typical program might look like this:

WRITE "Press C to continue"
READ KeyPress
WHILE (KeyPress NOT EQUAL TO "C") DO
BEGIN
    WRITE "Press C to continue"
    READ KeyPress
END
ENDWHILE

Can you work out what the above program actually does? You can see real words like 'WRITE' and 'READ' in the program. These are known as 'keywords' or 'reserved words'. Any particular programming language has a set of these special keywords and the programmer can use them to write programs and make the processor behave in a certain way.

There are many different high level languages in existence. You may have heard of some of them: BASIC, Java, JavaScript, Visual Basic, C and so on. They each have their advantages and disadvantages, their fans and people who don't like them and jobs which they are particularly good at being applied to. Can you find the names of some other high level languages and try to find out what sort of jobs they are good at solving?

The need for translators

Introduction

We know that processors are digital devices and use machine code. We know that humans find it hard to write in machine code, preferring instead to write either in a low level language or a high level language. Low and high level language programs, however, are not in machine code and therefore the processor cannot run these programs as they stand. The computer needs to 'translate' them first into machine code instructions, and then run the machine code.

The video below gives a brief introduction to interpreters and compilers. However, note that its section sorting languages into 'compiled' and 'interpreted' is factually incorrect. A language itself is neither compiled nor interpreted; its implementation is. There can be both interpreters and compilers for the same language. For example, Python can be both interpreted and compiled.

Bytecode, Intermediate Code, Object Code, Machine Code, etc.

Source code can be translated into different types of code, depending on how it will be used. Different languages support different types of translated code. Mostly this depends on whether the code is to be executed directly by the target CPU or fed into a virtual machine (e.g. the Java VM), the .NET framework, etc. Each is covered in more detail below, but here is a handy summary of the different types of translated code. Remember, machine code is the only code that ever runs! It's the process of getting to machine code that varies between these.

This video, which is not short, looks in part at how Java Virtual Machines (JVMs) work, and their usefulness. Remember, the specification only asks for an awareness of the use of such architecture, so the video holds much detail you do not need to know, but might find interesting.

Translators

When a program is written in a low or high level language, it has to be converted into machine code instructions, so that the processor can actually run it. This is the job of a 'translator' program.

Source code - translator - object code

A translator program takes the original program, which is more commonly called the 'source code' or the 'source file', and converts it into the equivalent machine code instructions. The converted code is commonly known as the 'object code'. This object code can then be run by a computer's processor.

Assemblers, compilers and interpreters

Introduction

We have already seen that processors can only run machine code instructions. If a program is written in a high level language, or indeed an assembly language, then it has to be translated into machine code instructions before the processor can run it. There are three types of translators: assemblers, compilers and interpreters.

Assembler

An assembler is the type of translator that converts assembly programs into machine code instructions. We saw in a previous section what part of an assembly program might look like. It is made up of 'mnemonics'. These help the programmer remember instructions.

ADD #344A

DEC IY

CALL Page

LD B, #1195

Although it may not appear so at first, these mnemonics are quite close to machine code. In general, the assembler converts each assembly instruction into just one machine code instruction, so a well-written assembly program typically runs very quickly compared to a compiled or interpreted high level program.
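The one-to-one relationship between mnemonics and machine code instructions can be sketched as a simple table lookup. This is only a toy: the opcode values in OPCODE_TABLE are invented for an imaginary processor, and operands are left as text, whereas a real assembler would also encode them in binary.

```python
# Invented opcode table for an imaginary processor: each mnemonic maps to
# exactly one machine code byte. Real instruction sets differ.
OPCODE_TABLE = {
    "LD": 0b00010000,
    "ADD": 0b01110111,
    "DEC": 0b00100101,
    "CALL": 0b11001101,
}


def assemble(source):
    """Translate each assembly line into one (opcode bits, operand) pair."""
    machine_code = []
    for line in source.strip().splitlines():
        mnemonic, _, operand = line.strip().partition(" ")
        opcode = OPCODE_TABLE[mnemonic]          # one mnemonic -> one opcode
        machine_code.append((format(opcode, "08b"), operand.strip()))
    return machine_code


program = """
ADD #344A
DEC IY
CALL Page
"""
for opcode_bits, operand in assemble(program):
    print(opcode_bits, operand)
```

Three lines of assembly go in and three instructions come out, which is the key contrast with compilers and interpreters.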

Compiler

Programs written in some high level languages are 'compiled' to get them into object code, which the processor can then use. A compiler is a type of translator which takes an entire finished program (the source code) and converts it into object code in one go. Unlike an assembly instruction, a single statement in a high level program will usually be converted into many machine code instructions. If there are any problems with the code, the compiler will report them at the end of the compilation process. If the programmer makes any corrections, then the whole program has to be compiled again. Once compiled with no errors, the object code can then be run by the processor.

Interestingly, the object code is what you would buy in a shop when you buy a game, for example. Once you have the object code, it will run without the compiler. The compiler's job is finished once it has translated the code. You can therefore put the object code (but not the original source code) on a DVD and sell it. If anyone tries to view the object code to see how the program was written, they won't get very far! All they will see is a lot of ones and zeros. Getting it back to the original source code (called 'reverse engineering') is very difficult to do.
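Two features of a compiler can be sketched with a toy example: one statement becomes several low-level instructions, and all errors are reported together once the whole program has been read. The one-statement language (PRINT followed by two whole numbers and a plus sign) and the instruction names PUSH, ADD and OUT are made up for this illustration and do not belong to any real compiler.

```python
def compile_program(source):
    """Translate the whole program in one go; return (object_code, errors)."""
    object_code, errors = [], []
    for number, line in enumerate(source.strip().splitlines(), start=1):
        parts = line.split()
        # The only statement this toy language knows: PRINT <int> + <int>
        valid = (
            len(parts) == 4
            and parts[0] == "PRINT"
            and parts[2] == "+"
            and parts[1].isdigit()
            and parts[3].isdigit()
        )
        if not valid:
            errors.append(f"line {number}: cannot translate '{line.strip()}'")
            continue                      # keep going to find further errors
        # One high-level statement becomes several low-level instructions.
        object_code += [
            ("PUSH", int(parts[1])),
            ("PUSH", int(parts[3])),
            ("ADD", None),
            ("OUT", None),
        ]
    # Object code is only produced if the whole program translated cleanly.
    return ([] if errors else object_code), errors


good_code, good_errors = compile_program("PRINT 2 + 3")
print(len(good_code), good_errors)        # 4 []

bad_code, bad_errors = compile_program("PRINT 2 + 3\nPRNT 1 + 1\nPRINT x + 1")
print(bad_code, bad_errors)
```

In the second call, both faulty lines are reported in one batch and no object code is produced, so nothing can be run until every error has been fixed and the program compiled again.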

Interpreter

Programs written in some high level languages are 'interpreted' rather than compiled. The key difference here is that the first line of the source code is translated and then run by the processor, then the second line is translated and run, then the third and so on, until the program has finished. (A compiler turns all of the source code into object code in one go first, and only then is the object code run.) As with compiled programs, a single statement in an interpreted program usually corresponds to many machine code instructions.

Interpreted programs generally run much slower than compiled programs, because each line has to be translated every time it is run. You also have to have the interpreter program in RAM at the same time as the source code you are running, as it is always needed in the translation process. (With compiled programs, you only need the compiler in RAM whilst you are converting the source code into object code. After it has finished, you can close the compiler down and just run the object code on its own.)

A key reason for using an interpreter rather than a faster compiler is that as soon as an interpreter finds an error, it stops! That means the programmer can see exactly where there is a problem in the program, correct it, and then continue from that point onwards. With compiled programs, you get all the error messages at the end. Sometimes they are not easy to understand, or do not pinpoint exactly where the error is in the code, and you then have to recompile all the code after each correction.

So the huge benefit of interpreters over compilers is that they are really good for developing programs and 'debugging' them.
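The way an interpreter stops at the first error can be sketched with a toy example. The one-statement language used here (PRINT followed by two whole numbers and a plus sign) is invented purely for illustration; a real interpreter handles a full language.

```python
def interpret(source):
    """Translate and run one line at a time, stopping at the first error."""
    output = []
    for number, line in enumerate(source.strip().splitlines(), start=1):
        parts = line.split()
        # The only statement this toy language knows: PRINT <int> + <int>
        if (len(parts) == 4 and parts[0] == "PRINT" and parts[2] == "+"
                and parts[1].isdigit() and parts[3].isdigit()):
            output.append(int(parts[1]) + int(parts[3]))   # run it straight away
        else:
            # Earlier lines have already run; report exactly where we stopped.
            return output, f"stopped at line {number}: '{line.strip()}'"
    return output, None


results, error = interpret("PRINT 2 + 3\nPRNT 1 + 1\nPRINT 4 + 4")
print(results)   # [5]
print(error)     # stopped at line 2: 'PRNT 1 + 1'
```

Line 1 has already produced its result by the time the mistake on line 2 is found, and line 3 is never reached. The programmer is pointed straight at the faulty line.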

This video, which can also be found in section 4.3, looks at different assemblers and covers topics such as relative and indirect addressing, directives, etc.

Translators: Extra Information

The need for translators

Whether the source code for a program is written in a low-level language or a high-level language, it must be translated into the code that the computer can use, the ones and zeros, before the CPU can actually run it. The source code is passed to a special translating program that then converts it into ‘object code’. The object code is usually in the 'ones and zeros' form of the computer and for this reason, the term object code is often used interchangeably with machine code, executable code or even executable machine code!

This is the way most languages work although, just to confuse things, some languages such as Java do not produce machine code upon translation but produce an ‘intermediate code’ that is then converted into machine code when executed. This will be discussed in more detail in the chapter on object oriented languages.

Some languages such as C are usually translated in one go using a compiler, which takes the whole source code at once. The object code then runs very quickly. Other languages such as BASIC are often translated one line at a time using an interpreter: it translates one line of the source code, runs that line, then gets the next line and repeats the process. Interpreted code runs much slower than compiled code but it is very useful for writing, developing and debugging programs because the program will run correctly up to any error in the program. It will stop if it finds an error. The programmer can then examine the code at that point and re-run it, without the need for re-compilation. This is very useful and a much simpler process than debugging compiled programs.

Why companies usually distribute object code and not source code

When you buy an application or a game, for example, you are usually buying just the object code, not the source code.

If you were sold the source code, you would need to ensure that you had the correct translator program so that you could convert your application into something the computer could understand, into object code! Suppose you didn’t have the correct translator? You wouldn’t be able to run the program! If you received the object code when you bought an application or game, you could simply execute it. You wouldn’t need to translate it.
You also wouldn’t need to use up so much of your valuable RAM. This is because you wouldn’t need RAM to store the source code, or the translator program, or any temporary storage whilst the translation was being done. You would only need the RAM to store the object code.
If you do only have the object code in your possession, however, you would not easily be able to modify it. It is very difficult to take some object code and reverse engineer it back into source code.
The source code itself would be jealously guarded by the company who wrote the application. They would want to protect their copyright. (Writing software is a very expensive business). If they needed to make any changes or updates to the application, they would be made to the source code as required. After any modifications, the source code would need to be translated again and the updated object code made available to users, perhaps via the web.

Different types of translators

There are three different types of translating program. Which one you would use depends upon the actual language you are writing the source code in. The three types of translator are:

Compilers
Interpreters
Assemblers

We can summarise the use of the three types of translator as follows:

If you wrote a program in a low-level language called Assembly then you would translate the source code into object code with an assembler.
If you wrote a program with certain high level languages such as Pascal, C or COBOL, you would translate the source code into object code using a compiler.
If you wrote a program with certain high level languages such as BASIC or Perl, you would use an interpreter, which translates and runs the source code one line at a time rather than producing object code.
If you wrote a program with certain high level languages such as VB or Java you would translate the source code into an intermediate code using a compiler. The intermediate code would then be run with an interpreter.

We can summarise what happens with the following diagram.

Assembly, machine code and assemblers

Understanding assembly, machine code and assemblers

In the Resources part of the web site (click on the FAQ and Resources link at the top of the page) you will find links to some excellent simulations, such as the Little Man Computer and SimCom. It is a lot of fun writing your own programs for these and you will begin to really understand how programming (the software) works with the CPU, RAM and the registers (the hardware).

High-level languages

People write programs in programming languages! There are many different languages, and many different ways of classifying, or grouping together, languages. You will have been learning a language on the course. You might be learning Visual Basic, or Pascal, or Java, for example. Below is an example of a short Java program. Can you work out what it does?

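//My first program
public class Hello
{
    public static void main(String[] args)
    {
        System.out.println("Hello World");
        // EasyIn is a helper class, not part of the standard Java library;
        // pause() is assumed to wait until the user presses a key.
        EasyIn.pause();
    }
}
//This is the end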

Java, Visual Basic, Pascal and so on are examples of a group of languages known as ‘high-level languages’. These are languages that use ‘keywords’ in programs, words that are very close to English. These types of languages also refer to memory locations using variable names rather than their actual addresses in RAM. For these reasons, high-level languages are easier to learn than low-level languages. They can also be more easily ‘read’ and so are easier to debug. These kinds of languages, because they are written using keywords, must be changed (or ‘translated’) into a form that the Central Processing Unit can understand (known as ‘machine code’) before they can be run. A computer does not ‘understand’ the word PRINT. It does, however, know how to make the hardware print something out if it receives the machine code instructions that the keyword PRINT is translated into. The machine code equivalent of PRINT might look like this: 00010101 00001001 01110010 11110101 01110010.

Low-level languages

Low-level languages do not use keywords such as 'print', 'pause', 'if', 'while' and so on. They use mnemonics. These are codes for instructions, such as DEC (short for ‘decrement’) or LD (short for ‘load’). Mnemonics are designed to be easily remembered (hence the word 'mnemonic'). An example of a short low-level program using mnemonics is shown below.

ADD (#344A)

DEC IY

CALL PAGE

LD B, #1195

Low-level languages are much closer to the workings of a computer. Often, control programs that require very fast execution speeds are written in a low-level language known as assembly. This is because when they are converted into machine code (using an assembler), they typically produce less machine code than if the equivalent program was written in a high-level language. They can therefore run faster! There are also applications that require you to manipulate a computer's hardware in a way that is difficult to do with high-level languages. For example, a programmer might use a low-level language to write a printer driver for a new printer and a high-level language to write a sales program that calculates salesmen's commissions.

High-level languages are designed to allow a programmer to solve real-world problems.
Low-level languages are designed to allow a programmer to manipulate a computer's hardware. In fact, they are sometimes known as ‘machine-orientated languages’.

Low-level languages, while quite difficult to learn compared to the newer high-level languages, were a big improvement on what went before. Previously, programmers had to write in the code that the computer could actually work with and use! Below is an example of part of a machine code program. Imagine writing programs in ones and zeros!

0100 1010 0000 1101

0000 1011 0111 0111

0011 1111 0010 1000

Intermediate Code

Compilers plus interpreters

Some programs written in languages such as Java are both compiled and interpreted! A program is firstly compiled into an intermediate code known as bytecode. It is then distributed to users who use a wide range of computers such as Macs or PCs. These computers then run their own 'interpreter' to convert the bytecode into a code they can use. Languages such as Java are said to be ‘platform-independent’, because a program written in that language can run on any machine that has a suitable interpreter. These types of languages are ideal for use on the Internet; you don't need to know anything about the computer that will be running your code!

Java

Computers such as an IBM clone (a 'normal' PC) or a Macintosh each have their own CPUs that use their own machine code. If you write a program in Pascal, for example, you can run it on a school PC only after you have translated it using a compiler into machine code. You couldn't, however, take that object code and run it on a Macintosh, because it has a different CPU that has a different instruction set. You would have to retranslate the source code using a different compiler. Java is an object oriented (OO) high level language. It was designed so that the code can run on any machine! How does it do this?

When a Java program is compiled, it is compiled into code known as Java 'bytecode', for a machine that doesn't physically exist, called a Java virtual machine! The bytecode can then be distributed to different types of computers. Each of these types of computers will need to have its own type of interpreter (rather than a compiler). These interpreters can take bytecode and run it on that computer one instruction at a time.
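The idea of running bytecode on a virtual machine can be sketched in a few lines. The instruction names (PUSH, ADD, PRINT) and the stack-based design below are invented for this illustration and are not real Java bytecode; the point is only that the bytecode is plain data, and any computer with a matching interpreter can carry it out.

```python
# Invented bytecode for an imaginary stack-based virtual machine.
# It is plain data, so the same list could be sent to any computer.
BYTECODE = [
    ("PUSH", 2),
    ("PUSH", 3),
    ("ADD", None),
    ("PRINT", None),
]


def run_bytecode(bytecode):
    """A tiny interpreter: carry out each bytecode instruction in turn."""
    stack, printed = [], []
    for operation, argument in bytecode:
        if operation == "PUSH":
            stack.append(argument)
        elif operation == "ADD":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif operation == "PRINT":
            printed.append(stack.pop())
        else:
            raise ValueError(f"unknown bytecode instruction: {operation}")
    return printed


print(run_bytecode(BYTECODE))  # [5]
```

Only run_bytecode would need rewriting for a new type of computer. The bytecode itself stays exactly the same.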

Why not simply miss out the Java bytecode stage and distribute the source code and have a compiler for each type of machine rather than an interpreter? Amongst other reasons, compilers are more complex programs compared to interpreters. If you have a new type of CPU it is far easier to write a new interpreter than a new compiler.

Java and the Internet

Java was once used extensively on the web. Small programs called applets were written by programmers and transmitted with HTML code across the Internet. If you had a browser with a Java bytecode interpreter and had enabled your browser to accept Java applets, then they would run when downloaded. Suddenly, very boring HTML web pages could be turned into anything the programmer wanted to turn them into! Not everyone liked the idea of downloading and running programs not guaranteed to be virus-free and which might compromise personal privacy. As a result, some people disabled Java applets on their PC, and modern browsers no longer run them.

IDEs

Features of an IDE

The Integrated Development Environment (IDE) provides a number of features to assist with initial program debugging, including:

Syntax checking (on entry)
Structure blocks (e.g. IF structure and loops begin/end highlighted)
General prettyprint features
Automatic indentation
Highlights any undeclared variables
Highlights any unassigned variables
Commenting out/in of blocks of code
Visual collapsing / highlighting of blocks of code
Single stepping
Breakpoints
Variable/expressions report window
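Syntax checking on entry can be sketched using Python's built-in compile() function, which parses source text without running it. This is only an illustration of the idea and not how any particular IDE is implemented; the function name check_syntax and the filename label "<editor>" are made up for the example.

```python
def check_syntax(source):
    """Return None if the source parses, else (line number, message)."""
    try:
        # compile() only parses and translates the text; it does not run it.
        compile(source, "<editor>", "exec")
    except SyntaxError as error:
        return error.lineno, error.msg
    return None


print(check_syntax("x = 1\nif x == 1:\n    print(x)\n"))   # None

problem = check_syntax("x = 1\nif x == 1\n    print(x)\n")  # missing colon
print(problem[0])                                           # 2
```

An IDE could run a check like this each time the text changes and underline the reported line, so the programmer sees the mistake before trying to run the program.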

This video, although using VB6, still contains a number of elements included in the specification. As part of the debugging section, you should notice the use of single-stepping (running a line at a time) and breakpoints (the red lines).