1.1 Data Representation

Specification

Number Systems

As with the SI system of units, computer systems have a hierarchical system for representing bits.  A bit is the smallest unit of data that can be read and understood by a computer.  Each successive unit is made up of two or more bits.

A word, represented here as two bytes, actually depends on the operating system and processor.  In short, it is the largest unit of data that can be transferred to the processor in one clock cycle.

Bytes use the SI prefixes (kilo, mega, etc.), but because binary is base 2, each unit goes up in steps of 1024, not 1000.  There was, and still is, considerable confusion when using SI units in computing, because some people assume a kilobyte is 1000 bytes, while others, following the traditional computing convention used on this page, take a kilobyte to be 1024 bytes.

To avoid future confusion, the International Electrotechnical Commission (IEC) put forward a new system of unit representation, which is slowly catching on.  Under this new system the SI prefixes are strictly base 10, while the new binary prefixes (kibi, mebi, gibi, giving KiB, MiB, GiB and so on) are always base 2 (i.e. powers of 2, not 10).

For clarification, I will always refer to, and calculate, the SI units as base 2.

Number Base Conversion

A number base is simply how many symbols are available in that number system.  A binary digit can be either 0 or 1, which makes binary base 2.  Decimal (denary) is base 10 because we have the symbols 0 to 9.  Hexadecimal is base 16 because we have 0 to 9 and A, B, C, D, E and F.

Computer systems use binary, humans use denary, and hexadecimal (often called hex) is better for expressing large binary values compactly.  Because these different number bases coexist, we often need to convert between them.

The sections below break down the different bases, but this Wikibooks website gives a comprehensive background to what you need to know.

There are two general methods which work for any number base: one for converting to denary and one for converting from denary.

Converting Between Units

When converting from a larger unit to a smaller unit you will always need to multiply; when converting from a smaller unit to a larger unit, you divide.  E.g.
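For example (using the base-2 convention adopted on this page): 3 KB = 3 × 1024 = 3,072 bytes, while 8,192 bytes = 8,192 ÷ 1024 = 8 KB.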

[Number base] TO Denary

The video to the right shows how to convert from any base to base 10 (denary/decimal).  Remember to look at the Wikibooks link (above) for a comprehensive overview of all number base calculations.

This wiki page explains positional notation.
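If you want to experiment with positional notation yourself, here is a minimal Python sketch of the same idea (the function name to_denary is just for illustration): each digit is multiplied by its place value, working from the most significant digit.

def to_denary(digits, base):
    # digits is a string with the most significant digit first, e.g. "1101"
    value = 0
    for d in digits:
        # shift the running total one place to the left, then add the next digit
        value = value * base + int(d, base)
    return value

print(to_denary("1101", 2))   # 13
print(to_denary("2A7", 16))   # 679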

Denary TO [Number base]

The video on the right (linked to the one above) shows a universal method of converting from base 10.  There are other methods, such as using subtraction, to achieve this, but the method below is preferred.

This wiki page explains positional notation.
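As a companion to the sketch above, here is a minimal Python version of the repeated-division method (from_denary is again just an illustrative name): keep dividing by the base and read the remainders in reverse.

def from_denary(value, base):
    # repeatedly divide by the base; the remainders, read in reverse, are the digits
    symbols = "0123456789ABCDEF"
    if value == 0:
        return "0"
    digits = ""
    while value > 0:
        digits = symbols[value % base] + digits
        value //= base
    return digits

print(from_denary(13, 2))    # 1101
print(from_denary(679, 16))  # 2A7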

Binary TO Hexadecimal (and vice versa)

This is possibly the easiest conversion of all.  Because a single hexadecimal digit corresponds to exactly 4 bits (a nibble), you can simply do a lookup.  Look at the table below.

TIP: Drawing the table below is very simple once you see the binary pattern.  Moving from the least significant column to the most significant, the 0s and 1s repeat in blocks of 1, 2, 4 and 8: each column's block size doubles.

The decimal equivalent is shown here for clarification; it is not needed when converting.

Given the hex value D3F, to convert this to binary we simply look up the binary for D, 3 and F.

D = 1101

3 = 0011

F = 1111

Therefore D3F = 1101 0011 1111 in binary.

When converting from binary, always start at the least significant bit and group the bits in 4s.  If the final (leftmost) group has fewer than 4 bits, say only 2, pad it with extra leading 0s.  E.g.

101 1011 1001

  5      B     9

Notice how the 101 was read 0101.
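Here is a minimal Python sketch of the nibble-lookup approach (the function names are illustrative); note how it pads the leftmost group with leading 0s, just as described above.

def bin_to_hex(bits):
    bits = bits.replace(" ", "")
    # pad with leading 0s so the length is a multiple of 4
    bits = bits.zfill((len(bits) + 3) // 4 * 4)
    # each group of 4 bits becomes one hex digit
    return "".join(format(int(bits[i:i + 4], 2), "X") for i in range(0, len(bits), 4))

def hex_to_bin(hex_digits):
    # each hex digit becomes exactly 4 bits
    return " ".join(format(int(h, 16), "04b") for h in hex_digits)

print(bin_to_hex("101 1011 1001"))  # 5B9
print(hex_to_bin("D3F"))            # 1101 0011 1111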

Full Number Base Chart

Test Your Knowledge

Here are some websites and games to ensure you understand this content. Make your own too.

Interactive binary game

Cisco binary game

Why Binary?

Two's Complement

Two's complement is a modification of one's complement, which uses the most significant bit (MSB) as a sign indicator.  Remember that unsigned binary numbers are neither positive nor negative.  Two's complement improves on one's complement by removing the possibility of both a +0 and a -0.

Two's complement is not complicated.  It is an extremely popular way for computers to represent integers, and some floating-point (fraction-storing) methods also use two's complement.  To get the two's complement notation of a negative integer, you write out the magnitude of the number in binary, invert the digits, and add one to the result.  When writing out the binary number, always pad with the necessary leading 0s before converting to two's complement (exam questions will always give a certain number of bits in which the number MUST be represented).  Also note that the two's complement process ONLY needs to be followed when dealing with negatives.

An exception for largest negative value

Unfortunately, because the two's complement range is asymmetric (there is one more negative value than positive), there is an exception for the largest negative value.  This is the only exception, and the usual method of inverting and adding one doesn't really apply in the traditional sense.  I.e. the largest negative value in 5-bit two's complement is 10000 (-16) -- noting that there is NO positive equivalent, as the positive numbers only range from 0 to 15.

Converting To Two's Complement

Example 1:

Let's take an example using 8-bit storage and suppose we want to find how -35 would be expressed in two's complement notation.  First we write out 35 in binary form:

00100011

Then we invert the digits. 0 becomes 1, 1 becomes 0.

11011100

Then we add 1:

11011101

That is how one would write -35 in 8 bit binary.

You might notice that, using 8-bit two's complement, the lowest and highest values that can be stored are -128 and +127.  This is because, in the above example, the 8th bit is used for the sign.  To store values greater than +127 (or lower than -128) we need to increase the number of bits.

Example 2:

Let's take an example that can confuse students, again using 8-bit:  -128.  First write 128 in binary, which is:

10000000

Now invert the digits:

01111111

Now we add 1:

10000000

Until students get used to two's complement, they often feel the answer should be 11111111, which is in fact -1 (and +127 is 01111111).
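If you want to check examples like these, here is a minimal Python sketch of the invert-and-add-one process (twos_complement is just an illustrative name; bits is the fixed width the question specifies).

def twos_complement(value, bits=8):
    # write the magnitude in binary, padded with leading 0s to the required width
    pattern = format(abs(value), "0{}b".format(bits))
    if value >= 0:
        return pattern
    # invert every bit, then add 1, keeping only the required number of bits
    inverted = int(pattern, 2) ^ ((1 << bits) - 1)
    return format((inverted + 1) % (1 << bits), "0{}b".format(bits))

print(twos_complement(-35))    # 11011101
print(twos_complement(-128))   # 10000000
print(twos_complement(127))    # 01111111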

Converting From Two's Complement

To convert from two's complement back to decimal, you use exactly the same process (invert and add 1), as long as the number is NEGATIVE (i.e. has 1 as the MSB).  If the MSB is 0, the value is positive and can be read as ordinary binary.

Example 1:

Let's take this 8 bit two's complement: 

10001110.  

First invert the digits:

01110001

Add 1:

01110010

Remembering that the original was negative, 01110010 is 114 in decimal, so the value is -114.
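Here is the reverse process as a minimal Python sketch (again, the function name is just for illustration): if the MSB is 1, invert, add 1 and negate.

def from_twos_complement(pattern):
    # MSB of 0 means the value is positive: read it as ordinary binary
    if pattern[0] == "0":
        return int(pattern, 2)
    # MSB of 1 means negative: invert, add 1, then attach the minus sign
    bits = len(pattern)
    inverted = int(pattern, 2) ^ ((1 << bits) - 1)
    return -(inverted + 1)

print(from_twos_complement("10001110"))  # -114
print(from_twos_complement("01110010"))  # 114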

Other Benefits

A major benefit of two's complement is subtraction.  To subtract one number from another, we simply add the two's complement of the number being subtracted.

Example 1:

Let's take 12 - 12.  12 and -12 in 8-bit binary are:

12:  00001100

-12: 11110100

Added together are:

00000000, which is 0 (ignoring the carry out of the leftmost bit)

Example 2:

Let's take another example: 105 - 62

We convert -62 into two's complement and add together as shown below:

105: 01101001

-62:  11000010

Added together give:

00101011 = 43

Example 3:

Finally, let's calculate 43 - 87:

43:  00101011

-87: 10101001

Added together give:

11010100 = -44

This result is negative (the MSB is 1), so it was converted back to decimal using the two's complement conversion shown earlier.
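The three examples above can be reproduced with a short Python sketch (subtract_8bit is an illustrative name).  It adds the two's complement of the second number and keeps only 8 bits, which is what discards the carry out.

def subtract_8bit(a, b):
    # form the two's complement of b (invert and add 1, within 8 bits), then add
    neg_b = ((b ^ 0xFF) + 1) & 0xFF
    result = (a + neg_b) & 0xFF   # the & 0xFF discards the carry out of the 8th bit
    # interpret the 8-bit result as a signed two's complement value
    return result - 256 if result & 0x80 else result

print(subtract_8bit(12, 12))   # 0
print(subtract_8bit(105, 62))  # 43
print(subtract_8bit(43, 87))   # -44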

Storing a negative number in HEX

To convert a negative number to hexadecimal, you first need to write the number in two's complement binary, and then follow the process below.

Two's complement representation to HEX

To convert a two's complement binary pattern to hex, treat the bit pattern as an ordinary binary value and write that value in hex.  (You can use the same nibble shortcut as in any other binary --> hex conversion.)
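For instance, taking the result of Example 3 above: -44 is 11010100 in 8-bit two's complement, and grouping into nibbles gives 1101 (D) and 0100 (4), so -44 is stored as D4 in hex.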

Negative HEX to Decimal

The question will need to state which method is being used (e.g. two's complement).  You then convert the hexadecimal number to binary, invert the bits, add 1 and then calculate the value of the resulting binary, remembering that the final answer is negative.
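For instance, suppose a question gives the 8-bit two's complement value 8E in hex.  Converting to binary gives 1000 1110; the MSB is 1, so the value is negative.  Inverting gives 0111 0001 and adding 1 gives 0111 0010, which is 114, so 8E represents -114 (matching the example in the previous section).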

Two's Complement Calculator

Here is a two's complement converter so you can check your calculations.

Character Sets

Character sets are defined in terms of encoding mechanisms.  For the exam, you are expected to understand ASCII and Unicode.  You DO NOT need to know individual characters (e.g. A = 65), but you should be able to explain how the systems work and encode/decode if necessary.  The encoding mechanism simply dictates the rules for translating a string of characters into a sequence of bytes.  Encoding, in computing terms, is simply the process of taking a piece of data and deciding how to store it on a computer.

Note:  The character set does not determine how it is displayed or printed.  Each character is mapped to a particular glyph (think 'image' - usually a vector set of coordinates) in a given font.  A common font system is TrueType (designed by Apple).  It is the font that determines how, for example, character 65 is printed.  This is why in some fonts, 65 might be a musical note or other glyph (e.g. using Wingdings).

ASCII

The American Standard Code for Information Interchange, or ASCII for short, is a 7/8-bit character set used in English-speaking countries.  Its use has mostly been replaced by Unicode (predominantly UTF-8) but it is still used today for simple applications.  The original ASCII codes used 7 bits, but this was later extended to 8 bits (called extended ASCII).  The table below shows the standard 7-bit ASCII symbols (note: 0-31 are not printable characters, but control codes).  When viewing an ASCII file, not all characters will be displayed.  There are characters such as end of line, tab and carriage return that affect the formatting without being obvious.  This Wikibook page gives a little more detail along with sample exam questions.

https://upload.wikimedia.org/wikipedia/commons/thumb/1/1b/ASCII-Table-wide.svg/2000px-ASCII-Table-wide.svg.png

Click on the table to view the original image.

The following video explains ASCII in simple terms.  Feel free to jump around the video as some parts are either irrelevant or long-winded.

UNICODE

With interconnected computers and eventually the Internet, there grew a need for a universal encoding scheme that could be used by all countries, without having to deal with the hundreds of local encoding schemes that grew out of the computing boom.  A group was established to create such a universal encoding scheme, and it was called Unicode.  The simple aim: to create a gigantic table of all possible characters and assign each a unique number.  This is, in essence, what Unicode is.  However, and confusingly, Unicode doesn't specify how each character is stored on a computer; this is where the Unicode Transformation Format (UTF) encoding schemes come in.  There are three main ones: UTF-8, UTF-16 and UTF-32.

When talking about character encoding (e.g. ASCII or Unicode) we are referring to the storage of code points.  A code point is a value that represents a given character, unique from the others.  ASCII comprises 128 code points from 0 to 127 (7F in hex).  Extended ASCII ranges from 0 to 255 (FF in hex).  Unicode comprises 1,114,112 code points ranging from 0 to 1,114,111 (10FFFF in hex).  Code points are usually denoted in base 16 using the format U+0F4A, which represents the decimal value 3,914 (the Tibetan letter ཊ).
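If you have Python to hand, you can inspect code points directly: ord() returns a character's code point and chr() does the reverse (the U+ formatting below is just for display).

# print each character alongside its code point in hex (U+ notation) and decimal
for ch in ["A", "é", "ཊ"]:
    print(ch, "U+{:04X}".format(ord(ch)), ord(ch))

# Expected output:
# A U+0041 65
# é U+00E9 233
# ཊ U+0F4A 3914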

An excellent place to start when looking at Unicode is this article.  It explains Unicode in simple terms, and in more than enough detail for your exam.  The article explains UTF-8 (Unicode Transformation Format, 8-bit), which is different from UTF-16 and UTF-32.

Exam Note

CIE advise that for the exam, "Candidates need to know the most common Unicode standard(s). They must be able to explain what Unicode is and why it is used instead of ASCII."  My suggestion is that you familiarise yourself with UTF-8 and take a passing glance at UTF-16 and UTF-32.  Do not worry about other formats such as UTF-16LE, etc.

This Computerphile video by Tom Scott explains the advance from ASCII to UTF-8.

This Wikibooks page will give you an overview of Unicode along with some questions which could come up in the examination.

Unicode Extra Detail

Firstly, this Microsoft blog document gives some good background on ANSI/ASCII and UNICODE.

In UTF-8, every code point from 0-127 is stored in a single byte.  Only code points 128 and above are stored using 2, 3 or more bytes (up to 6 in the original design, although the current standard uses at most 4).  UTF-8 uses the following rules:

How UTF-8 works

Picture taken from Joel on Software.
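To see the variable-length encoding in action, here is a minimal Python sketch; .encode('utf-8') is the standard library call that performs the UTF-8 encoding, and the characters chosen are just examples.

# show how many bytes UTF-8 uses for characters with increasingly large code points
for ch in ["A", "é", "€", "😀"]:
    data = ch.encode("utf-8")
    print(ch, "U+{:04X}".format(ord(ch)), len(data), "byte(s):", data.hex(" "))

# Expected output:
# A U+0041 1 byte(s): 41
# é U+00E9 2 byte(s): c3 a9
# € U+20AC 3 byte(s): e2 82 ac
# 😀 U+1F600 4 byte(s): f0 9f 98 80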

UTF-8 has several convenient properties:

This somewhat humorous article gives more detail relating to Unicode.  The content below is an extract from this article.

There Ain't No Such Thing As Plain Text.

If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.

Almost every stupid "my website looks like gibberish" or "she can't read my emails when I use accents" problem comes down to one naive programmer who didn't understand the simple fact that if you don't tell me whether a particular string is encoded using UTF-8 or ASCII or ISO 8859-1 (Latin 1) or Windows 1252 (Western European), you simply cannot display it correctly or even figure out where it ends. There are over a hundred encodings and above code point 127, all bets are off.

How do we preserve this information about what encoding a string uses? Well, there are standard ways to do this. For an email message, you are expected to have a string in the header of the form

Content-Type: text/plain; charset="UTF-8"

For a web page, the original idea was that the web server would return a similar Content-Type http header along with the web page itself -- not in the HTML itself, but as one of the response headers that are sent before the HTML page. 

This causes problems. Suppose you have a big web server with lots of sites and hundreds of pages contributed by lots of people in lots of different languages and all using whatever encoding their copy of Microsoft FrontPage saw fit to generate. The web server itself wouldn't really know what encoding each file was written in, so it couldn't send the Content-Type header.

It would be convenient if you could put the Content-Type of the HTML file right in the HTML file itself, using some kind of special tag. Of course this drove purists crazy... how can you read the HTML file until you know what encoding it's in?! Luckily, almost every encoding in common use does the same thing with characters between 32 and 127, so you can always get this far on the HTML page without starting to use funny letters:

<html>

<head>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

But that meta tag really has to be the very first thing in the <head> section because as soon as the web browser sees this tag it's going to stop parsing the page and start over after reinterpreting the whole page using the encoding you specified.

Binary Coded Decimal (BCD)

Binary coded decimal, or BCD, is a binary encoding scheme where each denary/decimal digit is represented by its own binary pattern (usually 4 or 8 bits per digit). BCD is useful when displaying data to users on segmented displays (e.g. old calculators, before dot-matrix and OLED displays became popular).  The circuitry needed to display each digit is simpler when the number is stored in BCD.  BCD is not ideal, however, as there is wastage: a 4-bit pattern can represent 16 values, but we only ever use 0000 to 1001 (the digits 0-9).

[Image: a seven-segment display]

To find out more about BCD, here is a link to the Learning Electronics site.  In addition, this Wikipedia entry gives more detail regarding BCD.  An extract from this entry is given at the end of this section.

Converting From/To BCD

Converting between BCD and standard binary is straightforward, if you are happy to convert the long way.

Note: The choice of which method to use is yours. Both methods can be implemented in code, but the shift +3 method is far more efficient and simpler to implement with electronics.  For the exam, you simply need to convert between the two binary representations.  No particular method is expected or assumed.

http://hyperphysics.phy-astr.gsu.edu/hbase/electronic/number3.html

Easy Method

To convert from BCD to binary: read each 4-bit group as a single decimal digit, write down the resulting decimal number, then convert that decimal number to binary in the usual way.

To convert from binary to BCD: convert the binary number to decimal, then replace each decimal digit with its own 4-bit binary pattern, as in the sketch below.
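Here is a minimal Python sketch of that long-way-round conversion (the function names are illustrative); each decimal digit becomes its own 4-bit group.

def binary_to_bcd(bits):
    # binary -> decimal -> one 4-bit group per decimal digit
    decimal = int(bits, 2)
    return " ".join(format(int(d), "04b") for d in str(decimal))

def bcd_to_binary(bcd):
    # BCD groups -> decimal digits -> decimal -> binary
    digits = "".join(str(int(group, 2)) for group in bcd.split())
    return format(int(digits), "b")

print(binary_to_bcd("11110011"))        # 0010 0100 0011  (i.e. 243)
print(bcd_to_binary("0010 0100 0011"))  # 11110011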

More Efficient Method

The more efficient method is not as straightforward, but is much more easily implemented in code.  Rather than duplicate the method, here are some web resources that explain both conversion types.  

The conversion from BCD to binary uses a similar method.  Here it is explained in text.
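For completeness, here is a sketch of the shift-and-add-3 approach (often called "double dabble") that the resources above describe; the digit count and names are just for illustration, and this is Python rather than the hardware version shown in the links.

def double_dabble(value, digits=3):
    # binary -> BCD: repeatedly "add 3 to any nibble of 5 or more, then shift left by one bit"
    bcd = 0
    for i in range(value.bit_length() - 1, -1, -1):
        for d in range(digits):
            if ((bcd >> (4 * d)) & 0xF) >= 5:
                bcd += 3 << (4 * d)
        bcd = (bcd << 1) | ((value >> i) & 1)   # shift in the next bit of the binary value
    return " ".join(format((bcd >> (4 * d)) & 0xF, "04b") for d in range(digits - 1, -1, -1))

print(double_dabble(243))  # 0010 0100 0011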

Applications

BCD is very common in electronic systems where a numeric value is to be displayed, especially in systems consisting solely of digital logic, and not containing a microprocessor. By utilising BCD, the manipulation of numerical data for display can be greatly simplified by treating each digit as a separate single sub-circuit. This matches much more closely the physical reality of display hardware—a designer might choose to use a series of separate identical seven-segment displays to build a metering circuit, for example. If the numeric quantity were stored and manipulated as pure binary, interfacing to such a display would require complex circuitry. Therefore, in cases where the calculations are relatively simple working throughout with BCD can lead to a simpler overall system than converting to binary.

The same argument applies when hardware of this type uses an embedded microcontroller or other small processor. Often, smaller code results when representing numbers internally in BCD format, since a conversion from or to binary representation can be expensive on such limited processors. For these applications, some small processors feature BCD arithmetic modes, which assist when writing routines that manipulate BCD quantities.

The comparisons below give a little more depth comparing BCD to binary.  Many of these reasons are beyond what you need to know for the exam.

Comparison with pure binary

Advantages

Disadvantages

Extract from Wikipedia link given above.