1.1 Data Representation

Specification

Number Systems

As with the SI system of units, computer systems have a hierarchical system for representing bits.  A bit is the smallest unit of data that can be read and understood by a computer.  Each successive unit is made up of two or more bits.

A word, represented here as two bytes, actually depends on the operating system and processor.  In short, it is the largest unit of data that can be transferred to the processor in one clock cycle.

Bytes use the SI prefixes (kilo, mega, etc.), but because binary is base 2, each unit goes up in steps of 1024, not 1000.  There was, and still is, considerable confusion when using SI units in computing, because some people assume a kilobyte is 1000 bytes, while others, following the traditional computing convention used on this page, take a kilobyte to be 1024 bytes.

To avoid future confusion, the International Electrotechnical Commission (IEC) put forward a new system of unit representation, which is slowly catching on.  Under this new system the SI prefixes are strictly base 10, while the new binary prefixes (kibi, mebi, gibi, giving KiB, MiB, GiB and so on) are always base 2 (i.e. powers of 2, not 10).

For clarification, I will always refer to, and calculate, the SI units as base 2.

Number Base Conversion

A number base is simply how many symbols are available in that number system.  A binary digit can be either 0 or 1, which makes binary base 2.  Decimal (denary) is base 10 because we have the symbols 0 to 9.  Hexadecimal is base 16 because we have 0 to 9 and A, B, C, D, E and F.

Computer systems use binary, humans use denary, and hexadecimal (often called hex) is better for expressing large binary values compactly.  Because these different number bases coexist, we often need to convert between them.

The sections below break down the different bases, but this Wikibooks website gives a comprehensive background to what you need to know.

There are two general methods which work for any number base: one for converting to denary and one for converting from denary.

Converting Between Units

When converting from a larger unit to a smaller unit you will always need to multiply; when converting from a smaller unit to a larger unit, you divide.  E.g.
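For example (using the base-2 convention adopted on this page): 3 KB = 3 × 1024 = 3,072 bytes, while 8,192 bytes = 8,192 ÷ 1024 = 8 KB.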

[Number base] TO Denary

The video to the right shows how to convert from any base to base 10 (denary/decimal).  Remember to look at the Wikibooks link (above) for a comprehensive overview of all number base calculations.

This wiki page explains positional notation.
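If you want to experiment with positional notation yourself, here is a minimal Python sketch of the same idea (the function name to_denary is just for illustration): each digit is multiplied by its place value, working from the most significant digit.

def to_denary(digits, base):
    # digits is a string with the most significant digit first, e.g. "1101"
    value = 0
    for d in digits:
        # shift the running total one place to the left, then add the next digit
        value = value * base + int(d, base)
    return value

print(to_denary("1101", 2))   # 13
print(to_denary("2A7", 16))   # 679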

Denary TO [Number base]

The video on the right (linked to the one above) shows a universal method of converting from base 10.  There are other methods, such as using subtraction, to achieve this, but the method below is preferred.

This wiki page explains positional notation.
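As a companion to the sketch above, here is a minimal Python version of the repeated-division method (from_denary is again just an illustrative name): keep dividing by the base and read the remainders in reverse.

def from_denary(value, base):
    # repeatedly divide by the base; the remainders, read in reverse, are the digits
    symbols = "0123456789ABCDEF"
    if value == 0:
        return "0"
    digits = ""
    while value > 0:
        digits = symbols[value % base] + digits
        value //= base
    return digits

print(from_denary(13, 2))    # 1101
print(from_denary(679, 16))  # 2A7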

Binary TO Hexadecimal (and vice versa)

This is possibly the easiest conversion of all.  Because a single hexadecimal digit corresponds to exactly 4 bits (a nibble), you can simply do a lookup.  Look at the table below.

TIP: Drawing the table below is very simple once you see the binary pattern.  Moving from the least significant column to the most significant, the 0s and 1s repeat in blocks of 1, 2, 4 and 8: each column's block size doubles.

The decimal equivalent is shown here for clarification; it is not needed when converting.

Given the hex value D3F, to convert this to binary we simply look up the binary for D, 3 and F.

D = 1101

3 = 0011

F = 1111

Therefore D3F = 1101 0011 1111 in binary.

When converting from binary, always start at the least significant bit and group the bits in 4s.  If the final (leftmost) group has fewer than 4 bits, say only 2, pad it with extra leading 0s.  E.g.

101 1011 1001

  5      B     9

Notice how the 101 was read 0101.
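Here is a minimal Python sketch of the nibble-lookup approach (the function names are illustrative); note how it pads the leftmost group with leading 0s, just as described above.

def bin_to_hex(bits):
    bits = bits.replace(" ", "")
    # pad with leading 0s so the length is a multiple of 4
    bits = bits.zfill((len(bits) + 3) // 4 * 4)
    # each group of 4 bits becomes one hex digit
    return "".join(format(int(bits[i:i + 4], 2), "X") for i in range(0, len(bits), 4))

def hex_to_bin(hex_digits):
    # each hex digit becomes exactly 4 bits
    return " ".join(format(int(h, 16), "04b") for h in hex_digits)

print(bin_to_hex("101 1011 1001"))  # 5B9
print(hex_to_bin("D3F"))            # 1101 0011 1111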

Full Number Base Chart

Test Your Knowledge

Here are some websites and games to ensure you understand this content. Make your own too.

Interactive binary game

Cisco binary game

Why Binary?

Two's Complement

Two's complement is a modification of one's complement, which uses the most significant bit (MSB) as a sign indicator.  Remember that unsigned binary numbers are neither positive nor negative.  Two's complement improves on one's complement by removing the possibility of both a +0 and a -0.

Two's complement is not complicated.  It is an extremely popular way for computers to represent integers, and some floating-point (fraction-storing) methods also use two's complement.  To get the two's complement notation of a negative integer, you write out the magnitude of the number in binary, invert the digits, and add one to the result.  When writing out the binary number, always pad with the necessary leading 0s before converting to two's complement (exam questions will always give a certain number of bits in which the number MUST be represented).  Also note that the two's complement process ONLY needs to be followed when dealing with negatives.

An exception for largest negative value

Unfortunately, because the two's complement range is asymmetric (there is one more negative value than positive), there is an exception for the largest negative value.  This is the only exception, and the usual method of inverting and adding one doesn't really apply in the traditional sense.  I.e. the largest negative value in 5-bit two's complement is 10000 (-16) -- noting that there is NO positive equivalent, as the positive numbers only range from 0 to 15.

Converting To Two's Complement

Example 1:

Let's take an example using 8-bit storage and suppose we want to find how -35 would be expressed in two's complement notation.  First we write out 35 in binary form:

00100011

Then we invert the digits. 0 becomes 1, 1 becomes 0.

11011100

Then we add 1:

11011101

That is how one would write -35 in 8 bit binary.

You might notice that, using 8-bit two's complement, the lowest and highest values that can be stored are -128 and +127.  This is because, in the above example, the 8th bit is used for the sign.  To store values greater than +127 (or lower than -128) we need to increase the number of bits.

Example 2:

Let's take an example that can confuse students, again using 8-bit:  -128.  First write 128 in binary, which is:

10000000

Now invert the digits:

01111111

Now we add 1:

10000000

Until students get used to two's complement, they often feel the answer should be 11111111, which is in fact -1 (and +127 is 01111111).
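If you want to check examples like these, here is a minimal Python sketch of the invert-and-add-one process (twos_complement is just an illustrative name; bits is the fixed width the question specifies).

def twos_complement(value, bits=8):
    # write the magnitude in binary, padded with leading 0s to the required width
    pattern = format(abs(value), "0{}b".format(bits))
    if value >= 0:
        return pattern
    # invert every bit, then add 1, keeping only the required number of bits
    inverted = int(pattern, 2) ^ ((1 << bits) - 1)
    return format((inverted + 1) % (1 << bits), "0{}b".format(bits))

print(twos_complement(-35))    # 11011101
print(twos_complement(-128))   # 10000000
print(twos_complement(127))    # 01111111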

Converting From Two's Complement

To convert from two's complement back to decimal, you use exactly the same process (invert and add 1), as long as the number is NEGATIVE (i.e. has 1 as the MSB).  If the MSB is 0, the value is positive and can be read as ordinary binary.

Example 1:

Let's take this 8 bit two's complement: 

10001110.  

First invert the digits:

01110001

Add 1:

01110010

Remembering that the original was negative, 01110010 is 114 in decimal, so the value is -114.
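Here is the reverse process as a minimal Python sketch (again, the function name is just for illustration): if the MSB is 1, invert, add 1 and negate.

def from_twos_complement(pattern):
    # MSB of 0 means the value is positive: read it as ordinary binary
    if pattern[0] == "0":
        return int(pattern, 2)
    # MSB of 1 means negative: invert, add 1, then attach the minus sign
    bits = len(pattern)
    inverted = int(pattern, 2) ^ ((1 << bits) - 1)
    return -(inverted + 1)

print(from_twos_complement("10001110"))  # -114
print(from_twos_complement("01110010"))  # 114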

Other Benefits

A major benefit of two's complement is subtraction.  To subtract one number from another, we simply add the two's complement of the number being subtracted.

Example 1:

Let's take 12 - 12.  12 and -12 in 8-bit binary are:

12:  00001100

-12: 11110100

Added together are:

00000000, which is 0 (ignoring the carry out of the leftmost bit)

Example 2:

Let's take another example: 105 - 62

We convert -62 into two's complement and add together as shown below:

105: 01101001

-62:  11000010

Added together give:

00101011 = 43

Example 3:

Finally, let's calculate 43 - 87:

43:  00101011

-87: 10101001

Added together give:

11010100 = -44

This result is negative (the MSB is 1), so it was converted back to decimal using the two's complement conversion shown earlier.
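The three examples above can be reproduced with a short Python sketch (subtract_8bit is an illustrative name).  It adds the two's complement of the second number and keeps only 8 bits, which is what discards the carry out.

def subtract_8bit(a, b):
    # form the two's complement of b (invert and add 1, within 8 bits), then add
    neg_b = ((b ^ 0xFF) + 1) & 0xFF
    result = (a + neg_b) & 0xFF   # the & 0xFF discards the carry out of the 8th bit
    # interpret the 8-bit result as a signed two's complement value
    return result - 256 if result & 0x80 else result

print(subtract_8bit(12, 12))   # 0
print(subtract_8bit(105, 62))  # 43
print(subtract_8bit(43, 87))   # -44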

Storing a negative number in HEX

To convert a negative number to hexadecimal, you first need to write the number in two's complement binary, and then follow the process below.

Two's complement representation to HEX

To convert a two's complement binary pattern to hex, treat the bit pattern as an ordinary binary value and write that value in hex.  (You can use the same nibble shortcut as in any other binary --> hex conversion.)
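For instance, taking the result of Example 3 above: -44 is 11010100 in 8-bit two's complement, and grouping into nibbles gives 1101 (D) and 0100 (4), so -44 is stored as D4 in hex.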

Negative HEX to Decimal

The question will need to state which method is being used (e.g. two's complement).  You then convert the hexadecimal number to binary, invert the bits, add 1 and then calculate the value of the resulting binary, remembering that the final answer is negative.
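For instance, suppose a question gives the 8-bit two's complement value 8E in hex.  Converting to binary gives 1000 1110; the MSB is 1, so the value is negative.  Inverting gives 0111 0001 and adding 1 gives 0111 0010, which is 114, so 8E represents -114 (matching the example in the previous section).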

Two's Complement Calculator

Here is a two's complement converter so you can check your calculations.

Character Sets

Character sets are defined in terms of encoding mechanisms.  For the exam, you are expected to understand ASCII and Unicode.  You DO NOT need to know individual characters (e.g. A = 65), but you should be able to explain how the systems work and encode/decode if necessary.  The encoding mechanism simply dictates the rules for translating a string of characters into a sequence of bytes.  Encoding, in computing terms, is simply the process of taking a piece of data and deciding how to store it on a computer.

Note:  The character set does not determine how it is displayed or printed.  Each character is mapped to a particular glyph (think 'image' - usually a vector set of coordinates) in a given font.  A common font system is TrueType (designed by Apple).  It is the font that determines how, for example, character 65 is printed.  This is why in some fonts, 65 might be a musical note or other glyph (e.g. using Wingdings).

ASCII

The American Standard Code for Information Interchange, or ASCII for short, is a 7/8-bit character set used in English-speaking countries.  Its use has mostly been replaced by Unicode (predominantly UTF-8) but it is still used today for simple applications.  The original ASCII codes used 7 bits, but this was later extended to 8 bits (called extended ASCII).  The table below shows the standard 7-bit ASCII symbols (note: 0-31 are not printable characters, but control codes).  When viewing an ASCII file, not all characters will be displayed.  There are characters such as end of line, tab and carriage return that affect the formatting without being obvious.  This Wikibook page gives a little more detail along with sample exam questions.

https://upload.wikimedia.org/wikipedia/commons/thumb/1/1b/ASCII-Table-wide.svg/2000px-ASCII-Table-wide.svg.png

Click on the table to view the original image.

The following video explains ASCII in simple terms.  Feel free to jump around the video as some parts are either irrelevant or long-winded.

UNICODE

With interconnected computers and eventually the Internet, there grew a need for a universal encoding scheme that could be used by all countries, without having to deal with the hundreds of local encoding schemes that grew out of the computing boom.  A group was established to create such a universal encoding scheme, and it was called Unicode.  The simple aim: to create a gigantic table of all possible characters and assign each a unique number.  This is, in essence, what Unicode is.  However, and confusingly, Unicode doesn't specify how each character is stored on a computer; this is where the Unicode Transformation Format (UTF) encoding schemes come in.  There are three main ones: UTF-8, UTF-16 and UTF-32.

When talking about character encoding (e.g. ASCII or Unicode) we are referring to the storage of code points.  A code point is a value that represents a given character, unique from the others.  ASCII comprises 128 code points from 0 to 127 (7F in hex).  Extended ASCII ranges from 0 to 255 (FF in hex).  Unicode comprises 1,114,112 code points ranging from 0 to 1,114,111 (10FFFF in hex).  Code points are usually denoted in base 16 using the format U+0F4A, which represents the decimal value 3,914 (the Tibetan letter ཊ).
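If you have Python to hand, you can inspect code points directly: ord() returns a character's code point and chr() does the reverse (the U+ formatting below is just for display).

# print each character alongside its code point in hex (U+ notation) and decimal
for ch in ["A", "é", "ཊ"]:
    print(ch, "U+{:04X}".format(ord(ch)), ord(ch))

# Expected output:
# A U+0041 65
# é U+00E9 233
# ཊ U+0F4A 3914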

An excellent place to start when looking at Unicode is this article.  It explains Unicode in simple terms, and in more than enough detail for your exam.  The article explains UTF-8 (Unicode Transformation Format, 8-bit), which is different from UTF-16 and UTF-32.

Exam Note

CIE advise that for the exam, "Candidates need to know the most common Unicode standard(s). They must be able to explain what Unicode is and why it is used instead of ASCII."  My suggestion is that you familiarise yourself with UTF-8 and take a passing glance at UTF-16 and UTF-32.  Do not worry about other formats such as UTF-16LE, etc.

This Computerphile video by Tom Scott explains the advance from ASCII to UTF-8.

This Wikibooks page will give you an overview of Unicode along with some questions which could come up in the examination.

Unicode Extra Detail

Firstly, this Microsoft blog document gives some good background on ANSI/ASCII and UNICODE.

In UTF-8, every code point from 0-127 is stored in a single byte.  Only code points 128 and above are stored using 2, 3 or more bytes (up to 6 in the original design, although the current standard uses at most 4).  UTF-8 uses the following rules:

How UTF-8 works

Picture taken from Joel on Software.
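To see the variable-length encoding in action, here is a minimal Python sketch; .encode('utf-8') is the standard library call that performs the UTF-8 encoding, and the characters chosen are just examples.

# show how many bytes UTF-8 uses for characters with increasingly large code points
for ch in ["A", "é", "€", "😀"]:
    data = ch.encode("utf-8")
    print(ch, "U+{:04X}".format(ord(ch)), len(data), "byte(s):", data.hex(" "))

# Expected output:
# A U+0041 1 byte(s): 41
# é U+00E9 2 byte(s): c3 a9
# € U+20AC 3 byte(s): e2 82 ac
# 😀 U+1F600 4 byte(s): f0 9f 98 80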

UTF-8 has several convenient properties:

This somewhat humorous article gives more detail relating to Unicode.  The content below is an extract from this article.

There Ain't No Such Thing As Plain Text.

If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.

Almost every stupid "my website looks like gibberish" or "she can't read my emails when I use accents" problem comes down to one naive programmer who didn't understand the simple fact that if you don't tell me whether a particular string is encoded using UTF-8 or ASCII or ISO 8859-1 (Latin 1) or Windows 1252 (Western European), you simply cannot display it correctly or even figure out where it ends. There are over a hundred encodings and above code point 127, all bets are off.

How do we preserve this information about what encoding a string uses? Well, there are standard ways to do this. For an email message, you are expected to have a string in the header of the form

Content-Type: text/plain; charset="UTF-8"

For a web page, the original idea was that the web server would return a similar Content-Type http header along with the web page itself -- not in the HTML itself, but as one of the response headers that are sent before the HTML page. 

This causes problems. Suppose you have a big web server with lots of sites and hundreds of pages contributed by lots of people in lots of different languages and all using whatever encoding their copy of Microsoft FrontPage saw fit to generate. The web server itself wouldn't really know what encoding each file was written in, so it couldn't send the Content-Type header.

It would be convenient if you could put the Content-Type of the HTML file right in the HTML file itself, using some kind of special tag. Of course this drove purists crazy... how can you read the HTML file until you know what encoding it's in?! Luckily, almost every encoding in common use does the same thing with characters between 32 and 127, so you can always get this far on the HTML page without starting to use funny letters:

<html>

<head>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

But that meta tag really has to be the very first thing in the <head> section because as soon as the web browser sees this tag it's going to stop parsing the page and start over after reinterpreting the whole page using the encoding you specified.

Binary Coded Decimal (BCD)

Binary coded decimal, or BCD, is a binary encoding scheme where each denary/decimal digit is represented by its own binary pattern (usually 4 or 8 bits per digit). BCD is useful when displaying data to users on segmented displays (e.g. old calculators, before dot-matrix and OLED displays became popular).  The circuitry needed to display each digit is simpler when the number is stored in BCD.  BCD is not ideal, however, as there is wastage: a 4-bit pattern can represent 16 values, but we only ever use 0000 to 1001 (the digits 0-9).

[Image: a seven-segment display]

To find out more about BCD, here is a link to the Learning Electronics site.  In addition, this Wikipedia entry gives more detail regarding BCD.  An extract from this entry is given at the end of this section.

Converting From/To BCD

Converting between BCD and standard binary is straightforward, if you are happy to convert the long way.

Note: The choice of which method to use is yours. Both methods can be implemented in code, but the shift +3 method is far more efficient and simpler to implement with electronics.  For the exam, you simply need to convert between the two binary representations.  No particular method is expected or assumed.

http://hyperphysics.phy-astr.gsu.edu/hbase/electronic/number3.html

Easy Method

To convert from BCD to binary: read each 4-bit group as a single decimal digit, write down the resulting decimal number, then convert that decimal number to binary in the usual way.

To convert from binary to BCD: convert the binary number to decimal, then replace each decimal digit with its own 4-bit binary pattern, as in the sketch below.
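Here is a minimal Python sketch of that long-way-round conversion (the function names are illustrative); each decimal digit becomes its own 4-bit group.

def binary_to_bcd(bits):
    # binary -> decimal -> one 4-bit group per decimal digit
    decimal = int(bits, 2)
    return " ".join(format(int(d), "04b") for d in str(decimal))

def bcd_to_binary(bcd):
    # BCD groups -> decimal digits -> decimal -> binary
    digits = "".join(str(int(group, 2)) for group in bcd.split())
    return format(int(digits), "b")

print(binary_to_bcd("11110011"))        # 0010 0100 0011  (i.e. 243)
print(bcd_to_binary("0010 0100 0011"))  # 11110011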

More Efficient Method

The more efficient method is not as straightforward, but is much more easily implemented in code.  Rather than duplicate the method, here are some web resources that explain both conversion types.  

The conversion from BCD to binary uses a similar method.  Here it is explained in text.
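For completeness, here is a sketch of the shift-and-add-3 approach (often called "double dabble") that the resources above describe; the digit count and names are just for illustration, and this is Python rather than the hardware version shown in the links.

def double_dabble(value, digits=3):
    # binary -> BCD: repeatedly "add 3 to any nibble of 5 or more, then shift left by one bit"
    bcd = 0
    for i in range(value.bit_length() - 1, -1, -1):
        for d in range(digits):
            if ((bcd >> (4 * d)) & 0xF) >= 5:
                bcd += 3 << (4 * d)
        bcd = (bcd << 1) | ((value >> i) & 1)   # shift in the next bit of the binary value
    return " ".join(format((bcd >> (4 * d)) & 0xF, "04b") for d in range(digits - 1, -1, -1))

print(double_dabble(243))  # 0010 0100 0011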

Applications

BCD is very common in electronic systems where a numeric value is to be displayed, especially in systems consisting solely of digital logic, and not containing a microprocessor. By utilising BCD, the manipulation of numerical data for display can be greatly simplified by treating each digit as a separate single sub-circuit. This matches much more closely the physical reality of display hardware—a designer might choose to use a series of separate identical seven-segment displays to build a metering circuit, for example. If the numeric quantity were stored and manipulated as pure binary, interfacing to such a display would require complex circuitry. Therefore, in cases where the calculations are relatively simple working throughout with BCD can lead to a simpler overall system than converting to binary.

The same argument applies when hardware of this type uses an embedded microcontroller or other small processor. Often, smaller code results when representing numbers internally in BCD format, since a conversion from or to binary representation can be expensive on such limited processors. For these applications, some small processors feature BCD arithmetic modes, which assist when writing routines that manipulate BCD quantities.

The comparisons below give a little more depth comparing BCD to binary.  Many of these reasons are beyond what you need to know for the exam.

Comparison with pure binary

Advantages

Disadvantages

Extract from Wikipedia link given above.