From earlier sections in this course, you know that computers do not store digital media as letters, numbers, sounds, and pictures. Instead, computers work with bits: binary digits that have a value of 1 or 0 (on or off). To get from these bits to everything you see on a screen, the characters (for example, letters) shown on-screen need to be encoded. Here, we will go through a little history of how computers encode language.
To understand character encoding, let's go back to 1836, when Morse code was invented. Morse code was used with telegraph technology to send messages electronically over long distances. Instead of ones and zeros, however, Morse code uses dots (.), dashes (-), and pauses, all of which we can describe as "symbols". For example, to spell "HELLO" in Morse code, you would use this code:
Character    Morse code
H            ....
E            .
L            .-..
L            .-..
O            ---
We have come a long way since representing letters with Morse code, but the principles are still very similar: we encode information in a form that machines can work with, and computers bring it to life. Let's look at how we have progressed from Morse code to all the text you see on displays and screens.
If you look at the Morse code representation of 'HELLO', you can see that the codes have different lengths: H and L are encoded with four symbols, E with one symbol, and O with three symbols. This is fine if you want human operators to understand the code, such as the people who sent messages by telegraph. However, when you need characters to be processed mechanically by a computer, so much variation in code length causes problems.
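To make this concrete, here is a small Python sketch (only covering the letters that appear in 'HELLO') that encodes the word and prints how many symbols each letter needs:

# A minimal sketch: Morse codes for just the letters in 'HELLO'.
MORSE = {
    "H": "....",
    "E": ".",
    "L": ".-..",
    "O": "---",
}

for letter in "HELLO":
    code = MORSE[letter]
    # Print the letter, its Morse code, and the number of symbols in the code.
    print(letter, code, len(code))

Running this shows code lengths of 4, 1, 4, 4, and 3 symbols: exactly the kind of variation that makes mechanical processing awkward.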
In 1874, Émile Baudot invented a five-bit code to fix this problem. With this code, up to 60 characters can be encoded using the keyboard shown below; beside it is an example of the Baudot code produced by this keyboard.
However, being able to encode only 60 characters is not enough to cover everything we might want to represent, such as lowercase letters and a fuller range of punctuation and symbols.
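One way to see where the figure of roughly 60 comes from is to count the patterns that five bits allow. This is only an illustrative sketch: Baudot's code used two 'shift' patterns to switch between a letter set and a figure set, which roughly doubles the number of usable characters.

# Five bits give 2**5 = 32 possible patterns.
patterns = 2 ** 5
print(patterns)            # 32

# If two patterns act as 'letter shift' and 'figure shift', each remaining
# pattern can mean one thing in each set, giving about 60 characters.
print(2 * (patterns - 2))  # 60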
Because Baudot code represents each character with five bits, five bits became an important unit of measurement: think of it as the 1874 version of a byte. In fact, the term 'byte' wasn't coined until 1956, and by then computers such as the IBM Stretch were designed to represent data using a maximum of eight bits. This is why a byte means eight bits!
To this day, a byte is still considered to be eight bits. As computer hardware became more complex, people created new computer architectures that built easily on existing technology by doubling the existing bit width, keeping it a power of 2. That is why today we have 64-bit CPUs and character encoding methods that use 32 bits. In the next section, we will look at how ASCII represents characters using up to 8 bits, or one byte, of data.
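As a quick illustration of characters fitting into a byte, the short Python sketch below shows how many values 8 bits can hold and how the letter 'A' looks as a pattern of bits (ASCII itself is covered properly in the next section):

# One byte is 8 bits, so it can hold 2**8 = 256 different values.
print(2 ** 8)                                  # 256

# In ASCII, the letter 'A' is stored as the number 65, which fits in one byte.
code_point = ord("A")
print(code_point, format(code_point, "08b"))   # 65 01000001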
A byte of data is defined as 8 bits, and because many files stored on computers are much bigger than this, giving file sizes as a plain number of bytes would be unwieldy and hard to comprehend. Instead, we use prefixes before 'byte' to represent larger numbers. Because computers use binary, each prefix represents a value that is 2 to the power of 10, or 1024, times the previous one.
Value              Equal to          In bytes
1 kilobyte (KB)    1024 bytes        1024
1 megabyte (MB)    1024 kilobytes    1048576
1 gigabyte (GB)    1024 megabytes    1073741824
1 terabyte (TB)    1024 gigabytes    1099511627776
1 petabyte (PB)    1024 terabytes    1125899906842624
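The values in the 'In bytes' column are simply successive powers of 1024, which the small Python sketch below reproduces (the list of prefix names is just for the printout):

# Each prefix is 1024 times the previous one, so the byte values in the table
# are 1024 raised to increasing powers.
prefixes = ["kilobyte", "megabyte", "gigabyte", "terabyte", "petabyte"]
for power, name in enumerate(prefixes, start=1):
    print("1", name, "=", 1024 ** power, "bytes")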
To convert between these units, you divide by 1024 each time you go down a step in the table to a larger prefix. For example, 256000 bytes is 256000 / 1024 = 250 KB. Working this out in megabytes gives 250 / 1024 ≈ 0.244 MB.
To go up a step in the table to a smaller prefix, you multiply by 1024. So a three-terabyte hard drive can store 3 * 1024 = 3072 GB, or 3072 * 1024 = 3145728 MB.
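The same worked examples can be checked with a few lines of Python; the variable names here are just for illustration:

# Going down the table to a larger prefix: divide by 1024.
size_bytes = 256000
print(size_bytes / 1024)            # 250.0 (kilobytes)
print(size_bytes / 1024 / 1024)     # roughly 0.244 (megabytes)

# Going up the table to a smaller prefix: multiply by 1024.
drive_tb = 3
print(drive_tb * 1024)              # 3072 (gigabytes)
print(drive_tb * 1024 * 1024)       # 3145728 (megabytes)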
These prefix values often lead to confusion, because elsewhere (in scientific subjects, for example), 'kilo' means 1000 times, not 1024 times; 'mega' means 1000000 times, not 1048576 times; and so on. To try to prevent confusion, the units representing steps of 1024 are sometimes called kibibyte, mebibyte, gibibyte, tebibyte, pebibyte, and so on. However, these terms are not always used consistently.
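A common place to notice the difference is hard drive capacities. As a hypothetical example, a drive advertised as 500 GB (using powers of 1000) contains 500,000,000,000 bytes, which looks like only about 466 'GB' when reported in steps of 1024:

# Advertised capacity uses decimal prefixes: 500 GB = 500 * 1000**3 bytes.
advertised_bytes = 500 * 1000 ** 3

# Reported in binary steps of 1024 (gibibytes), the same number of bytes looks smaller.
print(advertised_bytes / 1024 ** 3)   # roughly 465.66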