Two character encoding standards define how characters are decoded from ones and zeros into the text you see on the screen right now, and into the different languages viewed every day on the World Wide Web. These two encoding standards are ASCII and Unicode.
The American Standard Code for Information Interchange (ASCII) was developed to create an international standard for encoding the Latin alphabet. ASCII was adopted in 1963 so that information could be exchanged between computers; it represents lowercase and uppercase letters, numbers, symbols, and some control commands. ASCII is encoded using ones and zeros, the base 2 number system, and uses seven of these binary digits (bits) per character. Seven bits allow 2 to the power of 7 = 128 possible combinations of digits, so 128 different characters can be encoded.
ASCII therefore made sure that 128 important characters could be encoded.
- You already know how to convert between denary and binary numbers
- You now need to turn letters into binary numbers
- Every character has a corresponding denary number (for example, A → 65)
- ASCII uses 7 bits
- We use the first 7 columns of the conversion table to create 128 different numbers (from 0 to 127)
For example, 1000001 gives us the number 65 (64 + 1), which corresponds to the letter ‘A’.
64  32  16  8  4  2  1
 1   0   0  0  0  0  1
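You can check this conversion on a computer too. Here is a minimal Python sketch using the built-in ord and format functions; any letter would work in place of ‘A’:

    number = ord('A')               # ord() gives the denary number for a character: 65
    binary = format(number, '07b')  # write that number using 7 binary digits: '1000001'
    print(number, binary)           # prints: 65 1000001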
Here’s how ‘HELLO’ is encoded in ASCII in binary:
Latin character    ASCII
H                  1001000
E                  1000101
L                  1001100
L                  1001100
O                  1001111
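Here is a short Python sketch that reproduces the table above, converting each letter to its denary number and then to 7 binary digits:

    for letter in 'HELLO':
        print(letter, format(ord(letter), '07b'))
    # H 1001000
    # E 1000101
    # L 1001100
    # L 1001100
    # O 1001111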
Let’s apply this theory in practice:
- Open Notepad, or whichever plain text editor you prefer
- Type a message and save it, e.g. ‘data is beautiful’
- Look at the size of the file: mine is 18 bytes
- Now, add another word, e.g. ‘data is SO beautiful’
- If you look at the file size again, you’ll see that it has changed: my file is now 3 bytes larger, one byte each for the ‘S’, the ‘O’, and the space before them (see the sketch after this list)
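If you prefer to try the experiment in code, here is a sketch of the same idea in Python. The file name message.txt is just an example, and your editor may add an extra byte or two for a final newline:

    import os

    # Each ASCII character is stored as exactly one byte, so the file size
    # in bytes equals the number of characters written.
    with open('message.txt', 'w', encoding='ascii') as f:
        f.write('data is beautiful')
    print(os.path.getsize('message.txt'))   # 17 bytes

    with open('message.txt', 'w', encoding='ascii') as f:
        f.write('data is SO beautiful')
    print(os.path.getsize('message.txt'))   # 20 bytes: 3 more, for 'S', 'O' and the space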
Because ASCII encodes characters in 7 bits, the move to 8-bit computing technology left one extra bit spare. With this extra digit, Extended ASCII could encode up to 256 characters. The problem was that different countries did different things with this extra capacity: many added their own characters, so the same numbers represented different characters in different languages. Japan even created several systems for encoding Japanese, depending on the hardware, and these methods were incompatible with each other. So when a message was sent from one computer to another, the received message could become garbled and unreadable; the Japanese encoding systems were so complex that even a message sent from one type of Japanese computer to another could turn into ‘mojibake’, garbled nonsense text.
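To see mojibake for yourself, here is a minimal Python sketch: it takes the Japanese greeting こんにちは, encodes it with one system (Shift_JIS), and then wrongly decodes the bytes with a Western encoding (Windows-1252):

    text = 'こんにちは'              # 'hello' in Japanese
    raw = text.encode('shift_jis')   # the bytes a Shift_JIS system would send
    print(raw.decode('cp1252'))      # ‚±‚ñ‚É‚¿‚Í  <- garbled, unreadable characters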
The problem of incompatible encoding systems became more urgent with the invention of the World Wide Web, as people shared digital documents all over the world, using multiple languages. To address the issue, the Unicode Consortium established a universal encoding system called Unicode. Unicode encodes more than 100,000 characters, covering all the characters you would find in most languages. Unicode assigns each character a specific number (its code point), rather than a fixed binary representation. But there were some issues with this, for example:
To encode more than 100,000 characters with a fixed-width code, around 32 binary digits (four bytes) would be used per character. Unicode keeps the ASCII values for the English alphabet, so A is still 65. However, encoded in 32 bits, the binary representation for the letter A would be 00000000000000000000000001000001. This wastes a lot of valuable space!
Many older computers interpret eight zeros in a row (a null) as the end of a string of characters, so they wouldn’t send any characters that came after eight zeros in a row: they wouldn’t send an A if it was represented as 00000000000000000000000001000001.
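Both problems are easy to see if you encode a character with a fixed-width 32-bit scheme. The sketch below uses Python’s UTF-32 codec (big-endian, without a byte-order mark) as an example of such a scheme:

    a32 = 'A'.encode('utf-32-be')
    print(a32)                 # b'\x00\x00\x00A': four bytes, three of them zero
    print(len(a32) * 8)        # 32 bits for a character that ASCII stores in 7
    # Those zero bytes are exactly the 'eight zeros in a row' that older
    # software could mistake for the end of the string.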
The Unicode encoding method UTF-8 solves these problems:
- For characters up to number 127, the regular ASCII value is used (so, for example, A is 01000001)
- For any character from number 128 upwards, UTF-8 splits the code across two bytes, adding ‘110’ to the start of the first byte to show that it is a beginning byte, and ‘10’ to the start of the second byte to show that it follows the first byte.
So, for each character from number 128 upwards, you have two bytes:
[110xxxxx] [10xxxxxx]
And you just fill in the binary for the character’s number in the x positions:
[11000101] [10000101] (that's the number 325 → 00101000101)
This works up to character number 2047, since the two bytes leave room for 11 binary digits (2048 different numbers). For characters beyond that, one more ‘1’ is added at the beginning of the first byte and a third byte is also used:
[1110xxxx] [10xxxxxx] [10xxxxxx]
This gives you 16 spaces for binary code. In this manner, UTF-8 goes up to four bytes:
[11110xxx] [10xxxxxx] [10xxxxxx] [10xxxxxx]
In this way, UTF-8 avoids the problems mentioned above without needing an index, it lets you find the start of each character even when reading the bytes backwards, and it stays backwards-compatible with ASCII.
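You can check these byte patterns with Python’s built-in UTF-8 codec; the specific characters below are just examples we picked:

    print(format(ord('A'), '08b'))                # 01000001: one byte, plain ASCII
    print([format(b, '08b') for b in chr(325).encode('utf-8')])
    # ['11000101', '10000101']: 110xxxxx 10xxxxxx with the bits of 325 filled in
    print([format(b, '08b') for b in '€'.encode('utf-8')])
    # ['11100010', '10000010', '10101100']: three bytes for the euro sign (number 8364)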
There are many fun activities for teaching character encoding. We have included two exercises below for you to try in your classroom. What top tips do you have for teaching character encoding? Share them in the comments!
- Translating secret messages: post a short secret message in ASCII in the comments section, and translate or respond to other participants’ ASCII messages
- Binary bracelets: create bracelets using different coloured beads to represent ones and zeros, and spell out an initial or a name in ASCII
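If you want to check answers quickly, here is a small Python sketch for decoding a secret message written as 7-bit binary codes separated by spaces; the message below is a made-up example:

    secret = '1001000 1001001'    # a hypothetical message posted by a participant
    print(''.join(chr(int(code, 2)) for code in secret.split()))   # HI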