Representing Text & Lossless Compression

Text and ASCII

In modern computers, text is represented using Unicode. In particular, the UTF-8 encoding is popular and, for English text, is essentially identical to the old ASCII encoding that you are required to know about for this course. For a quick explanation, watch the Unicode Miracle video.
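To see the overlap between UTF-8 and ASCII for yourself, here is a minimal sketch using Python's built-in `str.encode`. The sample strings are just illustrative choices.

```python
# For plain English text, UTF-8 and ASCII produce exactly the same bytes.
text = "Hello"
utf8_bytes = text.encode("utf-8")
ascii_bytes = text.encode("ascii")
print(list(utf8_bytes))            # one byte per character
print(utf8_bytes == ascii_bytes)   # True

# A character outside ASCII needs more than one byte in UTF-8.
euro = "€"
print(len(euro.encode("utf-8")))   # 3
```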

ASCII stands for American Standard Code for Information Interchange and it was standardised in the 1960s. The original ASCII uses 7 bits to give 2⁷ = 128 characters that include the English alphabet, numerals and punctuation. This was extended to 8-bit encodings with the "modern" microcomputers of the 70s and 80s.

Origins of ASCII, Unicode & UTF8

Older Encodings were used, e.g.

You can generate your own table in Python with code like:

for i in range(32, 127):  # 32 to 126 are printable; 127 is DEL, a control character
    print(i, "=", hex(i)[2:].upper(), "->", chr(i))

You should know that the digits 0-9, capitals A-Z and lowercase a-z are each stored in sequence. Punctuation is scattered through the gaps in no obvious order.

00NNNNN are control characters (from the old teletypes). My favourite is 07₁₆ = 0000111₂...
01NNNNN are the first set of printable characters
- 0110000 = 0, 0110001 = 1, ... 0111001 = 9.
10NNNNN contain the capital letters
- 1000001 = A, 1000010 = B, ... 1011010 = Z.
11NNNNN contain the lowercase letters
- 1100001 = a, 1100010 = b, ... 1111010 = z.
- Just flip the 2nd bit from the left (add/subtract 32) to switch between capitals and lowercase!
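The bit patterns above can be checked with a short sketch. The function names `group` and `flip_case` are made up for this illustration, and `flip_case` only makes sense for the letters A-Z and a-z.

```python
def group(ch):
    """Describe a character by the top two bits of its 7-bit code."""
    top_two = ord(ch) >> 5   # throw away the lowest 5 bits
    names = {0: "control", 1: "first printable set",
             2: "capitals region", 3: "lowercase region"}
    return names[top_two]

def flip_case(letter):
    """Toggle the bit worth 32 (the 2nd bit from the left in 7 bits)."""
    return chr(ord(letter) ^ 32)

for ch in "7Aa":
    print(ch, format(ord(ch), "07b"), group(ch))

print(flip_case("A"), flip_case("z"))   # a Z
```

The XOR operator `^` flips a bit whichever way it is currently set, so the same function converts in both directions.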

You see ASCII character codes represented in hex when a non-ASCII or reserved symbol is used in a URL. E.g., example.com/products%20and%20services.html, where %20 is the space character (20₁₆ = 32). See URL Percent Encoding for more details.
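Python's standard library can do percent encoding for you. This small sketch uses `quote` and `unquote` from `urllib.parse` on the file name from the example URL.

```python
from urllib.parse import quote, unquote

path = "products and services.html"
encoded = quote(path)

print(encoded)            # products%20and%20services.html
print(hex(ord(" ")))      # 0x20, the ASCII code for a space
print(unquote(encoded))   # products and services.html
```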

Character arithmetic: Using the fact that the alphabet is stored in sequence, you can add or subtract numbers to the character codes to move around the alphabet. This is quite common in exam questions. E.g.,

Q: Given that the ASCII character code for ‘A’ is 65, what is the 7-bit binary representation of the character ‘H’?
Ans: A = 65 = 1000001. H is the 8th letter, so it is 7 places after A: 65 + 7 = 72, giving H = 1001000.
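You can check this kind of answer in Python with `ord`, `chr` and `format`. The variable names here are just for illustration.

```python
code_a = ord("A")                  # 65
position = 8                       # H is the 8th letter of the alphabet
code_h = code_a + (position - 1)   # 7 places after A

print(code_h, chr(code_h), format(code_h, "07b"))   # 72 H 1001000
```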

Lossless Compression

Text (including source code, vector graphics, etc.) must be compressed losslessly; otherwise it would be garbled when decompressed. The types of compression that you need to know for the IGCSE are:

Dictionary based lossless compression - see for example Text Compression @ Code.org
- Replace common character sequences with a dictionary reference. You have to store both the compressed text AND the dictionary.
LZ Compression - see CS Field Guide - General Purpose Compression
- Replace repeated character sequences with a pointer back to an earlier occurrence (roughly speaking!)
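As a rough illustration of the dictionary idea, here is a minimal sketch that assumes the "common sequences" are whole words separated by single spaces. Real schemes are cleverer about what goes in the dictionary, and the function names `compress` and `decompress` are made up for this example.

```python
def compress(text):
    """Return (dictionary, references) for space-separated text."""
    dictionary = []   # each distinct word is stored once
    references = []   # the text as a list of positions in the dictionary
    for word in text.split(" "):
        if word not in dictionary:
            dictionary.append(word)
        references.append(dictionary.index(word))
    return dictionary, references

def decompress(dictionary, references):
    """Rebuild the original text by looking up every reference."""
    return " ".join(dictionary[i] for i in references)

text = "ask not what your country can do for you ask what you can do for your country"
dictionary, references = compress(text)
print(dictionary)
print(references)
print(decompress(dictionary, references) == text)   # True, nothing was lost
```

Notice that both `dictionary` and `references` are needed to get the text back, which is why both have to be stored.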

You need to be able to describe how such compression schemes work in exam-style questions.
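To make the "pointer back" idea concrete, here is a deliberately naive LZ-style sketch. It is an assumption-laden toy, not the exact algorithm used by any real tool: the token format, the `window` size and the minimum match length of 3 are all arbitrary choices for illustration.

```python
def lz_compress(text, window=255):
    """Return a list of tokens: single characters, or (distance_back, length) pointers."""
    tokens = []
    i = 0
    while i < len(text):
        best_len, best_dist = 0, 0
        # Search the text already seen for the longest match with what comes next.
        for j in range(max(0, i - window), i):
            length = 0
            while i + length < len(text) and text[j + length] == text[i + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= 3:              # a pointer is only worth it for longer matches
            tokens.append((best_dist, best_len))
            i += best_len
        else:
            tokens.append(text[i])     # otherwise store the character itself
            i += 1
    return tokens

def lz_decompress(tokens):
    """Rebuild the text by following each pointer back into the output so far."""
    out = []
    for token in tokens:
        if isinstance(token, tuple):
            dist, length = token
            for _ in range(length):
                out.append(out[-dist])   # copy one character at a time
        else:
            out.append(token)
    return "".join(out)

tokens = lz_compress("abcabcabc")
print(tokens)                  # ['a', 'b', 'c', (3, 6)]
print(lz_decompress(tokens))   # abcabcabc
```

Unlike the dictionary approach, there is no separate dictionary to store: the text that has already been seen acts as the dictionary.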

Extension: Huffman coding is a way of generating a variable-length encoding for text (and other data) that will lead to lossless compression. This is in the equivalent AQA course to ours! See the CS Field Guide link above for more detail.
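If you are curious, here is a compact sketch of Huffman coding using Python's `heapq` and `collections.Counter`. The function name `huffman_codes` and the way the tree is stored (as dictionaries of partial codes) are choices made for this illustration; textbooks usually draw it as a tree.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a {character: bit-string} table; frequent characters get shorter codes."""
    counts = Counter(text)
    if not counts:
        return {}
    if len(counts) == 1:                 # only one distinct character
        return {ch: "0" for ch in counts}
    # Heap entries: (frequency, tie_breaker, {character: code_so_far})
    heap = [(freq, n, {ch: ""}) for n, (ch, freq) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Repeatedly merge the two least frequent groups.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in left.items()}
        merged.update({ch: "1" + code for ch, code in right.items()})
        heapq.heappush(heap, (f1 + f2, next_id, merged))
        next_id += 1
    return heap[0][2]

text = "abracadabra"
codes = huffman_codes(text)
encoded = "".join(codes[ch] for ch in text)
print(codes)
print(len(encoded), "bits, compared with", 7 * len(text), "bits in 7-bit ASCII")
```

The exact codes can vary depending on how ties are broken, but the total length for a given text is always the same. As with dictionary compression, the code table has to be stored alongside the encoded bits.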

Further Resources

BBC Bitesize: Character Sets & ASCII
