Representing Text

Course Content Specification

As we have already discussed, a computer can only store binary digits (bits), each of which is either 0 or 1. It has to use the same method to store text as it does to store numbers. To store text, it assigns each character a binary code. The best-known scheme for doing this is ASCII.

ASCII Codes

ASCII stands for the American Standard Code for Information Interchange. It was originally designed for use by teletype machines, but was adopted for use in computers. It was originally a 7-bit code, with the leftmost bit of each byte available for error checking. IBM later began to introduce an 8-bit version, which became known as extended ASCII. Extended ASCII formed the basis of the ISO 8859 standard. Until 2008 the bulk of the pages on the web were encoded in ASCII. It has now been superseded by UTF-8 (Unicode) encoded pages (further reading).

ASCII works by assigning a numerical value to each text character used by the computer (which is then stored in binary). A single ASCII character takes up one byte of storage.

For example: A => ASCII Value 65 – Stored as 0100 0001
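The example above can be checked with a short Python sketch. It relies on Python's built-in ord and chr functions, which work on Unicode code points (these match ASCII for the first 128 characters). The helper name to_binary is only illustrative and is not part of the course material.

```python
def to_binary(character):
    """Return the 8-bit binary pattern for a single ASCII character."""
    code = ord(character)            # numerical value of the character
    if code > 127:
        raise ValueError("not an ASCII character")
    return format(code, "08b")       # pad with zeros to one byte (8 bits)

print(ord("A"))        # 65
print(to_binary("A"))  # 01000001
print(chr(65))         # A
```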

ASCII Table Sample

A full ASCII table can be seen here

Control Characters

Shown above are some of the printable characters, but not every character in the ASCII table is printable. For example, the Delete key (ASCII code 127) does not display anything on the screen, but the computer has to be able to detect when it has been pressed. Delete is an example of a control character.

Control characters are non-printable characters that have an effect on the screen or on the text, such as those produced by the Tab key and the Enter key.
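As a rough illustration, the ASCII control characters occupy codes 0 to 31 plus code 127. The sketch below assumes that definition; the function name is_control is hypothetical, and "\n" (line feed) is used here as one of the codes an Enter key can produce.

```python
def is_control(character):
    """Return True if the character is an ASCII control character."""
    code = ord(character)
    return code < 32 or code == 127   # codes 0-31 and 127 (Delete)

print(ord("\t"), is_control("\t"))      # 9 True    (Tab)
print(ord("\n"), is_control("\n"))      # 10 True   (line feed)
print(ord("\x7f"), is_control("\x7f"))  # 127 True  (Delete)
print(ord("A"), is_control("A"))        # 65 False  (printable)
```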

Character Sets

The entire set of characters that the computer can represent is known as the character set. There will be different character sets in use by different countries.

Disadvantages of ASCII

ASCII has its problems. With the advent of the Internet, people all over the globe communicate with each other, which raises the question of how to represent the myriad of languages and the symbols those languages use.

ASCII, even in its extended form, is only an 8-bit code, which means a maximum of 256 characters can be represented. However, the Chinese language has over 3,500 commonly used characters, none of which are represented by the ASCII code.
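The limit comes from simple arithmetic: a code that is n bits wide can hold 2 to the power n different values. The sketch below works this out for 7 and 8 bits and then checks whether a character can be encoded as ASCII. The function names are illustrative, and the Chinese character 中 is just a sample chosen for this example.

```python
def number_of_codes(bits):
    """Return how many different values a code of the given width can hold."""
    return 2 ** bits

print(number_of_codes(7))  # 128
print(number_of_codes(8))  # 256

def fits_in_ascii(character):
    """Return True if the character can be encoded as ASCII."""
    try:
        character.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False

print(fits_in_ascii("A"))   # True
print(fits_in_ascii("中"))  # False
```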

To overcome this limitation the Unicode method of representation is used.

こんにちわ = Konnichiwa, a Japanese greeting

Unicode

Unicode is an alternative method of representing text. It was originally a 16-bit code, which means that a maximum of 65,536 characters could be represented using this notation; later versions extended the range further. ASCII still forms the basis for this, with the first 128 characters being common to the ASCII and Unicode character sets.
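The overlap between ASCII and Unicode can be checked with a small sketch. It assumes Python's built-in "ascii" and "utf-8" codecs and compares the bytes each one produces for the first 128 codes; same_as_ascii is a made-up helper name.

```python
def same_as_ascii(code):
    """Return True if this code is stored identically in ASCII and UTF-8."""
    character = chr(code)
    return character.encode("ascii") == character.encode("utf-8")

# Every one of the first 128 codes gives the same byte in both encodings
print(all(same_as_ascii(code) for code in range(128)))  # True

print(ord("A"))  # 65, the same value as in the ASCII table
print(2 ** 16)   # 65536, the number of values a 16-bit code can hold
```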

Japanese shows why this extra range is needed: the language has approximately 50,000 characters, although just over 2,000 are on the Japanese Government's recommended list (SJLFAQ). None of these can be represented in ASCII.

The Unicode standard is now at Version 15.1 (September 2023) and is constantly being updated. The most recent version added 627 new characters (Unicode), bringing the total number of characters represented to 149,813.

Unicode Formats

The most common encodings are the Unicode Transformation Formats (UTF), such as UTF-8 and UTF-16.

There is also a 32-bit encoding, UTF-32.
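To see how the formats differ, the sketch below encodes a single character with Python's built-in codecs and prints the resulting bytes in hexadecimal. The big-endian codec names ("utf-16-be", "utf-32-be") are chosen here only so that no byte order mark is added; show_encodings is an illustrative name.

```python
def show_encodings(character):
    """Return the bytes of one character in three Unicode encodings, as hex."""
    return {
        "UTF-8": character.encode("utf-8").hex(),
        "UTF-16": character.encode("utf-16-be").hex(),
        "UTF-32": character.encode("utf-32-be").hex(),
    }

print(show_encodings("A"))
# {'UTF-8': '41', 'UTF-16': '0041', 'UTF-32': '00000041'}

print(show_encodings("こ"))
# {'UTF-8': 'e38193', 'UTF-16': '3053', 'UTF-32': '00003053'}
```

UTF-8 uses a variable number of bytes per character, while UTF-32 always uses four.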

The image below shows that, as of 2012, UTF-8 encoding was used for the majority (over 60%) of pages on the web. A webpage showing Unicode characters can be found here.

This means that a much larger variety of languages and characters can be represented.

Advantages and Disadvantages of Unicode

Advantage: Unicode can represent more symbols than ASCII.

Advantage: Unicode can represent more languages.

Disadvantage: Unicode can use more storage space per character than ASCII.
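The storage cost can be measured directly. The sketch below, which assumes Python's built-in codecs and uses the made-up helper storage_in_bytes, counts the bytes needed to store the same five-letter word in different encodings. The little-endian codec names are used so that no byte order mark is counted.

```python
def storage_in_bytes(text, encoding):
    """Return the number of bytes needed to store text in the given encoding."""
    return len(text.encode(encoding))

message = "Hello"
for encoding in ("ascii", "utf-8", "utf-16-le", "utf-32-le"):
    print(encoding, storage_in_bytes(message, encoding))
# ascii 5
# utf-8 5
# utf-16-le 10
# utf-32-le 20
```

Note that UTF-8 needs no extra space for plain ASCII text; the extra storage only appears for characters outside the ASCII range or with the wider encodings.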
