Representing Text

Course Content Specification

As we have already discussed, a computer can only store binary digits (bits), each of which is either 0 or 1. It has to use the same method to store text as it does to store numbers. To store text, it assigns each character a binary code, known as an ASCII code.

ASCII Codes

ASCII stands for the American Standard Code for Information Interchange. It was originally designed for use by teletype machines, but was later adopted for use in computers. It was originally a 7 bit code, with the leftmost bit of each byte often being used for error checking. IBM then began to introduce 8 bit versions, which became known as Extended ASCII and use the extra bit to represent additional characters. Until 2008 the bulk of the pages on the web were encoded in ASCII. This has now been superseded by UTF-8 (Unicode) encoded pages (further reading).

ASCII works by assigning a numerical value to each text character used by the computer (which is then stored in binary). A single ASCII character takes up one byte of storage.

For example: A => ASCII value 65, stored as 0100 0001
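The conversion described above can be sketched in a few lines of Python. This is only an illustration: the function name to_ascii_binary is made up for this example, while ord() and format() are standard Python built-ins.

```python
def to_ascii_binary(character: str) -> str:
    """Return the 8 bit binary pattern for a single ASCII character.

    Hypothetical helper for illustration; not part of any library.
    """
    code = ord(character)  # numerical value of the character, e.g. 65 for "A"
    if code > 127:
        raise ValueError("not an ASCII character")
    return format(code, "08b")  # pad with leading zeros to one byte (8 bits)


print(ord("A"), to_ascii_binary("A"))  # 65 01000001
```

Going the other way, chr(65) turns the value 65 back into the character A.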

ASCII Table Sample

A full ASCII table can be seen here

Control Characters

Shown above are some of the printable characters, but not every character in the ASCII table is printable. For example, the Delete key (ASCII code 127) does not display anything on the screen, but the computer has to be able to detect when it has been pressed. The Delete key is an example of a control character.

These control characters are non-printable characters that have an effect on the screen, such as those produced by the tab key and the enter key.
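As a small sketch, Python can show the codes behind a few control characters and confirm that they are not printable. The dictionary name control_characters is just an illustrative choice, and the enter key is assumed here to produce a line feed (code 10); on some systems it produces a carriage return (code 13) instead.

```python
# A few ASCII control characters (the names are only labels for this example)
control_characters = {
    "tab": "\t",
    "enter (line feed, assumed)": "\n",
    "delete": "\x7f",
}

for name, ch in control_characters.items():
    # ord() gives the ASCII code; isprintable() is False for control characters
    print(name, ord(ch), ch.isprintable())

# A printable character for comparison
print("A", ord("A"), "A".isprintable())
```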

Character Sets

The entire set of characters that the computer can represent is known as the character set. Different countries use different character sets.

Disadvantages of ASCII

There are problems with ASCII. With the advent of the Internet, people all over the globe now communicate with each other. This raises the question of how to represent the myriad of languages, and the symbols used by those languages.

ASCII is only an 8 bit code, which means a maximum of 256 characters can be represented. However, if we look at the Chinese language, there are over 3,500 commonly used characters, none of which are represented by the ASCII code.
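The limit comes from simple counting: a code with n bits has 2 to the power n different patterns. The sketch below works this out and then checks whether some text fits in ASCII. The helper name fits_in_ascii is made up for this example, and note that Python's built-in "ascii" encoding accepts only the original 7 bit codes (0 to 127).

```python
print(2 ** 7)  # 128 patterns in the original 7 bit code
print(2 ** 8)  # 256 patterns in an 8 bit code


def fits_in_ascii(text: str) -> bool:
    """Return True if every character in text has an ASCII code (hypothetical helper)."""
    try:
        text.encode("ascii")  # raises UnicodeEncodeError for non-ASCII characters
        return True
    except UnicodeEncodeError:
        return False


print(fits_in_ascii("Hello"))  # True
print(fits_in_ascii("中文"))    # False: Chinese characters have no ASCII code
```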

To overcome this limitation the Unicode method of representation is used, although this is covered at Higher.

こんにちは = Konnichiwa: a Japanese greeting

Unicode

Although not covered at National 5, Unicode is an alternative method of representing text. It was originally designed as a 16 bit code, which means that there is a maximum of 65,536 characters that can be represented using this notation. ASCII still forms the basis for this, with the first 128 characters being common to the ASCII and Unicode character sets.

ASCII is only an 8 bit code, which means a maximum of 256 characters can be represented. However, if we look at the Japanese language, there are approximately 50,000 characters, although just over 2,000 are on the Japanese Government's recommended list (SJLFAQ). None of these are represented by the ASCII code.
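The overlap between ASCII and Unicode can be checked with a short Python sketch. It relies on the fact that Python strings are Unicode, so chr() and ord() work with Unicode values; the variable name sixteen_bit_codes is only illustrative.

```python
# Every ASCII code (0 to 127) refers to the same character in Unicode
for code in range(128):
    ch = chr(code)  # the Unicode character with this value
    assert ch.encode("ascii") == bytes([code])  # same value in ASCII
    assert ch.encode("utf-8") == bytes([code])  # UTF-8 stores it as the same single byte

sixteen_bit_codes = 2 ** 16  # number of patterns available in a 16 bit code
print(sixteen_bit_codes)  # 65536
print(ord("A"))  # 65, the same value as in the ASCII table
```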

There are various sizes of Unicode characters, from 8 bit to 32 bit. The image below shows that, as of 2012, Unicode is used to represent the majority (over 60%) of pages on the web.
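To see these different sizes, the sketch below measures how many bytes one character needs in three common Unicode encodings (UTF-8, UTF-16 and UTF-32). The function name encoded_sizes and the sample characters are chosen only for illustration.

```python
def encoded_sizes(ch: str) -> tuple:
    """Return the bytes needed for ch in UTF-8, UTF-16 and UTF-32 (hypothetical helper)."""
    return (
        len(ch.encode("utf-8")),      # 1 to 4 bytes, depending on the character
        len(ch.encode("utf-16-le")),  # 2 or 4 bytes
        len(ch.encode("utf-32-le")),  # always 4 bytes
    )


for ch in ["A", "é", "こ"]:
    print(ch, encoded_sizes(ch))
# A (1, 2, 4)
# é (2, 2, 4)
# こ (3, 2, 4)
```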

A webpage showing Unicode characters can be found here.

This means that a much larger variety of languages and characters can be represented.

Advantages and Disadvantages of Unicode

Can represent more symbols.

Can represent more languages.

Uses more storage space per character.
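The storage cost can be illustrated by saving the same short message in ASCII and in two fixed-size Unicode encodings. This is a rough sketch; the helper name storage_bytes is made up, and real files may include a few extra bytes of overhead.

```python
def storage_bytes(text: str, encoding: str) -> int:
    """Return how many bytes text occupies in the given encoding (hypothetical helper)."""
    return len(text.encode(encoding))


message = "Hello"
print(storage_bytes(message, "ascii"))      # 5: one byte per character
print(storage_bytes(message, "utf-16-le"))  # 10: two bytes per character
print(storage_bytes(message, "utf-32-le"))  # 20: four bytes per character
```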
