Data Representation

Numbers in Binary and Denary

You will need to convert 16-bit numbers between denary and binary, and to convert binary numbers to and from hexadecimal (hex).

Methods

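Binary to Denary: add up the powers of 2 for every bit that is 1.
Denary to Binary: repeatedly subtract the highest power of 2 that fits, OR repeatedly divide by 2 and read the remainders from last to first.
Binary to Hex: split the bits into groups of four, starting from the least significant bit, and write each group as one hex digit.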

Binary-to-Denary Conversion

Image from: instructables.com/id/Converting-Decimal-To-Binary-Numbers/
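
The "add up the powers of 2" method can be sketched in Python. This is a minimal illustration; the function name binary_to_denary is made up for this example, and in practice Python's built-in int(bits, 2) does the same job.

```python
def binary_to_denary(bits: str) -> int:
    """Add up the power of 2 for every bit that is 1."""
    total = 0
    # Walk from the least significant (rightmost) bit, which is worth 2**0.
    for position, bit in enumerate(reversed(bits)):
        if bit == "1":
            total += 2 ** position
    return total


print(binary_to_denary("1011"))              # 8 + 2 + 1 = 11
print(binary_to_denary("0000000101011111"))  # 256 + 64 + 16 + 8 + 4 + 2 + 1 = 351
```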

Denary-to-Binary Conversion
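
The repeated-division method can be sketched as follows. The function name denary_to_binary and the width parameter (used to pad the answer to 16 bits) are assumptions made for this illustration.

```python
def denary_to_binary(number: int, width: int = 16) -> str:
    """Repeatedly divide by 2 and collect the remainders."""
    remainders = []
    while number > 0:
        number, remainder = divmod(number, 2)
        remainders.append(str(remainder))
    # The first remainder is the least significant bit, so reverse the list.
    bits = "".join(reversed(remainders))
    # Pad with leading zeros up to the requested width.
    return bits.zfill(width)


print(denary_to_binary(95, width=8))  # 01011111
print(denary_to_binary(351))          # 0000000101011111
```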

Hexadecimal

You can easily move between hexadecimal and binary, because each hex digit corresponds to exactly four bits, then use the previous techniques to convert to or from denary. E.g.,

Image from: wikibooks.org/wiki/A-level_Computing/AQA/Paper_2/Number_bases
Alternatively, you can directly calculate the result: 5F₁₆ = 5*16₁₀ + 15₁₀ = 80 + 15 = 95₁₀
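
Both routes can be sketched in Python: grouping bits into fours to get hex, and the direct place-value calculation from hex to denary. The function names here are made up for illustration; the built-ins hex() and int(text, 16) do the same work.

```python
HEX_DIGITS = "0123456789ABCDEF"


def binary_to_hex(bits: str) -> str:
    """Split the bits into groups of four and write each group as a hex digit."""
    # Pad on the left so the length is a multiple of four.
    padded = bits.zfill((len(bits) + 3) // 4 * 4)
    nibbles = [padded[i:i + 4] for i in range(0, len(padded), 4)]
    return "".join(HEX_DIGITS[int(nibble, 2)] for nibble in nibbles)


def hex_to_denary(hex_string: str) -> int:
    """Direct calculation: each digit is worth 16 times the one to its right."""
    total = 0
    for digit in hex_string.upper():
        total = total * 16 + HEX_DIGITS.index(digit)
    return total


print(binary_to_hex("01011111"))  # 5F
print(hex_to_denary("5F"))        # 5*16 + 15 = 95
```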

Resources

Worksheets handed out in class
BBC Bitesize: Introduction to Binary
Odometer Widget (code.org)
Binary Game (Cisco / code.org)
Crash Course CS Ep 4 - Representing Numbers and Letters with Binary

Text and ASCII

In modern computers, text is represented using Unicode. The UTF-8 encoding is particularly popular and, for plain English text, is identical to the old ASCII encoding that used to be standard. For a quick explanation, watch the Unicode Miracle video.

ASCII stands for American Standard Code for Information Interchange and it was standardised in the 1960s. The original ASCII uses 7 bits to give 128 characters that include the English alphabet, numerals and punctuation. This was extended to an 8 bit (one byte) encoding with "modern" microcomputers in the 70s and 80s.

Older codes include EBCDIC and Morse Code.

ASCII, Unicode & UTF-8

Note: Max Number of UTF-8 bytes is 4 (but the original specification allowed for up to 6 bytes)

Extension: Older Encodings

You can generate your own table in Python with code like:

for i in range(32, 127):  # codes 32-126 are printable; 127 (DEL) is a control character
    print(i, "=", format(i, "02X"), "->", chr(i))

You should know that the digits 0-9, capitals A-Z and lowercase a-z are in sequences. Punctuation is relatively randomly scattered in the gaps.

00NNNNN are control characters (from the old teletypes). My favourite is 07₁₆ = 0000111₂...
01NNNNN are the first set of printable characters
- 0110000 = 0, 0110001 = 1, ... 0111001 = 9.
10NNNNN are the capital letters
- 1000001 = A, 1000010 = B, ... 1011010 = Z.
11NNNNN are the lowercase letters
- 1100001 = a, 1100010 = b, ... 1111010 = z.
- Just flip the second bit from the left (add/subtract 32) to switch between uppercase and lowercase!
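
The case-switching trick can be tried out in Python. This is a small sketch that only makes sense for the ASCII letters A-Z and a-z; the function name flip_case is made up for this example.

```python
def flip_case(letter: str) -> str:
    """Flip the bit worth 32 (0100000 in 7-bit binary) to switch case."""
    return chr(ord(letter) ^ 0b0100000)


print(flip_case("A"))  # a
print(flip_case("z"))  # Z
# The 7-bit codes differ only in the second bit from the left:
print(format(ord("A"), "07b"), format(ord("a"), "07b"))  # 1000001 1100001
```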

You see character codes written in hex when a non-ASCII or reserved symbol is used in a URL. E.g., example.com/products%20and%20services.html, where %20 is the space character (20₁₆ = 32₁₀). See URL Percent Encoding for more details.
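
Percent encoding can be explored with Python's standard library. This short sketch uses quote and unquote from urllib.parse; the example strings are just illustrations.

```python
from urllib.parse import quote, unquote

# A space (ASCII 32 = hex 20) becomes %20.
print(quote("products and services.html"))  # products%20and%20services.html

# Non-ASCII characters are first encoded as UTF-8 bytes, then each byte is written in hex.
print(quote("caf\u00e9"))                   # caf%C3%A9

# unquote reverses the process.
print(unquote("products%20and%20services.html"))  # products and services.html
```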

Character arithmetic: Because the alphabet is stored in sequence, you can add or subtract numbers to the character codes to move around the alphabet. This is used in simple ciphers and could also appear in an exam question. E.g.,

Q: Given that the ASCII character code for ‘A’ is 65, what is the 7-bit binary representation of the character ‘H’?
Ans: A = 65 = 1000001. H is the 8th letter, 7 places after A, so H = 65 + 7 = 72 = 1001000
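
The same character arithmetic can be checked in Python with ord and chr. The helper shift_capital is a made-up name for this sketch and only handles the capital letters A-Z.

```python
# The exam question: 'H' is 7 places after 'A'.
code_for_h = ord("A") + 7
print(code_for_h, chr(code_for_h), format(code_for_h, "07b"))  # 72 H 1001000


def shift_capital(letter: str, shift: int) -> str:
    """Move `shift` places along A-Z, wrapping round at the ends."""
    offset = (ord(letter) - ord("A") + shift) % 26
    return chr(ord("A") + offset)


print(shift_capital("A", 7))  # H
print(shift_capital("Y", 3))  # B (wraps round past Z)
```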

Unicode and UTF-8

Unicode replaced the extended 8-bit ASCII encodings as the international standard. It is managed by the Unicode Consortium. Unicode is basically just a list of symbols and glyphs that are each assigned a number (code point) in the Unicode table. The first 128 code points (7 bits) are identical to ASCII; after that it includes symbols and accented characters from modern and extinct languages, as well as symbols for drawing, linguistics, maths, science and emoji.

Note that not every character/glyph corresponds to a single Unicode code point. E.g., combined sequences: Bear + Snowflake (joined with a zero-width joiner) = Polar bear
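
You can see this in Python, where len counts code points rather than visible glyphs. This is a small sketch using the standard unicodedata module; the variable names are made up for the example.

```python
import unicodedata

# Polar bear = bear face + zero-width joiner + snowflake + variation selector
polar_bear = "\U0001F43B\u200D\u2744\uFE0F"
print(len(polar_bear))  # 4 code points, displayed as one glyph where supported
for code_point in polar_bear:
    print(hex(ord(code_point)), unicodedata.name(code_point))

# An accented letter can also be built from a base letter plus a combining mark.
combined = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
print(len(combined))  # 2
# Normalising to NFC composes the pair into the single code point U+00E9.
print(len(unicodedata.normalize("NFC", combined)))  # 1
```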

The largest Unicode code point (U+10FFFF) needs 21 bits, so the fixed-width UTF-32 encoding stores every code point in 32 bits. This is inefficient for most text, so the standard is to use a variable-length encoding called UTF-8.

UTF-8 for different numbers of Bytes:

1 Byte 0xxxxxxx - just 7 bit ASCII

2 Bytes 110xxxxx 10xxxxxx

3 Bytes 1110xxxx 10xxxxxx 10xxxxxx

4 Bytes 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

The header bits are removed and the remaining payload bits (marked x above) are joined to give a binary number, which is the Unicode code point of the symbol that you want.

UTF-8 Example in Python

>>> word = "Café"

>>> word.encode(encoding='utf-8')

b'Caf\xc3\xa9'

C3 A9 = 1100 0011 1010 1001

Remove the UTF-8 header bits (110 from the first byte and 10 from the second) to get the payload bits: 00011 and 101001, joined as 00011101001. We can convert this number to a character to double check:

>>> int("00011101001", 2)

233

>>> chr(233)

'é'
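
The header-stripping steps can be generalised into a small function. This is a sketch for a single character only; the name decode_utf8_char is made up, and it does not check that the number of continuation bytes matches the header, which a real decoder (such as bytes.decode) does.

```python
def decode_utf8_char(data: bytes) -> str:
    """Strip the UTF-8 header bits from one encoded character and join the rest."""
    first = data[0]
    if first < 0b10000000:       # 0xxxxxxx: plain 7-bit ASCII
        return chr(first)
    if first >> 5 == 0b110:      # 110xxxxx: 2-byte sequence
        payload = first & 0b00011111
    elif first >> 4 == 0b1110:   # 1110xxxx: 3-byte sequence
        payload = first & 0b00001111
    elif first >> 3 == 0b11110:  # 11110xxx: 4-byte sequence
        payload = first & 0b00000111
    else:
        raise ValueError("not a valid UTF-8 leading byte")
    for byte in data[1:]:
        if byte >> 6 != 0b10:    # continuation bytes must start with 10
            raise ValueError("not a valid UTF-8 continuation byte")
        # Make room for 6 more payload bits and append them.
        payload = (payload << 6) | (byte & 0b00111111)
    return chr(payload)


print(decode_utf8_char(b"\xc3\xa9"))               # é
print(ord(decode_utf8_char(b"\xc3\xa9")))          # 233
print(hex(ord(decode_utf8_char(b"\xe2\x82\xac")))) # 0x20ac (the euro sign)
```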

Images

Vector Graphics

These graphics are defined from combinations of mathematical shapes and curves. This allows them to be scaled up and down without loss of quality. Vector graphics are good for simple graphics & logos, but inefficient for photo-style graphics. Modern fonts also use vector graphics. They can be compressed using a lossless LZ-like compression.

Common file formats are: Scalable Vector Graphics (.svg), a modern web standard built on XML (with markup similar to HTML); Adobe Illustrator (.ai); and PostScript (.ps), an Adobe format originally used for communicating with printers/plotters and the ancestor of Portable Document Format (.pdf).

Image from Digitizing prints, bitmap vs vector

Bitmap Images - Uncompressed (.bmp)

Bitmapped images are stored as a two-dimensional grid of pixels. "True color" images use 24 bits per pixel (aka colour depth), 8 bits per RGB colour channel. This means that there is a simple calculation for the file size of an uncompressed image - it's basically just the volume of a box...

size in bits = (# of pixels) * (bits per pixel) = height * width * color depth

For example a 400*200 true color image:

size in kB = (400 * 200 px) * 24 bits/px * 1B/8bits * 1kB/1024B = 400*200*3/1024 kB = 234.4 kB
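
The same calculation as a short Python sketch. The function name bitmap_size_bits is made up for this example, and it ignores the small file header that real .bmp files also contain.

```python
def bitmap_size_bits(width: int, height: int, colour_depth: int = 24) -> int:
    """Uncompressed size in bits = number of pixels * bits per pixel."""
    return width * height * colour_depth


size_bits = bitmap_size_bits(400, 200)  # true colour, so 24 bits per pixel
size_bytes = size_bits / 8              # 8 bits per byte
size_kb = size_bytes / 1024             # 1024 bytes per kB, as in the example above
print(size_bits, size_bytes, round(size_kb, 1))  # 1920000 240000.0 234.4
```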

.bmp and .raw files are uncompressed bitmap images.

Bitmap Compression

Lossless compression: Bitmap graphics can be compressed using an LZ-like compression, but it does not compress things like photos very well. Portable Network Graphics (.png) is a lossless format.

Lossy compression:

Reduce resolution - a simple way to shrink the file at the cost of detail, but it gives a good trade-off between quality and compression ratio
Reduce color space - GIFs (Graphics Interchange Format) use an 8-bit colour map (a palette of at most 256 colours) instead of 24-bit RGB for each pixel. This is lossy for most images. The GIF is then compressed using LZW to get further lossless compression.
JPEG (Joint Photographic Experts Group) type compression - works by discarding information that the eye does not see very well. This is based on a colour space transformation and reduction, followed by a Discrete Cosine Transform and quantization of the resulting frequency space, and finally a lossless Huffman-type encoding. JPEGs get a compression ratio of about 15:1, depending on image content and compression settings.
- See the Computerphile video for a more in depth description: JPG1 and JPG2

Hex Color Codes

"True color" in computers is where each pixel has 24 bits of color depth divided amongst the 3 colour channels Red, Green, and Blue - the three additive primary colours. That is, there is 1 Byte of information for each of RGB for each pixel. This is often written as 6 hex digits: #RRGGBB. For example, #FF0000 is bright red, #FFFF00 is bright yellow, #FF00FF is bright pink/magenta, #A8A8A8 is a light grey.

To approximately figure out what colour is represented by a RGB mix, place them on the rainbow / colour wheel. (Note that the concept of Blue in the 7 ROYGBIV colours due to Newton has shifted over the years)

Magenta R O Y G B I V Magenta

R G B

R+G = Orange to Yellow to Lime Green depending on the ratio
G+B = Cyan / Aqua
B+R = Hot Pink / Magenta

Note that the additive primary colours RGB used for mixing light are the opposite of the subtractive primary colours used for mixing paint and inks. Printers use CMYK, which stands for Cyan, Magenta, Yellow, Key/blacK.
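
The #RRGGBB notation described in this section can be unpacked into its three channel values with a few lines of Python. This is a sketch; the function name parse_hex_colour is made up, and it only accepts the 6-digit form.

```python
def parse_hex_colour(code: str) -> tuple[int, int, int]:
    """Split #RRGGBB into (red, green, blue), each in the range 0-255."""
    digits = code.lstrip("#")
    if len(digits) != 6:
        raise ValueError("expected 6 hex digits, e.g. #FF00FF")
    # Each channel is one byte, i.e. two hex digits.
    red, green, blue = (int(digits[i:i + 2], 16) for i in (0, 2, 4))
    return red, green, blue


print(parse_hex_colour("#FF0000"))  # (255, 0, 0) bright red
print(parse_hex_colour("#FF00FF"))  # (255, 0, 255) magenta: red + blue
print(parse_hex_colour("#A8A8A8"))  # (168, 168, 168) equal channels give a grey
```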

Resources

BBC Bitesize: Encoding Images
Image file size calculator
There is no Pink Light (Minute Physics), There is no White Light (The Science Asylum)
Hex Color Game - Really Good practice!
C0FFEE is the Color, Color Name and Hue.
Colours & Maths Understanding the formulas of colour conversion

(last section is about HSL which is extension but interesting - the article is written by a designer at Shazam, not a programmer / mathematician. Here is the Shazam colour picker)

Activity

Create a simple vector graphic (SVG) using the following link https://editor.method.ac/
Look at the source code for the SVG (under the view menu)
Save as a SVG file to your computer and then use an online conversion tool to convert to BMP and JPG/PNG to compare size and quality of the graphics - zoom in to see the pixelisation and the compression artifacts.
Extension: Use Photoshop to open a photo and then File-"Save for Web" to explore how reducing the resolution and colour space affects the image.

Extension Material Below

Reading below this point may have unintended educational effects...

Audio

MIDI (Musical Instrument Digital Interface)

Originally designed as a way to connect digital instruments, but it can also be used to store the sequences played/produced by a digital instrument. MIDI is analogous to vector graphics and can be compressed using lossless compression.

Sampled Sound & Music

Sound is oscillations of pressure in the air. These waves can be captured by a microphone and turned into digital data using an Analogue to Digital Converter (ADC).

Each sequential measurement is assigned a number (in this case a nibble giving 0-15) according to its amplitude. The end result is a file consisting of a string of numbers, e.g., 1000, 1001, 1010, 1011, 1100, 1101, 1101, 1101, 1100, 1011, 1010, etc...

Image from Planet Of Tunes "how do ADCs work?"
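
The sampling idea can be imitated in Python by measuring a sine wave at regular intervals and rounding each measurement to one of 16 levels. This is a rough sketch, not how a real ADC is built; the function name sample_wave and the 440 Hz / 8000 Hz values are assumptions chosen for illustration.

```python
import math


def sample_wave(frequency_hz: float, sample_rate_hz: int,
                n_samples: int, bits: int = 4) -> list[int]:
    """Sample a sine wave and quantise each sample to 2**bits levels."""
    levels = 2 ** bits
    samples = []
    for n in range(n_samples):
        t = n / sample_rate_hz                                # time of this sample
        amplitude = math.sin(2 * math.pi * frequency_hz * t)  # between -1 and 1
        # Rescale to 0..levels-1 and round to the nearest whole level.
        samples.append(round((amplitude + 1) / 2 * (levels - 1)))
    return samples


samples = sample_wave(frequency_hz=440, sample_rate_hz=8000, n_samples=20)
print([format(level, "04b") for level in samples])  # each sample as a nibble
```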

Resources

BBC Bitesize: Encoding Audio and Video
How to Geek - Comparison of Audio formats
Tom's Hardware article about audio codecs
Wikipedia comparison of audio formats
Audio File Size Calculator

Size of raw audio files (.wav, .aiff files)

The sample rate is how many samples are taken per second. Often measured in Hertz (Hz = 1/sec). CD quality sound uses 44.1kHz.
The sample depth (or resolution or bit depth) is how many bits are used for each sample (4 in the above image, 16 bits on CDs, up to 24 bits on DVDs)
Bit Rate = (Sample Rate) * (Sample Depth) is the number of bits required per second (for a single channel).
File size in bits = bit rate * length of recording

For example, take 2 minutes of (mono) music sampled at 16000 Hz with a sample depth of 8 bits:

Size = (120 sec) * (16000 samples/sec) * (8 bits/sample) = 120*16000*8 bits = 120*16000 B = 1920000 B = 1875 kB (using 1 kB = 1024 B)
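
The file size formula as a short Python sketch. The function name audio_size_bits is made up, and the channels parameter is an addition for illustration (1 for mono, 2 for stereo); the worked example above is the mono case.

```python
def audio_size_bits(seconds: float, sample_rate_hz: int,
                    sample_depth_bits: int, channels: int = 1) -> float:
    """File size in bits = bit rate * length of recording (per channel)."""
    bit_rate = sample_rate_hz * sample_depth_bits * channels
    return bit_rate * seconds


size_bits = audio_size_bits(seconds=120, sample_rate_hz=16000, sample_depth_bits=8)
size_kb = size_bits / 8 / 1024  # bits -> bytes -> kB (1 kB = 1024 B)
print(size_bits, size_kb)       # 15360000 1875.0
```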

Lossless Compression

FLAC (Free Lossless Audio Codec) and other lossless compression formats use techniques such as run-length encoding, linear prediction and LZ compression. These achieve about a 2:1 compression ratio on music.
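
Run-length encoding is the simplest of these techniques to try yourself. The sketch below is a toy illustration of the idea only, not how FLAC is actually implemented; the function names are made up.

```python
from itertools import groupby


def rle_encode(values: list[int]) -> list[tuple[int, int]]:
    """Replace each run of repeated values with a (value, run length) pair."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs: list[tuple[int, int]]) -> list[int]:
    """Expand the (value, run length) pairs back into the original list."""
    return [value for value, count in pairs for _ in range(count)]


samples = [8, 9, 10, 11, 12, 13, 13, 13, 12, 11, 10]
encoded = rle_encode(samples)
print(encoded)
print(rle_decode(encoded) == samples)  # True: nothing was lost
```

Note that only the run of three 13s gets shorter here. Run-length encoding pays off mainly on data with long repeats, such as silence.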

Lossy Compression

Codecs such as MP3 (MPEG-1 Audio Layer III), AAC (Advanced Audio Coding), WMA (Windows Media Audio), and Ogg Vorbis use psychoacoustics and perceptual noise shaping to discard parts of the sound that human listeners are unlikely to perceive. If two sounds play at the same time, the softer one can often be mostly removed. These codecs get about a 10:1 compression ratio.

Activity

Get a MIDI file from BitMidi (e.g., Super Mario Bros)
Look at its structure using Mid2Txt
Convert it to a WAV file & check that its file size matches expectations (DO THE CALCULATION)
Convert it to an MP3 file and calculate the compression ratio compared with the WAV file.
