6.2 Data Integrity

Specification

  • Describe how data validation and data verification help protect the integrity of data

  • Describe and use methods of data validation

    • including range check, format check, length check, presence check, existence check, limit check, check digit

  • Describe and use methods of data verification during data entry and data transfer

    • During data entry including visual check, double entry

    • During data transfer including parity check (byte and block), checksum

Methods of validating and verifying data input

Introduction

As we enter data into any computerised system or database, it is very easy to enter data which isn't sensible. For example, we might enter a negative number for a field that asks us for our house number. There are a number of automatic ways that a computer can trap data which isn't sensible as it is entered. These are known as 'validation rules'. We can also double-check that the data we have entered (and which passed the validation rules) is sensible data. These double-checking techniques is called 'verification'. We can protect the integrity of data as it is entered into a system by using validation and verification techniques.

Data validation and data verification

Validation is the term used to describe the automatic checking of data as it is entered into any computer system, often a database or spreadsheet. The aim is to allow only sensible data to be entered. Sensible data means that it follows any validation rules set up for that field. Verification is checking that the sensible data you entered (which passed the validation rules set up for that data) is actually the data you meant to enter.

For example, if you had to enter the number of siblings into a program (the number of brothers and sisters you had), and the validation rule was 'Greater than or equal to zero and less than 10' you might enter 3. 3 is valid data because it obeys the validation rule for siblings. The problem is, you meant to enter 2, because you have two siblings not 3. You can only pick this problem up by a suitable verification technique.

We will look at a range of validation methods first and then some verification methods.

Range checks.

You can set up rules to ensure that if someone enters a number it must follow some mathematical rules, for example, be greater than a number, or less than or equal to a number or be between two numbers and so on. If a value is entered outside of this range, then a helpful error message should be displayed. For example:

      • Number of attempts < 3 (Number of attempts must be less than three).

      • Number of items <= 5 (Number of items must be less than or equal to five).

      • Minimum bet > 100 (The minimum bet must be great than one hundred).

      • Minimum bet >= 100 (The minimum bet must be greater than or equal to one hundred).

      • Score >= 0 AND <=180 (The score must be between 0 and 180 inclusive).

      • Score BETWEEN 0 AND 180 (A different way of writing a rule that the score must be between 0 and 180 inclusive).

Format check.

Sometimes, especially with membership and ID identifying codes, the code is made up of a mixture of numbers and letters in a certain order. For example, a national chess club may have a membership ID that is made up of 2 letters, followed by 3 numbers, followed by a letter, for example, RG662P. You can set up rules that look for patterns such as

LETTER-LETTER-NUMBER-NUMBER-NUMBER-LETTER

If TYP23F is entered into a data input form, it will not get put in the database. It will be rejected and a helpful error message should pop up! This kind of check is known as a 'format check'.

Length check.

In text fields, you can tell a database to accept an entry only if it is less than a certain number of characters. For example, you might set a field up called COLOUR to be less than or equal to 10 characters. If someone entered BLUE WITH A HINT OF MANGO, it would get rejected because it doesn't follow the validation rule. It has more than 10 characters.

Allowable values check.

You can set up fields to only accept certain values. For example, you might have the Boolean field GENDER set up to only accept MALE or FEMALE. You might set up a text field called WHEN DO YOU WANT TO LEAVE? so that it can only accept MORNING, NOON or EVENING.

Check digit.

Commonly used on bar codes, this is a technique used to see if a number has been entered correctly. A number, called a check digit, is placed at the end of a code. The bar code is read and a maths formula applied to it. If the result is the same as the check digit, then the code has been read correctly. There are different methods of working this out. One method is called 'modulus 11' and is used in ISBN numbers. Study this example

Suppose you have the number 12162? The question mark at the end is the check digit but what should it be? The check digit will be calculated, in this case, using 'modulus 11'.

42 is then divided by 11 (because we are using modulus 11. You could use modulus 6, or 7, for example). This equals 3 remainder 9. The remainder is now subtracted from 11 to give the check digit. 11 - 9 = 2.

2 is the check digit!

There are 2 special cases to remember.

      1. If the remainder is 0, then the check digit is 0.

      2. If the remainder is 1, then the check digit is X.

What about 13 digit ISBN codes?

The code for a book under the barcode is 978-0-954351472. The check digit is the number on the right hand side, 2. A barcode scanner would read the first 12 digits of the ISBN code and then see if it got a 2. If it did get a 2, it would know that the ISBN code had probably been scanned correctly. How does it work out the check digit?

      • First, it multiplies each digit in turn by a 1, then a 3, then a 1, then a 3 and so on, and adds up the result, like this:

      • (9 x 1) + (7 x 3) + (8 x 1) + (0 x 3) + (9 x 1) + (5 x 3) + (4 x 1) + (3 x 3) + (5 x 1) + (1 x 3) + (4 x 1) + (7 x 3) = 138

      • Next, it divides the 138 by 10 to get 13 remainder 8.

      • Finally, it subtracts the reminder from 10, so 10 – 8 = 2. (This system uses something called ‘modulo 10’, which just means the remainder after you divide a number by 10).

      • As you can see, the check digit is 2, which is also the number on the right hand side of the 13 digit ISBN code.

Do note that 10 digit ISBN numbers use a different system to the one described here for 13 digit ISBN numbers.

Presence check.

Some fields are more important than others and must be filled in. For example, suppose you had a database that stored car adverts. A customer rings up to place an advert and the operator types the advert into the database using a data input form. If the customer doesn't know the colour of the car, it is not a disaster and so it doesn't need to be filled in. If the customer has forgotten his or her telephone number, however, this is a problem - how can anyone contact him to buy the car? This field cannot be left blank. If the operator presses ENTER to enter the whole advert and this field has been missed out, then a helpful error message should appear. This kind of validation rule is known as a 'presence check'.

Data verification

Once sensible data has been entered, verification techniques should be employed. Verification is the process of checking that the sensible data entered is actually the data you want! For example, suppose you have a field NAME. You enter JONNS. This is valid data. But when you double check it, you realise that it should be JONES not JONNS. You can then correct it. This is verification. There are a number of verification methods in common use.

Proof-reading

Sometimes, simply proofreading what has been entered into a database will be enough to spot errors. Proofreading can be done by the person entering in the data but it is usually best done by someone other than the person who typed the data in as they are more likely to spot mistakes.

Reading back.

When you book a holiday at a travel agent, you give them all of your details and they type them into their computer. They can verify that your details are correct simply by reading them back to you and asking you if they're correct! Sometimes, organisations need you to confirm some details they have given you. In this case, they may print out what you told them on a form and then ask you to confirm the details, sign and return it. If you have ever ordered anything over the phone, then you will know that after you have placed your order, the operator will read it back and ask you to confirm that the order is correct.

Tick the box and click to proceed!

If you buy something on the Internet, you will typically place an order by typing it into a form. The order will then be prepared, possibly rearranged and then displayed back to you on the screen. You will be asked to check it carefully, tick a CONFIRM button and then click a PROCEED button (or variations of this procedure). This procedure is another example of a verification method. Indeed, it may be followed up with a confirmation email, which is a further method of verification.

Re-enter whole data file.

This is used in data processing environments. Operator A types in a set of paper-based orders, for example. They are saved in a file, but not entered into the database yet. When operator A has finished typing in the orders, the paper-based orders are passed to operator B. He re-types them all in. As operator B types in an order, it is checked by the software to see if what he has typed in is the same as what is held on file, what operator A typed in. If they are the same, then the order is entered into the database. If they are not the same, then an error is flagged up and the order needs to be carefully checked. Flagging up can be done in a number of ways. The keyboard could 'beep'. An error message could be displayed, or all of the errors saved to file and an error report printed out at the end of data entry. Although this may seem a little long-winded, it is an important method of verification for critical data, such as for entering in data from passport applications.

A database can still become corrupt

It should be recognised that even with clever validation and verification techniques in place, the data entered into a database could still be compromised. For example, it is possible that both operators during validation entered in the same, but wrong, data - perhaps they both misunderstood the handwriting on the paper-based forms. It is possible that both operators entered in the information on the form, but the information on the form was wrong! Even with the best methods of validation and verification in place, data can still be compromised.

Errors in data transmission

An introduction to data transmission errors

When data is transferred from one place to another, it can become 'corrupted'. For example, electrical cables and electrical devices generate a magnetic field. This field can interfere with the electrical signals that make up transmitted data, resulting in bits being changed from a 'one' to a 'zero' or from a 'zero' to a 'one'. It is necessary for computers to check that data has been sent correctly, that it hasn't become corrupted. We can protect the integrity of data as it is transmitted in a number of ways. We will discuss four of them here. They are:

      1. Parity checking.

      2. Parity blocks

      3. Echo.

      4. Check Sum.

Parity

When you send a byte of information using 7-bit ASCII, you have one bit spare. This can be used to check for errors in transmission. We have already said that when bytes are moved from one place to another, particularly over long distances, the bits that make up a byte can get corrupted by electrical interference. By using an error-checking method known as 'parity checking', half of these types of errors can be detected. Parity checking involves both devices deciding in advance whether they are going to use even parity or odd parity. There is no advantage of one method over the other. With even parity, the number of bits in every byte must always be even. With odd parity, the number of bits in every byte must be odd.

An example of parity

For example, you are sending the 7-bit ASCII code 0001001. The byte we need to send is ?0001001. The question mark is the parity bit, which we have to decide whether to make a zero or a one. When we look at the ASCII code, we can see only two bits are set (set means a 'one'). If it was decided before transmission to use even parity, we know that the number of bits in every byte must therefore be even. We must make the parity bit in this byte a zero to keep the total number of bits even.

If we had decided in advance to use odd parity for data transmissions, then in the above example, we would have had to set the parity bit to a one, to make the total number of bits in the byte odd.

Even parity

Let us assume that we are using even parity, and the byte 00001001 was sent. The computer that receives the byte knows that it is using even parity because it was agreed in advance. It counts the number of bits in the byte and if they are even it accepts the data. If it counts an odd number of bits, however, then one of the bits must have been corrupted and it must signal to the sending computer to send the data again.

Odd parity

Let us assume that we were using odd parity, and the byte 10001001 was sent. The computer that receives the byte knows that it is using odd parity because it was agreed in advance. It counts the number of bits in the byte and if they are odd it accepts the data. If it counts an even number of bits, however, then one of the bits must have been corrupted and it must signal to the sending computer to send the data again.

If you send 00001001 using even parity, and two of the bits get corrupted, for example, to 11001001, then there will still be an even number of bits and the data will be accepted, despite their being an error. If 4 bits or 6 bits or 8 bits were corrupted in this example using even parity, there would also be no error reported. In fact, an error will only be reported if 1, 3, 5 or 7 bits get corrupted. The same argument applies to picking up errors using odd parity. Parity checking is a useful way of picking up half of the errors created during data transmission.

Parity blocks

By using two parity bytes in a block pattern, you can pinpoint where a bit has changed in a block of data. Here are 7 bytes of 7 bit data, arranged as a block:


We will use even parity. We work out the parity bits for each row and column, like this:


Notice that one bit has been shaded. Let's assume that there has been a transmission error and that bit where it is shaded got changed:


We know exactly which bit changed now because we can check each vertical and horizontal parity bytes again. Two of them are now wrong and because they are in a grid, we can use where they meet to pinpoint where the problem is.

Echo

Another very useful way of checking if a message has been sent successfully is to use a technique known as 'echo'. The way this works is that a message is sent. The receiving computer then returns the message to the sending computer that asks, "Is this what you sent"? If it is, then the sending computer signals to the receiving computer that the message was sent correctly. If it isn't, then the message is re-sent. Echo requires data to be sent twice and therefore takes longer.

Check Sum

Another method used to check data is the check sum method. Data is usually sent in blocks because that is the most efficient way of sending the data. Apart from the data, however, an extra byte is added, known as the 'Check Sum'. The value of the check sum is arrived at by carrying out a calculation on the data to be sent, called a 'hashing algorithm'. When the receiving computer receives the data, it does its own check sum calculation using the same hashing algorithm, and then compares its result with the check sum that was sent. If they're the same, then it accepts the data. If they are not the same, then it can request the data to be sent again.