Normalization is a technique for designing relational database tables to minimize duplication (data redundancy) and in so doing reduce the risk of data anomalies (ie. deletion, update and insertion anomalies).
Update Anomaly - Can occur when we need to update the same data in more than one place.
Deletion Anomaly - Can occur when a deletion causes an unnecessary loss of data.
Insertion Anomaly - Can occur when we add a record and it does not satisfy the design or primary key requirements.
This example shows students’ details and which sports team they play for. The table also includes to which house the student belongs, the coach of the team and the number of matches the student has played for each specific team.
There are a few problems with tables that are not normalized:
There is a risk of an update anomaly. Because data is repeated, when we update we must update all occurrences.
There are fields left empty – meaning space is wasted.
There are repeating groups (or columns). If one student wanted to play for a fourth team we would need to add Team4, Coach4 and NMatch4 columns for all students.
First normal form (1NF) sets the very basic rules for an organized database:
Eliminate duplicative columns (repeating groups) from the same table.
Identify primary keys.
When converting to 1NF you must get rid of repeating groups by rather repeating rows. This can lead to data being repeated, e.g. Paul McCartney’s details are listed three times – once for every team for which he plays. However, there is no longer the waste of space seen in the un-normalized table.
A primary key is a set of fields which uniquely identify a row. In this case the StudID and the Team together were chosen as the primary key. Two boys might have the same name so it might not be a good idea to use the name as part of the primary key. You must always show which fields have been selected as the primary key by underlining them.
No repeating groups
Key chosen
In 1NF there is a risk of an update, delete or insert anomaly. An update anomaly can occur because there is data redundancy (i.e. data is repeated). A delete anomaly can occur if a student stops playing for a team. E.g. if Ringo Starr quits both 1st team rugby and U16B cricket then his details will no longer appear in the table and a delete anomaly has occurred. Finally, I can no longer insert a new student into the table unless he plays for a team, since the team field is part of the primary key.
Second normal form (2NF) further addresses the concept of removing the need for duplicative data:
Meet all the requirements of the first normal form.
Remove subsets of data that apply to multiple rows of a table and place them in separate tables.
Create relationships between these new tables and their predecessors through the use of foreign keys.
The diagram above is called a dependency diagram. Theoretically we could find out a student’s name, house and house colour just by knowing his student ID. In other words the name, house and house colour are dependent on only the StudID part of the primary key – this is known as a partial dependency. Similarly, I can find the coach of a sports team if I know the name of the team. The parts of the primary key must be underlined. In order to find out how many matches a particular boy has played for a particular team, I need to know both who the student is and what team were looking for. In other words, the number of matches is dependent on both parts of the primary key (i.e. StudID and Team). This is known as a full dependency.
Once you have drawn a dependency diagram it is easy to convert to 2NF. Firstly, create a table for each partial dependency:
Students Table
Teams Table
Next we’ll need to create a table to show how the students are related to the teams. In this relationship table we’ll need to add the parts of the primary key and any fields which were fully dependent on the 1NF primary key:
Relationship Table
Same as 1NF
No partial dependencies
Risk of update, delete and insertion anomaly reduced.
Also note that we have removed any compound primary keys.
Third normal form (3NF) goes one large step further to further reduce data redundancy:
Meet all the requirements of the second normal form.
Remove columns that are not dependent upon the primary key.
Note that we can work out the colour of a house without needing to know a student’s id or for which he is playing. In other words, the house colour is dependent on the house. Whenever there is a dependency which does not involve part of the primary key, we label it a transitive dependency.
To convert a table to 3NF we simply do the same as 2NF but put all the transitive dependencies in their own tables:
Students Table
Teams Table
Houses Table
Relationship Table
Same as 2NF
No transitive dependencies
Risk of update, delete and insertion anomaly reduced.