Learning Outcomes:
• describe the characteristics of data in un-normalised form, first normal form (1NF), second normal form (2NF) and third normal form (3NF);
• describe the advantages and disadvantages of normalisation;
• normalise data requirements for a scenario to 3NF;
• describe the components of a data dictionary;
Normalisation is a formal bottom up process. The documents used within an organisation can be examined. Data to be stored can be identified from within the documents. Normalisation involves the application of a set of rules to remove undesirable characteristics and problems and can be used to test the correctness of an ER model. Normalisation will reduce data redundancy, storage space, and update anomalies such as insertion, deletion and modification. It improves the integrity of the data.
Un-normalised data is characterised by redundancy and inconsistencies. There are repeated fields and multiple values of attributes (Employee No, Name, Start Date, Pay Grade, Hourly Rate, Hours Allocated) for one Project. Here, one project can use several employees and that data will be reflected in a single row in the database. Additionally, Margaret Lynch is being stored twice which results in a lot of memory consumption and data redundancy.
Under these circumstances, anomalies can arise in the data. For example, if Margaret Lynch was to move to Pay Category B, two updates would be required in the flat file table. If the data is updated only in one place, then the content of the file is incorrect leading to poor data integrity. This problem has been caused by an update anomaly. When data has high levels of redundancy anomalies will arise, integrity is low and there will be inconsistencies due to data duplication.
The data in the table above could be normalised to enable it to be implemented in a relational database. Normalisation is a formal bottom up process. The documents used within an organisation can be examined. Data to be stored can be identified from within the documents. Normalisation involves the application of a set of rules to remove undesirable.
To write data in UNF complete each of the following steps:
√ Select a name for the main entity – in this case PROJECT.
√ Each field contained within the entity is listed in brackets.
√ Primary keys are underlined.
√ Foreign keys are represented with *.
√ Repeating groups are contained within {}.
√ Derived attributes such as average Hourly Rate should not be included here but the developer may decide to store the value to improve performance.
The Project data can be written in un-normalised form as follows:
PROJECT(Project No, Project Name, {Employee No, Name, Start Date, Pay Grade, Hourly Rate, Hours allocated})
Data is said to be in 1NF if all attributes are atomic and entities do not have repeating groups. (Atomic means that the attribute cannot be further sub divided). Clearly the table violates the rules for 1NF as there are repeating groups identified. We must remove the repeating groups into a new table with a copy of the primary key from the main table. Call the new table (or Entity) PROJECT_EMPLOYEE.
PROJECT_EMPLOYEE (Project No*, Employee No, Name, Start Date, Pay Grade, Hourly Rate)
PROJECT now becomes PROJECT)
The repeating groups have been removed from the PROJECT table and placed into a new table called PROJECT_EMPLOYEE. This table has a composite primary key made up of Project No and Employee No. This means that Project No and Employee No are required to uniquely identify a row in the PROJECT_EMPLOYEE table. For example the Project No 1200 would return 3 rows, whereas Project No 1200 and Employee No 10 would give one unique row. A copy of the Project No is kept in the PROJECT table. In the PROJECT_EMPLOYEE table, Project No is referred to as a foreign key. This facilitates the link or relationship between the two tables.
Data is in 2NF if it fulfils the conditions for 1NF and all non-key attributes in the relation are fully dependent on the primary key. Non-key attributes which are only partially dependent on the key are removed to a new entity along with a copy of their determinant (key attribute on which they are partially dependent). Since the PROJECT table has a single primary key field, there can be no partial key dependencies.
A table with a single attribute as a primary key is in 2NF. The PROJECT_EMPLOYEE table has a composite key made up of two attributes. In this case each non key attribute is checked for dependency on the key attributes. This table is not in second normal form because:
• Employee No can be used on its own to find one unique Employee Name, Start Date, Pay Grade and Hourly Rate. For example Employee No 10 is always Margaret Lynch. Therefore Employee Name, Start Date, Pay Grade and Hourly Rate are partially dependent on the composite key (Project No and Employee No). That is they are functionally dependent on Employee No only. Employee No is their determinant.
• In order to determine the Hours Allocated however, both the Project No AND Employee No are required. Therefore Hours Allocated is fully functionally dependent on the key and remains in the entity PROJECT_EMPLOYEE.
The new entity will be called EMPLOYEE.
PROJECT(Project No, Project Name)
EMPLOYEE(Employee No, Name, Start Date, Pay Grade, Hourly Rate) PROJECT_EMPLOYEE now becomes: PROJECT_EMPLOYEE(Project No*, Employee No*, Hours Allocated)
Data is in third normal form if it is in second normal form and there are no non-key dependencies. That is as long as an attribute does not depend on another attribute which is not the primary key. This is called a transitive dependency. All tables must be examined to ensure that there are no transitive dependencies. If a transitive dependency exists the attribute and its determinant must be removed to a new entity. A copy of the determinant is retained in the original table.
Examine the PROJECT table. Since this table has only one attribute along with the primary key, there are no transitive dependencies. Examine the PROJECT_EMPLOYEE table. We have already established that all elements in this table depend fully on the composite key. Examine the EMPLOYEE table. Clearly the Pay Grade attribute can be used to find a unique Hourly Rate. This constitutes a transitive dependency. Therefore, Hourly Rate must be removed to a new entity PAYRATE along with a copy of its determinant. The determinant becomes the key field in the new table.
PROJECT(Project No, Project Name)
PAYRATE (Pay Grade, Hourly Grade)
PROJECT_EMPLOYEE(Project No*, Employee No*, Hours Allocated)
EMPLOYEE now becomes EMPLOYEE(Employee No, Name, Start Date, Pay Grade*)
Note that a copy of the determinant is left in the EMPLOYEE table as a foreign key.
• The resulting database will take up less storage space because the duplication of data is minimised.
• Information retrieval will be more efficient because data is structured effectively. Queries will be processed faster.
• Less redundancy means less inconsistencies in data because data will only have to be entered once.
• Data integrity will be increased.
Normalisation is a complex process required to create the database structure.
• Normalisation can generate more tables than an un-normalised database
• More tables mean a more complex database and queries may be slow to run. • It is necessary to assign more relationships to interact with larger numbers of tables
• With more tables, setting up queries can become more complex
This is a great video on Normalisation. It will not work in school as it is on Youtube, but should work at home or on your mobile.
A data dictionary is known as a physical data model. A data dictionary is a file containing descriptions of the structure of the data stored in a database. The data dictionary contains field names, data types, field size, format, default values, entity or table names and whether or not the field is a primary key and details about validation. A data dictionary can be used to provide metadata or ‘data about data’. This allows identifiers to be attached to key data which can then be structured, retrieved and used.
Possible Exam Questions
1.Describe the benefits of normalisation. (4 Marks)
• The resulting database will take up less storage space because the duplication of data is minimised.
• Information retrieval will be more efficient because data is structured effectively. Queries will be processed faster.
• Less redundancy means less inconsistencies in data because data will only have to be entered once.
• Data integrity will be increased.
2.Explain the 3 rules of normalistation. (6 Marks)
1NF- All attributes are atomic repeating groups are removed as a separate entity
2NF-Data is in 2NF if it fulfils the conditions for 1NF and all non-key attributes in the relation are fully dependent on the primary key
3NF-Data is in 3NF if it is in second normal form and there are no non-key dependencies.
3. Normalise the table below to 3NF, show in your working 1NF, 2NF and 3NF. (6 Marks)
4. In relation to the table above explain the following terms.
Data Inconsistency (2 Marks)
Data inconsistency is when data in the table is repeated and the values differ. In the table above the student S3 Green has two different dates of birth.
Data Duplication (2 Marks)
Data duplication is when the same attributes are repeated unnecessarily, for example course title, tutor name, student name, DOB, Student Status are all being repeated for each record.
5. Normalise the table below showing each stage (1NF, 2NF and 3NF). (6 Marks)
1NF- Student (Student ID, Student Name)
Exam(Exam ID, Student ID*,Exam Name, Result)
2NF- Student (Student ID, Student Name)
StudentExam(StudentID*, ExamID*, Result)
Exam(Exam ID, ,Exam Name)
3NF- As above no changes required.
Do not write all of the attributes out in one table for 1NF, that is 0NF. For 1NF we must separate the tables in two removing the repeating group.
Result goes in the link table as it depends on the student and the exam.
We won't always have to make a change for 3NF but we must still write it out.
6. Normalise the table below to 3NF, show all of your working (9 Marks)
1NF -Film (FilmID, Title)
-Actor(ActorID,ActorName,AgentID,Agent Name)
2NF- Film (Film ID, Title)
Film_Actor(ActorID, FilmID)
Actor(ActorID, ActorName,AgentID, AgentName)
3NF- Film (Film ID, Title)
Film_Actor(ActorID, FilmID)
Actor(ActorID, ActorName,AgentID )
Agent (AgentID, AgentName)
Explain why the data in this table is not in 1NF (2 Marks)?
The data is not in 1NF as there is a repeating group, (ProdID, ProdName, Size, Cost, Quantity)
Primary Key (2 Marks):
A primary key is used to uniquely identify each record in a database. Each order can be identified through the order ID.
Composite Key (2 Marks):
A composite key is when two or more fields act as the primary key, orderID and prodID in the order-product table.
Explain why the data in the table is not in 2NF(2 Marks):
The data is not in 2NF as there as ProdName is not fully dependant on both parts of the primary key, it is fully dependant on Prod ID and as such sjhould be moved to a separate table.