Codd's 12 rules



Codd wrote these rules to keep database software vendors honest. Especially in the old days (1980), many products were touted as relational database management systems (RDBMS) when they weren't. Failure to adhere to these rules can corrupt data, and it always leads to confusion.

I'll try to explain practical aspects as opposed to rigorous mathematical proofs. A full and detailed explanation requires reading Codd's published works, as well as works by several other luminaries. The topic is very technical and difficult.


http://en.wikipedia.org/wiki/Codd%27s_12_rules


Many thanks to the Wikipedia. I've kept some of the Wikipedia text, and added my own notes. (I found a rumor that Rule 0 is the foundation of all the other rules, and as a summary is not apparently viewed as one of the 12 detailed rules.)



Rule 0: The system must qualify as relational, as a database, and as a management system. 

It turns out that none of our RDBMS is a "true" RDBMS by the strict definition. However nearly all SQL databases are reasonably close. The are relational in that tables are related to each other and can be queried via a "join". They are databases in that they store data in an organized structure, and are well suited to large amounts of data. They are a management systems in that they include SQL statements to "manage" database structure and to perform administrative tasks such as granting permissions. The management system aspect is crucial: the system must be administered via the same query language and facilities as the data to be considered an RDBMS.



Rule 1: The information rule. All information in the database is to be represented in one and only one way, namely by values in column positions within rows of tables.

This refers to normalization (and perhaps to a lesser extent, data typing). Basically, only put one piece of data in each field, and do not put the same data into different fields or different tables.  People are templed to do things like put a comma separated list of names into a text field. Software outside of SQL can easily parse the comma separated list into an array. We call this "de-normalization" and it is a bad idea. De-normalization is a shortcut that usually ends poorly.


Rule 2: The guaranteed access rule. All data must be accessible.

When Codd invented relational databases, the standard practice was that much of the data could not be accessed directly. It was necessary to start with some known snippet of information, and then read linking data from various (obscure) places to find the buried scrap of data. All data in SQL databases can be accessed directly, even if it usually is accessed together with data that gives it context. Restated, every row must have a unique primary key. That key might be a multi-field key, or a single field, but it must be unique. As a result, some RDBMS have a hidden, unique row id for every row.


Rule 3: Systematic treatment of null values.

Codd realized that empty fields occur, and the database must have a consistent way to handle nulls. These aren't just empty strings or the number zero. SQL does handle nulls differently from empty strings. However, handling of "missing" is not necessarily distinct from null in nearly all RDBMS. This is an important mathematical problem, but in the real world it boils down to a policy decision that has to be carefully documented, and carefully tested in the application.



Rule 4: Active online catalog based on the relational model.

There must be system table that describe the data tables. All aspects of creating and modifying the data tables properties must be manipulated via SQL. It is also bad if the system uses strange, non-SQL tricks to keep track of your data. It is worse if the system can't find your data. 



Rule 5: The comprehensive data sublanguage rule.

We use SQL as the relational language. I've heard rumors that Codd was upset about the weaknesses and limitations of SQL. In defense of SQL I'll point out that it works. Granted, SQL is nearly always used together with a normal programming language. On the plus side (and as required by Codd) SQL does support definitions (tables and rows), views, create, retrieval, update, delete, security, integrity constraints, and transactions. Transactions allow work to be done without effecting the database until a "commit" is issued at which point all the changes take effect. Transactions may also be canceled via "rollback" in which case all the work essentially never existed and the state of the data is unchanged.



Rule 6: The view updating rule. 

A view is a table that has been created dynamically by a query. Not all SQL databases have updateable views in the practical sense, although in theory the system can update any view by recreating it from scratch.


Rule 7: High-level insert, update, and delete.

"High-level" means that you have effect multiple rows from multiple tables with a single query. While obvious to modern SQL users, old products were unable to do this, and I'm sure that led to obscure code, horrible bugs, and data corruption. The system must support set-at-a-time insert, update, and delete operators. This means that data can be retrieved from a relational database in sets constructed of data from multiple rows and/or multiple tables. This rule states that insert, update, and delete operations should be supported for any retrievable set rather than just for a single row in a single table.



Rule 8: Physical data independence. 

You can move the database to a different disk and it still works the same way. I think this refers to old products which relied on the file structure of the disk for data access.Changes to the physical level (how the data is stored, whether in arrays or linked lists etc.) must not require a change to an application based on the structure.



Rule 9: Logical data independence. 

In practice there are changes you can make to your table structures that won't effect queries, but you have to be careful. SQL is not comprehensive enough allow any logical change. There are known limits of SQL. In real life this is a challenge, but generally dealt with by common practices and certain safeguards (such as copying a table before major logical changes). I think by "logical changes" the rule refers to adding columns, changing column data types, and perhaps even adding and deleting rows. It would be very bad (and unlikely) to design a data warehouse where deleting rows from one table made a query fail. (Where "fail" is "crash". Returning no records when no records exist is not failure from the databases's point of view.)Changes to the logical level (tables, columns, rows, and so on) must not require a change to an application based on the structure. Logical data independence is more difficult to achieve than physical data independence.



Rule 10: Integrity independence. 

SQL saves table information in other SQL tables which are together known as the catalog. These changes are made at an administrative level via SQL queries, and usually these have little or no effect on the application.



Rule 11: Distribution independence. 

To me, this is similar to Rule 8 about physical independence. In fact, some RDBMS move data around to improve performance, and the SQL is completely uneffected.The distribution of portions of the database to various locations should be invisible to users of the database. 


Rule 12: The non-subversion rule. 

The system must not have features that allow you to subvert database structure integrity. Basically, the system must not include back doors that let you cheat the system for features such as administrative privileges or data constraints. We would never want to do this, and I can't think of any SQL systems that allow it.




Comments