Databases

What is a database?

A database is "an organised collection of data, typically in digital form". The data is organised using a schema (a structure). The most commonly used structure is the relational database, which organises different data into different tables (relations) consisting of rows (tuples) and columns (fields). A relational database schema therefore describes the tables that are present in the database, together with the fields of each table and the data type of each field.
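To make the idea of a schema concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name "pupils", its fields and the sample rows are made up for illustration; they do not come from any real database.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# The schema: one table (relation) with three fields and their data types.
# The table and field names are hypothetical examples.
conn.execute(
    """
    CREATE TABLE pupils (
        pupil_id INTEGER PRIMARY KEY,
        name     TEXT NOT NULL,
        grade    INTEGER NOT NULL
    )
    """
)

# Each inserted row is one tuple of the relation.
conn.executemany(
    "INSERT INTO pupils (pupil_id, name, grade) VALUES (?, ?, ?)",
    [(1, "Thandi", 10), (2, "Sipho", 11)],
)

# Ask the database to describe the table: field names and declared types.
schema = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(pupils)")]

# Read the rows back out.
rows = conn.execute("SELECT name, grade FROM pupils ORDER BY pupil_id").fetchall()
```

Here "schema" holds the field names with their data types, and "rows" holds the stored tuples.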

Other popular database types include "object" (consider the data that is stored in objects in OOP) and "XML". Each has its own strengths and weaknesses.

Microsoft Access

In school, there is a very good chance that you will use Microsoft's Access database. Access has much of the same functionality that many database servers provide, but there are some differences. Importantly, Access is not a server application, so programs that want to use the database must be able to access the actual file. A database server, however, can allow hundreds or thousands of people to access the data all at once. Access is nevertheless a very popular data manager for small applications that do not require the exceptional performance and high availability that a database server might provide. These shortcomings exist because Access has been designed to be user-friendly as a first priority.

In learning SQL and by accessing a database through programming, we are given the same interface that we would normally use for “real” database servers. In theory, if you wanted to write your program to use a high-end server, all you would need to change is how you connect to the database rather than change all of the queries in your program.
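The sketch below illustrates this idea in Python, with the built-in sqlite3 module standing in for whichever database you actually use. The function names and the "books" table are hypothetical. Only open_connection() knows which database is behind the program; the query functions just receive a connection. In practice, drivers can differ in small ways (for example, some use %s rather than ? as the placeholder), which is why this only holds "in theory".

```python
import sqlite3


def open_connection():
    # The only database-specific part of the program. Moving to a server
    # would mean replacing this call with that server's own driver.
    return sqlite3.connect(":memory:")


def add_book(conn, title):
    # A parameterised INSERT; the SQL itself does not mention the database product.
    cursor = conn.cursor()
    cursor.execute("INSERT INTO books (title) VALUES (?)", (title,))
    conn.commit()


def list_titles(conn):
    cursor = conn.cursor()
    cursor.execute("SELECT title FROM books ORDER BY title")
    return [row[0] for row in cursor.fetchall()]


conn = open_connection()
conn.cursor().execute("CREATE TABLE books (title TEXT)")
add_book(conn, "Dune")
add_book(conn, "Emma")
titles = list_titles(conn)
```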

Database Servers

Database servers, which include Microsoft SQL Server, MySQL, Oracle, Postgres and many, many more, are designed to give much faster performance and to allow many more people to connect to the database and retrieve data from it simultaneously. Consider the Google search engine database. This runs on a database engine that was written specifically by the engineers at Google. This database can handle hundreds of thousands of queries every second, searching through billions of web pages each time. In effect, that is equivalent to searching through many trillions of web pages each second!

Server Clusters

Databases, especially large ones such as Google’s search database and those which require extremely good performance, are often located on more than one machine. These databases can spread the load of their queries and operations amongst several, hundreds or even thousands of servers. They follow a hierarchical structure which allows them to operate quite efficiently.

At the top of the hierarchy we typically have a master server. This server contains the “official” and most current version of the database. The master server, in a large cluster, will typically not have any interaction with any users. Its job is to make sure that all of the other servers in the cluster are operating with the same data and that any changes to the data are synchronised as quickly and efficiently as possible between all the servers, each of which has a copy of the database.

If the master server crashes, there are normally rules in place to determine which server in the cluster will assume the master role until the master server comes back online.

Since most queries to databases are accessor-type queries (sometimes as much as 90% of all the queries that a database deals with are to extract data from the database rather than add new data), these are sent out to peripheral servers which do most of the hard work of looking up information with the copy of the database that they have.

If any of these servers is passed a data-modifying query, the query is sent to the master server, which makes the change. The master server then sends the change out to all the other servers in the cluster.
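The routing described above can be sketched as a toy simulation in Python. The class names and the dictionary used as "the database" are inventions for this example; a real cluster works over a network and is far more involved.

```python
class Server:
    """One machine in the cluster, holding its own copy of the data."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class Cluster:
    def __init__(self, replica_count):
        self.master = Server("master")
        self.replicas = [Server(f"replica-{i}") for i in range(replica_count)]
        self._next = 0  # which replica answers the next read

    def write(self, key, value):
        # Data-modifying queries go to the master...
        self.master.data[key] = value
        # ...which then sends the change to every other server.
        for replica in self.replicas:
            replica.data[key] = value

    def read(self, key):
        # Read queries are handed to the peripheral servers in turn.
        replica = self.replicas[self._next % len(self.replicas)]
        self._next += 1
        return replica.name, replica.data.get(key)


cluster = Cluster(3)
cluster.write("stock", 5)
first = cluster.read("stock")
second = cluster.read("stock")
```

Two reads in a row are answered by two different peripheral servers, and both return the value that was written through the master.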

Exactly how and when servers synchronise with each other is up to the programmers who write these clustered databases.

Server clusters are generally very efficient at sharing the load of queries amongst each other. This is known as load balancing. This makes it easy for Google, for example, to repair several machines at once and, when they are done, simply plug the machines back in. The load is shared amongst the functioning servers and, if the cluster has enough functioning servers, there is often little impact on overall performance.

Newly repaired servers will spend some time catching up on the changes to the dataset that they might have missed (or rebuild the entire database if necessary) and then immediately begin helping to share the load of queries.
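As a rough illustration, the Python sketch below shares reads between servers in turn, skips any server that is offline, and lets a repaired server catch up by replaying the changes it missed. All class and method names are hypothetical, and the "change log" is just one simple way of synchronising; real clustered databases choose their own schemes.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.online = True
        self.data = {}
        self.applied = 0  # how many entries of the change log have been applied

    def catch_up(self, log):
        # Replay every change this server has not seen yet.
        for key, value in log[self.applied:]:
            self.data[key] = value
        self.applied = len(log)


class Balancer:
    def __init__(self, replicas):
        self.replicas = replicas
        self.log = []  # ordered list of every change made so far
        self._turn = 0

    def write(self, key, value):
        self.log.append((key, value))
        for replica in self.replicas:
            if replica.online:
                replica.catch_up(self.log)

    def read(self, key):
        # Take the servers in turn, skipping any that are offline.
        for _ in range(len(self.replicas)):
            replica = self.replicas[self._turn % len(self.replicas)]
            self._turn += 1
            if replica.online:
                return replica.name, replica.data.get(key)
        raise RuntimeError("no servers available")

    def repair(self, replica):
        # A repaired server catches up first, then rejoins the cluster.
        replica.catch_up(self.log)
        replica.online = True


a, b = Replica("a"), Replica("b")
balancer = Balancer([a, b])
balancer.write("x", 1)
b.online = False                      # server b breaks down
balancer.write("x", 2)                # b misses this change
served_by = [balancer.read("x")[0] for _ in range(3)]
stale = b.data["x"]                   # b still has the old value
balancer.repair(b)
fresh = b.data["x"]                   # b has caught up
```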

Google estimates that they have 100 machine failures each day, and yet the search engine has experienced no complete downtime since late 1999. An interesting statistic is that 100 machine failures is less than 1% of their total server infrastructure!

Deleting Information

It might interest you to know that, because of the danger of deleting and losing information, many production databases do not allow records to be deleted, except under extreme circumstances. Instead, each record has a field called “deleted” which is set to “No” by default. When the row needs to be deleted, the “deleted” field for that record is updated to “Yes”. It is then the responsibility of the calling queries to ignore the deleted records.
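A minimal sketch of this technique, using Python's built-in sqlite3 module, is shown here. The "orders" table and its contents are made up for the example; the "deleted" field with its "No" and "Yes" values follows the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Every record gets a "deleted" field that defaults to 'No'.
conn.execute(
    "CREATE TABLE orders ("
    " order_id INTEGER PRIMARY KEY,"
    " item TEXT NOT NULL,"
    " deleted TEXT NOT NULL DEFAULT 'No')"
)
conn.executemany(
    "INSERT INTO orders (order_id, item) VALUES (?, ?)",
    [(1, "pen"), (2, "ruler"), (3, "eraser")],
)

# "Deleting" order 2 is really an UPDATE, not a DELETE.
conn.execute("UPDATE orders SET deleted = 'Yes' WHERE order_id = ?", (2,))

# Ordinary queries must remember to ignore the deleted records.
visible = [
    row[0]
    for row in conn.execute(
        "SELECT item FROM orders WHERE deleted = 'No' ORDER BY order_id"
    )
]

# The record is still stored, so nothing has been lost.
stored = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

The ruler no longer appears in "visible", but "stored" still counts all three records.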

This is also done to prevent fraud in many companies: if deleting from the database is not allowed (this can be done through database security settings – a feature which is not present in Access), a record of all the transactions is stored, even if they are “deleted”.

Some databases are even more serious about the audit trail and are strict about updates as well. In these cases, no row can be updated: a new one can only be added. The queries then need to take this into account, for example by looking only at the most recent row for each item. It can get very difficult to write queries for databases like these, which need to take so many different constraints into account.
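One way this might look is sketched below with Python's sqlite3 module. The "balances" table, the set_balance() helper and the account names are hypothetical. Nothing is ever updated: a change is recorded as a new row, and the query for the current values has to pick the newest row for each account.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE balances ("
    " entry_id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " account TEXT NOT NULL,"
    " amount INTEGER NOT NULL)"
)


def set_balance(account, amount):
    # No UPDATE is used: every change becomes a brand-new row.
    conn.execute(
        "INSERT INTO balances (account, amount) VALUES (?, ?)",
        (account, amount),
    )


set_balance("alice", 100)
set_balance("bob", 50)
set_balance("alice", 80)  # "updates" alice by adding a newer row

# The current value is the row with the highest entry_id for each account.
current = conn.execute(
    "SELECT account, amount FROM balances"
    " WHERE entry_id IN (SELECT MAX(entry_id) FROM balances GROUP BY account)"
    " ORDER BY account"
).fetchall()

# The full history of an account is still available for auditing.
history = [
    row[0]
    for row in conn.execute(
        "SELECT amount FROM balances WHERE account = 'alice' ORDER BY entry_id"
    )
]
```

Even this small example shows how the query for "the current balance" becomes more complicated than a plain SELECT.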

All of these considerations are part of advanced database design, and thankfully we don't have to worry about that level of design at a school level.
