Week 2: Databases: types, formats, and practical considerations
Overview
This week provides a brief overview of the use of databases, but it is by no means exhaustive. If we sell you on the idea that databases are worth your time and the coolest thing since sliced bread, there are numerous books and online tutorials on the subject. Heck, there is probably even a real class on campus too!
Learning outcomes
1. Ways to store data
2. Setting up a relational database
3. Organizing data for efficient querying and manipulation
4. Working with a database from R (optional)
Database types and formats
Ways to store data:
Paper: Yes, there are still data that exist only in paper form!
Flat file: a single file of rows and columns of data
Text files: *.txt, *.csv, *.prn
Spreadsheets: *.dbf, *.xls, *.xlsx, *.ods (open office)
Relational database management system (RDBMS): Microsoft Access, MySQL, Oracle
Why should I use a database system? My trusty spreadsheet is working just fine....
A database CAN (and WILL) make your life easier by:
1) You want a job. Look at natural resource agency job listings - you'll see knowledge of MS Access or databases in almost every single one.
2) Giving you confidence in YOUR data
Makes you think about the data you will be collecting (integers, singles, double, text)
Limits redundant data entry, which minimizes data entry errors
Non-redundant data entries make correcting data entry errors (i.e., manual errors) easier
Standardization of data formats among years (easier synthesis for multi-year projects)
3) Allowing you to have everything ready to go for next site, season, and/or year
Multiple year studies
Standardized data formats among years (easier synthesis)
Data entry forms
Data reports
4) Easily create and save data queries for:
Cooperators and collaborators (gotta share with your friends)
Legality (freedom of information)
5) Easily create queries that can be reused annually
Take/disposition reports for permitting agencies
Mortality reports
IACUC/ACUP report numbers (How many critters do I have left before I need to amend?)
6) Data archival: create text file to stand the test of time and make your adviser happy...
7) File ambiguity: how many times have you tried to remember how you created a table in Excel? Or found a file with the term 'do not use' in the file name?
How do we go about setting up a database?
1) Think about the data you are going to collect! This will likely be a part of your proposal and a natural extension of the datasheets you will use in the field.
2) How will the data you collect be formatted? Will it be integers? Continuous values? Text? Some a priori set of values (e.g., present/absent or rare/common/abundant)?
3) Think about how your data relates to each other. For example, site or survey level data may have many critter-level data lines associated with it. This is what is known as the parent---child relationship. In the previous example, data collected at the site level is the parent and it can have many children---lines of critter data---associated with it. We will elaborate on this in the following section.
Relational data - I spend enough time with my data but I wouldn't say it is a relationship...
OK, what do we mean when we say "relational" data?
Data stored in various tables are related by unique identification numbers. Let's think about this simply. Say you have several sites: Site 1, Site 2, ..., Site 10. At each site, some site characteristics are recorded (e.g., date, weather) and critters are captured. Species, length, weight, and sex are recorded for each critter captured, and there may also be comments recorded for some critters. A datasheet for the data collected above might look like this:
Figure 1.—A "typical" natural resource datasheet and data.
How can we store this data when we get back to the office, field trailer, tent, coffee shop, or car ride to a conference?
Obviously these data could be entered into a spreadsheet like Excel.
Figure 2.—How one could enter the data from the datasheet in Figure 1 in a spreadsheet. See file bad_excel.xlsx
So what's the big deal with this? Redundant data entry increases the chance of error. Why should all this site-level data be entered 7 additional times? It seems like a waste of effort (and hard drive space). I realize it is easy to copy and paste the entered data, but hopefully I will demonstrate how relational databases can make your life easier.
If we refer back to the datasheet, we see there are natural structural breaks in the data collected shown in Figure 3.
Figure 3.—The red boxes denote the natural breaks in the data collected and recorded on this mock datasheet.
The circled data can then be entered into two separate tables and linked by a relational identification key as shown in Figure 4 below.
Figure 4.—The two tables show how one could enter data and then relate that data using a unique id. This seems much better than the spreadsheet in Figure 2. See sheet 2 in bad_excel.xlsx
Let's focus on the ID number in table 1 and the corresponding id numbers in table 2 for now. In the tables in Figure 4, site data and critter data are linked by a common site id (siteId = 1). There is a single row of data associated with the 1 in the site data table, but there are many rows of data associated with 1 in the critter data table. This is what is referred to as a parent (site data table) and child (critter data table) relationship, and it is the basis for setting up a relational database.
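If it helps to see the parent-child structure outside of Access, the sketch below builds the same two-table layout in SQLite via Python's sqlite3 module (since Access itself cannot be scripted here); the table and field names are illustrative, not taken from the course files.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Parent table: one row per site visit.
cur.execute("CREATE TABLE site_data (siteId INTEGER, visit_date TEXT, weather TEXT)")
# Child table: many critter rows can point back to one siteId.
cur.execute("CREATE TABLE critter_data (siteId INTEGER, species TEXT, length_cm REAL)")

# The site information is entered once...
cur.execute("INSERT INTO site_data VALUES (1, '2016-06-01', 'sunny')")
# ...while each captured critter gets its own row carrying only the siteId.
cur.executemany("INSERT INTO critter_data VALUES (?, ?, ?)",
                [(1, 'cutthroat', 12.0), (1, 'sculpin', 6.5), (1, 'cutthroat', 15.2)])

# One parent, many children: no site data was re-typed for each critter.
n_children = cur.execute(
    "SELECT COUNT(*) FROM critter_data WHERE siteId = 1").fetchone()[0]
print(n_children)  # 3
```

The key point is that the only duplicated value is the siteId itself; everything else about the site lives in exactly one row.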
Setting up a relational database.
The first step is to create a database. In this case, we will be using Microsoft Access since it is readily available and, in my experience, used almost ubiquitously in natural resource agencies.
If you are using:
A Macintosh: there is no MS Access for Macs, so you will have to do this lesson using Citrix Apps. Instructions for using Citrix Apps are included below.
A PC: if you have Microsoft Access installed, you should be OK.
Citrix Apps allows you to run programs on a remote server, so that you can run Microsoft programs on a Mac or programs that you don't own remotely. Plus you can run programs from your smartphone.
Step 1. Map your ONID drive to your computer. It is best to save and download all of your files to this location.
For mac, follow these instructions: Map Onid Drive
For windows, follow these instructions: Map ONID Drive on Windows
Step 2. If using a personal computer, go to apps.oregonstate.edu and install Citrix Receiver. This is done for you on the campus lab computers. Instructions can be found here: http://oregonstate.edu/helpdocs/book/export/html/1038
Step 3. Open Citrix Receiver on your computer, and enter your password as prompted to log in. NOTE: On a personal computer, you may need to first establish a VPN connection.
Step 4. Under the Apps tab, open Access (the first application listed). You can also make it easier to find by adding it to your Favorites tab by clicking on Details and selecting 'Add to Favorites'.
Step 5. Access should open and we are ready to go! Make sure that you download the files to your ONID drive.
Getting started with Access
It is a simple process (note these instructions are specific for Access 2016 - other versions may be slightly different):
1) open the software (either in Citrix Apps or on your Windows PC)
2) click on 'Blank desktop database'
3) give it a name and save it!
Now you have built your first database! But we need to add some parts to the database so we can use it to store data.
Figure 5.— Screen shots for setting up a database. Pretty easy!
There are 3 relevant parts of a database that we will be going over:
Tables: a place to store data
Forms: a tool for data entry
Queries: a tool to retrieve stored data
Tables
Setting up tables
Now that you have created a database, it is time to create tables to hold the data (in a relationship, of course). Tables are where data are stored in databases. These tables are similar to what you would expect to find in a spreadsheet program, but there are some important differences. The biggest difference is that YOU have to specify what type of data will be contained in each field (fields are the same as columns).
Why? First, it helps to make sure that the data are entered correctly. Second, it can save on memory. Historically, computers were memory limited; hard drives were measured in kilo- or megabytes, not gigabytes, and disk space was at a premium. It takes 2 bytes to store a character, so setting up a database took some hard thinking about how to maximize the amount of data you could store while minimizing the file size. For example, let's think about years: storing the last two digits of a year (1999 as 99) requires 4 bytes as opposed to 8 (ever heard of the Y2K bug?).
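If you want to check that byte arithmetic, the snippet below uses Python's struct module to report the sizes of the common numeric field types on a typical platform; the 2-bytes-per-character figure for text is the assumption stated above.

```python
import struct

# Storage sizes of the common numeric field types on a typical platform.
print(struct.calcsize('h'))  # short integer: 2 bytes
print(struct.calcsize('i'))  # long integer:  4 bytes
print(struct.calcsize('f'))  # single:        4 bytes
print(struct.calcsize('d'))  # double:        8 bytes

# Text at 2 bytes per character: "99" costs 4 bytes, "1999" costs 8.
bytes_per_char = 2
print(len("99") * bytes_per_char)    # 4
print(len("1999") * bytes_per_char)  # 8
```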
Setting up your tables requires some thinking about how your data is structured and formatted. The most commonly used field types (in my experience; see Table 1) are:
Single or Double: used for continuous data (numbers with decimals). The double refers to double precision for decimals.
Short or Long integer: used for integer data. If it is a long integer (e.g., PIT tag numbers), specify it as Long.
Text: for text entries - only allows up to 255 characters including spaces
Long Text or Memo: for text entries. Allows an excess of 60,000 characters
Date/time: pretty straightforward (formatting as a date will pay off later when we talk about working with dates in R)
Autonumber: this will be important when setting up a relational database.
Table 1.—Commonly used field types for ecological data.
Lets set up two tables to hold the data from the datasheet above using the field types identified in Table 1. Let's start with the parent table for site level data - we want to have all of the fields in the data sheet from Figure 3 above. A good practice in setting up tables is to provide a Field description so that others know exactly what is in the field and what units the measurements are.
Below is the final setup for the survey data table; this is what yours should look like.
Figure 7.— Complete formatting of fields for the survey data table.
When we are done creating the fields and formatting them, we press the view table button in the upper left hand corner. Access will tell you that you need to save the table and you should go ahead and do so. Then you can see something like this:
Figure 8.—Table view of the data table created to hold the data. This is very similar to a spreadsheet. Not too scary, right?
Some things to notice about the table in Figure 8: in the survey_ID field the entry is just "(New)", which means Access will assign it a number automatically (i.e., an autonumber). In the lower left-hand corner, the field description we entered when creating the table is visible.
Once the table to hold your survey data is created and saved we need to create a table to hold the critter data as shown in Figure 9 below.
Figure 9.—Design view for a table being constructed to hold critter data. One important thing to notice here is that there is a field for critter_ID but also a field for survey_ID. The survey_ID field will link back to the survey_data table.
Additional table information and considerations
Captioning field names
The field names as they are typed in during table construction (see Figures 6, 7, and 9) are what will be 'spit out' when you start to query the database and read those queries into a number-crunching program. But for looking at your data in tables (and this will be important for forms and reports), you may want to provide a caption for your field names. Field captions provide additional information about the field. For example, the critter table has a field named "length." Pretty dull, right? When we look at the table, the field name is just 'length' and nothing more. Using a field caption, you can change the displayed field name to something more informative like "Length (cm)". This is done by returning to the design view of the table (press the triangle-looking button on the top left), putting your cursor on the offending field, and typing whatever caption you would like (Figure 10).
Figure 10.—Adding a caption to a table field. In this example, the length field will be displayed as "Length (cm)" which provides more information than just plain old "length."
Validation rules---don't you like to feel validated?
You can also build additional screening into your database using validation rules. This can be important if you are entering lots of data and only screening a subset. These rules work in addition to the data types you have already set up. Specifically, by specifying a field as an integer, you cannot enter letters into that field, but you could still enter some funky numbers. Validation rules limit the funky data that can happen when you have multiple technicians entering data. Let's work on the same 'length' field we added the caption to in Figure 10. In this example, we have never encountered a length greater than 300 cm. Given that, and the fact that no critter can have a length of zero or less, it makes sense to constrain data entered in that field to be greater than 0 and less than 300. This is easily done by adding a validation rule as shown in Figure 11 below.
Figure 11.—Adding a validation rule to the length field. Notice that the field name is enclosed by brackets. This will be important to remember when we want to make calculations in queries.
With the validation rule in place, you can no longer enter values outside of the specified range. This can be very useful! Here is an example of what happens if you try (Figure 12):
Figure 12.—This is what happens when you enter data outside of the validation rule.
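Validation rules are not unique to Access; in SQL they correspond to CHECK constraints. A sketch with SQLite via Python's sqlite3 (field names are illustrative) shows the database refusing an out-of-range length, much like Figure 12:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# The Access rule ">0 And <300" becomes a CHECK constraint.
cur.execute("""
    CREATE TABLE critter_data (
        species   TEXT,
        length_cm REAL CHECK (length_cm > 0 AND length_cm < 300)
    )""")

cur.execute("INSERT INTO critter_data VALUES ('cutthroat', 12.0)")  # accepted

rejected = False
try:
    cur.execute("INSERT INTO critter_data VALUES ('cutthroat', -5)")
except sqlite3.IntegrityError:
    rejected = True  # the out-of-range value never reaches the table
print(rejected)  # True
```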
Primary keys and indexes
You will want to specify a primary key for your table. This will almost always be the ID you set up (often as an autonumber).
Basically, a primary key sets up a field such that it is indexed, does not allow duplicates, and cannot be a null value. For more information see what the folks at Microsoft have to say about it here. When in doubt add a primary key as an autonumber (i.e., set it and forget it!). You might need it later! Each table should have a primary key and these will be used to create relationships among data tables.
Figure 13.—Setting a primary key for the survey table.
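Access's AutoNumber has a direct analogue in other systems. In SQLite (shown via Python's sqlite3; names are illustrative) an INTEGER PRIMARY KEY AUTOINCREMENT column is filled in by the database itself, just like the "(New)" placeholder in Figure 8:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("""
    CREATE TABLE survey_data (
        survey_id   INTEGER PRIMARY KEY AUTOINCREMENT,  -- the 'autonumber'
        survey_date TEXT
    )""")

# Leave survey_id out of the INSERT and the database assigns it.
cur.execute("INSERT INTO survey_data (survey_date) VALUES ('2016-06-01')")
cur.execute("INSERT INTO survey_data (survey_date) VALUES ('2016-06-02')")

ids = [row[0] for row in
       cur.execute("SELECT survey_id FROM survey_data ORDER BY survey_id")]
print(ids)  # [1, 2]
```

Because the database hands out the key, it is guaranteed unique, which is exactly what a relationship needs on the parent side.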
Creating relationships
At this point we have 2 empty tables created in our database. Now we need to relate them! To do this go to the "Database tools" tab and click the "Relationships" button. This will bring up a window where you can add tables to the relationship view. Add the tables you want to be related (in our case survey_data and critter_data tables). Below is a video demonstrating how this can be done for our survey and critter data tables.
VIDEO - CreateRelationship
Creating a form to make data entry easier
Setting up forms for data entry
Data can be entered directly into the table, which is a pain, since you have to enter data in the parent table and then make sure related data are entered into the appropriate child table and correctly related. The more efficient approach is to create a form. Setting up a form is super easy because we have already set up a relationship between the two tables. Just select the parent table (survey data), click Create on the menu bar, and then select 'Form'. You can edit the form from there. Alternatively, you can create a form using the wizard or form design views; just do a Google search for great examples if you have questions.
Look over your form (Figure 14). I usually delete the survey_id since it is automatically populated and this keeps technicians from putting data where it shouldn't be and I will hide the critter_id since it too is automatically populated (you can hide a field by left clicking on the name and then clicking on 'hide field' from the menu that pops up). Here is a video that demonstrates how to create this form.
Figure 16.—A form view of the 2 forms we set up using the form wizard with data entered! What is going on in the background? Once data is entered, Access enters an autonumber and then it propagates to all the related records in that form!
For instructions on how to customize your forms go here.
Querying your data
Queries are the workhorse for getting your day-to-day data needs satisfied. Queries are constructed using the underlying relationships you have specified and Structured Query Language (SQL). But programs like Access and Base do not require you to know how to write SQL; they write it for you as you construct a query. The nice thing about queries is that you can save them for future use. And keep in mind that since a query is actually just a bit of SQL code, each time you open your database and run your query you ALWAYS have the most recent data!
You create a query by going to the Create tab in your database and clicking "Create query by design". Below is a video that demonstrates how to create a query.
VIDEO— Create Query
Notice that the relationship we created is already there and denoted by the arrow!
Figure 17.—The results of the query set up in the video.
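Behind the query designer, Access is writing a SQL SELECT that joins the two tables across the relationship. A sketch of the equivalent query in SQLite via Python's sqlite3 (table and field names are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE survey_data (survey_id INTEGER PRIMARY KEY, survey_date TEXT)")
cur.execute("""CREATE TABLE critter_data (
    critter_id INTEGER PRIMARY KEY, survey_id INTEGER, species TEXT, length_cm REAL)""")
cur.execute("INSERT INTO survey_data VALUES (1, '2016-06-01')")
cur.executemany("INSERT INTO critter_data VALUES (?, ?, ?, ?)",
                [(1, 1, 'cutthroat', 12.0), (2, 1, 'sculpin', 6.5)])

# Each critter row comes back paired with its parent survey's date.
rows = cur.execute("""
    SELECT s.survey_date, c.species, c.length_cm
    FROM survey_data AS s
    JOIN critter_data AS c ON c.survey_id = s.survey_id
    ORDER BY c.critter_id
""").fetchall()
print(rows)  # [('2016-06-01', 'cutthroat', 12.0), ('2016-06-01', 'sculpin', 6.5)]
```

The join condition (ON c.survey_id = s.survey_id) is exactly the relationship drawn as the arrow in the designer.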
Subsetting in queries
It is very convenient to subset data using a query. For example, you might be interested in only a single species or year; alternatively, you might want several critters. This is done by adding the constraints of interest into the 'criteria' field of the query in design view. Here are some commonly used commands for ecological data:
Table 2.—Common commands used when subsetting data in a query.
NOTE: Access is not case sensitive, but you still need to spell it right!
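In SQL terms, the 'criteria' row of the query designer becomes a WHERE clause. A few common patterns, sketched with SQLite via Python's sqlite3 (the species and field names are made up for the example):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE critter_data (species TEXT, length_cm REAL, year INTEGER)")
cur.executemany("INSERT INTO critter_data VALUES (?, ?, ?)", [
    ('cutthroat', 12.0, 2015), ('cutthroat', 15.2, 2016),
    ('sculpin', 6.5, 2016), ('dace', 4.1, 2016)])

# A single species...
one_species = cur.execute(
    "SELECT COUNT(*) FROM critter_data WHERE species = 'cutthroat'").fetchone()[0]
# ...several species at once...
several = cur.execute(
    "SELECT COUNT(*) FROM critter_data "
    "WHERE species IN ('cutthroat', 'dace')").fetchone()[0]
# ...or a compound criterion combining year and size.
one_year = cur.execute(
    "SELECT COUNT(*) FROM critter_data "
    "WHERE year = 2016 AND length_cm > 5").fetchone()[0]
print(one_species, several, one_year)  # 2 3 2
```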
Calculations in queries
Remember those pesky derived variables that I said you just shouldn't calculate prior to data entry? What I am talking about is summarizing data, calculating densities, and so on... Just don't enter those! Queries are the place to do these calculations (or you could do them in R). Why? We are trying to minimize the places where human error (whether in data entry, manipulation, or calculation) can rear its ugly head and be propagated. Here are some common things you might want to do with the ecological data you have collected.
Convert from cm to mm: code below will make a new field "length_mm" where length in cm is converted to mm
length_mm: [length]*10
Assign some state based on data collected: code below shows an if/else statement that uses length to classify juveniles and adults (usage- iif(condition, true, false))
stage: iif([length]<80, "juvenile", "adult")
Calculate some derived number from a couple of fields: the code below shows how a condition factor (weight relative to length cubed) could be calculated
condition: [weight]/[length]^3
Other date manipulations: pretty straightforward
year: year([survey_date])
month: month([survey_date])
See the query named 'derived_numbers' in database_demo for an example.
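For readers who want to see the SQL behind such calculated fields: the expression simply goes in the SELECT list, and Access's iif() becomes CASE WHEN in standard SQL. A sketch with SQLite via Python's sqlite3 (the 8 cm juvenile/adult cutoff and the field names are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE critter_data (length_cm REAL, survey_date TEXT)")
cur.executemany("INSERT INTO critter_data VALUES (?, ?)",
                [(7.5, '2016-06-01'), (9.0, '2016-07-15')])

rows = cur.execute("""
    SELECT length_cm * 10 AS length_mm,                    -- unit conversion
           CASE WHEN length_cm < 8 THEN 'juvenile'
                ELSE 'adult' END AS stage,                 -- iif() equivalent
           CAST(strftime('%Y', survey_date) AS INTEGER) AS yr,
           CAST(strftime('%m', survey_date) AS INTEGER) AS mo
    FROM critter_data
    ORDER BY length_cm
""").fetchall()
print(rows)  # [(75.0, 'juvenile', 2016, 6), (90.0, 'adult', 2016, 7)]
```

Note that nothing derived is stored anywhere; the raw lengths and dates stay in the table, and the calculations happen fresh every time the query runs.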
Derived numbers: LET THE COMPUTER DO IT! Databases should contain the raw data with no derived variables. For example, if you record the area of habitat surveyed and the number of critters collected, enter only the area and the count; do not include the density (i.e., number/area). Density is easy to calculate with queries or once the data are imported into your analysis system. Plus, a clever analyst might use a GLM to model the counts (assuming a Poisson distribution) with the area as an offset...
Summary queries
Queries can also be used to generate summaries. These can be very useful when you need to summarize your data to get the count of individuals or species, the average temperature, or some other measure. Here is a video of how to calculate the number of animals observed at each survey using the survey and critter data tables:
VIDEO: Summary Query
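The SQL underneath a summary query like the one in the video is an aggregate with GROUP BY. Sketched with SQLite via Python's sqlite3 (table and field names are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE critter_data (survey_id INTEGER, species TEXT)")
cur.executemany("INSERT INTO critter_data VALUES (?, ?)", [
    (1, 'cutthroat'), (1, 'sculpin'), (1, 'cutthroat'), (2, 'dace')])

# One output row per survey, with the count of critters observed at each.
counts = cur.execute("""
    SELECT survey_id, COUNT(*) AS n_critters
    FROM critter_data
    GROUP BY survey_id
    ORDER BY survey_id
""").fetchall()
print(counts)  # [(1, 3), (2, 1)]
```

Swapping COUNT(*) for AVG(), MIN(), or MAX() gives the other common summaries (e.g., mean temperature per survey) with the same query shape.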
Other types of queries (rarely used)
The queries used in the previous examples are what are called select queries. You will likely use select queries for close to 100% of the queries you do in your career. But you may have occasion to use other types: an append query adds the rows of one table to another, an update query changes the values in a field, a delete query deletes rows from a table, and a make table query generates a new data table. Unless you really know what you are doing, stay away from these!
Practical considerations for building a relational database
Making data easy to subset and manipulate
Things to consider:
Should I capitalize? - doesn't matter for Access but it does in R!
IDs (sample ids, species ids, and so on)
Choose simple IDs that are unique
Avoid coding schemes where numbers and letters can be confused (e.g., 1 and l)
Make missing data clear: One good choice is to use "-99" in fields.
Why? It is easy to subset (values >= 0, or values != -99), and it works for text and ALL numeric fields (integer, single, double)
It is easy to deal with in R
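Here is the -99 convention in action, sketched with SQLite via Python's sqlite3 (field names are illustrative): one extra WHERE clause keeps the sentinel out of a calculation.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE survey_data (survey_id INTEGER, secchi_m REAL)")
# Survey 2 has a missing secchi reading, recorded as -99.
cur.executemany("INSERT INTO survey_data VALUES (?, ?)",
                [(1, 2.0), (2, -99), (3, 4.0)])

# Including the -99 sentinel would badly bias the mean...
naive = cur.execute("SELECT AVG(secchi_m) FROM survey_data").fetchone()[0]
# ...so subset it out first.
clean = cur.execute(
    "SELECT AVG(secchi_m) FROM survey_data WHERE secchi_m <> -99").fetchone()[0]
print(naive, clean)  # -31.0 3.0
```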
Wide versus long data
What does that mean? What do you mean, wide data or long data? Data is data, right? Okay, here we go: the two tables in Figure 22 contain the same data, but one is in wide format and the other is in long format. Which do you prefer? Which is easier to manipulate in an analysis? Which is more flexible?
Figure 22.—Wide (left table) and long (right table) formats for the same data. Notice that there are no 0 counts included in the long-formatted data. Problem? No, not at all; this can easily be accounted for in contingency tables.
In my experience it is usually best to go long rather than wide, especially with multi-species data. This may seem counterintuitive since it is easier to summarize data in wide format, right? Yes and no. It is easier to summarize wide data in Excel, but once you are out of the spreadsheet, things get very difficult if you are in wide format!
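One reason long format travels well: it is easy to pivot back to wide with a query when you do need a summary table. A sketch with SQLite via Python's sqlite3 (site and species names are illustrative); note that a 0 count appears in the output even though no 0 rows were entered, which is the contingency-table point from the Figure 22 caption.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# Long format: one row per site-by-critter observation.
cur.execute("CREATE TABLE critter_data (site INTEGER, species TEXT)")
cur.executemany("INSERT INTO critter_data VALUES (?, ?)", [
    (1, 'cutthroat'), (1, 'cutthroat'), (1, 'sculpin'), (2, 'sculpin')])

# Long in, wide out: one row per site, one column per species.
wide = cur.execute("""
    SELECT site,
           SUM(CASE WHEN species = 'cutthroat' THEN 1 ELSE 0 END) AS cutthroat,
           SUM(CASE WHEN species = 'sculpin'   THEN 1 ELSE 0 END) AS sculpin
    FROM critter_data
    GROUP BY site
    ORDER BY site
""").fetchall()
print(wide)  # [(1, 2, 1), (2, 0, 1)]
```

Going the other direction (wide to long) is much more painful, which is a practical argument for storing the data long.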
Exercises
Due 1 week from today by 5pm Pacific. The exercises for this week make use of the following files:
mock_datasheet.pdf, database_exercise.accdb.
A) Using the mock_datasheet.pdf answer the following questions in a word document:
How many tables would you use to organize the mock data in a database?
How would you relate those tables?
B) Using the database_exercise.accdb answer the following questions in the same word document:
How is each table related?
What are the parent and child tables?
C) Using database_exercise.accdb perform the following in the database:
Make a query that includes Date, Gear, Species, Length, and Weight, include 2 new fields in the query, one for month and one for year and save the query as "dataset".
Make a query that summarizes the mean secchi for each survey conducted. The query should also include the survey date and be saved as "mean secchi" (hint- you may want to include survey_id in your query).
Please save all of the code you used in a single script. Submit a *.docx or *.rtf document for questions A and B, and submit the database for question C, to Jim one week from today by 5pm.
Suggested additional reading
Chapter 8 (Managing and curating data) p 207-238 in Gotelli, N. J., and A. M. Ellison. 2004. A primer of ecological statistics. Sinauer Associates, Inc, Sunderland, MA.
Elizabeth T. Borer, Eric W. Seabloom, Matthew B. Jones, and Mark Schildhauer 2009. Some Simple Guidelines for Effective Data Management. Bulletin of the Ecological Society of America 90:205–214. http://dx.doi.org/10.1890/0012-9623-90.2.205
BONUS: Getting data into R from a database (completely optional- not covered during lesson)
Importing data from a database is a bit trickier than importing *.csv files; it requires a way of communicating with the Access database. This is done using ODBC drivers, which requires the RODBC package. Special note: the ODBC drivers ONLY work with the 32-bit version of R, so if you are working on a 64-bit operating system you just need to modify your RStudio startup to default to the 32-bit version of R, which is installed along with the 64-bit version. See here
The first thing to do is to make an object that is a communication 'line' to your database which is done using the code below:
install.packages("RODBC")
require(RODBC)
com_line<- odbcConnectAccess2007("C:/.../my_database.accdb")
Some things to know about here
odbcConnectAccess2007() works for both Access 2007 and 2010. If you need to communicate with a database with the older *.mdb extension, you can use the same bit of code, but the function is odbcConnectAccess()
Normally I like to make use of the working directory, but in this case I always hard-code the path to the database
You can use the sqlTables() function to return tables AND queries that are stored in the database.
sqlTables(com_line)
You can import a table (or, more likely, a query) using the sqlFetch() function.
my_data<- sqlFetch(com_line,"name of a table or query")
Using database_demo.accdb this whole process looks like:
com_line<- odbcConnectAccess2007("C:/.../database_demo.accdb")
sqlTables(com_line)# you should see an object listed "derived_numbers"
my_data<- sqlFetch(com_line, "derived_numbers")
my_summary<- sqlFetch(com_line, "summary_query")# another query from the same dbase
Sold yet? The convenient thing about using a database with something like R is that there are no "intermediate files": no saving things to a *.csv for analysis and then wondering (and hoping) whether your data are all correct and current.
Every time you open your R script and connect to your database, you are for sure going to have the most up-to-date version of your data!
BONUS: Avoiding and correcting database errors
Database audit and QA process overview: Confidence in your data comes from knowing that they are free of entry errors and are consistent (i.e., no shifted decimals). This process is known as quality assurance. The QA process below has evolved over my experience working for various agencies as well as through my own research. It is by no means perfect, but it has worked pretty efficiently thus far and will hopefully provide a jumping-off point for others.
Photocopy original datasheets
Archive unaltered original datasheets (preferably in a secure and separate location)
Have technicians enter data from photocopies to database using a FORM that replicates the datasheet
Generate a REPORT to perform error checking
Highlight data entry errors on photocopied datasheet
Correct highlighted errors in database and initial highlighted entry on the datasheet
Assign data line a QA code. I like an integer scale varying from 1 (best quality) to 5 (worst quality). This also provides a nice way to evaluate potential analysis outliers in residual plots and it is easy to subset data!
Archive the report and photocopied datasheets, which effectively creates a fully annotated data audit trail!
Query and analyze to your heart's content!