Setting up Google Sheets as a lightweight relational database using the Star Schema
A relational database is composed of multiple tables that can be connected through relations or queries. In the final analysis, typically multiple tables are joined into one table. But during the entry and management of the data, the tables are kept separate. The reason for this is to avoid repetition of information and better quality data management. These two aspects are linked.
The Star Schema is the 'blood and bones' of most relational databases.
Check this example database that has been set up in Google Sheets using this Star Schema with metadata
This type of database management is intuitive and easily recognizable. As is in the name, this schedule can be schematized using a star. Here, our datatable containing our main observed data is at the center of the star. This datatable is continuously updated as we collect new data. It is typically empty when we start the work and is filled if we go along. These types of tables we call “Fact Tables”. Such Fact Tables, say data collected on the vegetation composition of plots, or soil moisture along a transect, is related to data characterizing fixed plots or transects to perform this collection with their spatial location, as well as the list of species that can be potentially observed with their traits. This type of data typically does not change over time. We may already compose these tables before we start the observations. For example, sampling locations of an experiment are fixed, species taxonomic information (family, order) does not change, or the list of potential species that we can observe in the study. The data tables that contain this type of information are call in Dimension Tables. The first variable in a dimension table is its ID variable, where each row has an unique identifier for that record, e.g., a plot_ID or a species_ID. This can be a code or a number. A Fact Table also starts with an ID variable characterizing that observation, but followed by usually one or several ID variables from Dimension table. For example in a fact table of observations on species composition of plots, the Species_ID and the Plot_ID, followed by an abundance measure (as % cover or number of individuals).
In a good relational database, each table thus starts with an ID variable also called its primary key. It has a unique value for each row in the table, and forms the connection to other datatables in the database. Dimension tables therefore typically reflect information about your study design. They are tables of your list of plots, the individuals involved in your study, a list of study areas, etc. They are lists of objects where you have different kinds of information about each object. They typically provide predictor variables, such as experimental treatments, species names, etc. The fact tables are composed of observations characterized by multiple ID variables from dimension tables, plus the response variables of your study such as abundances, behaviour, physiological or behavioural responses, aspects of vegetation structure, soil properties, etc. Dimension tables relate to fact tables, by providing the potential list of objects for different variables in the fact table.
These Dimension Tables are therefore connected to the Fact Tables by the points of the star (Fig 1). Meaning the combination of Dimension Tables feeding information into the Fact Table can create a complete dataset. These Dimension Tables contain all information on the sampling locations or species. For example, taxonomic information, coordinates, elevation and anything else that is fixed over time. These Dimension Tables can then be easily merged with the Fact Table using ID variables. This way, when entering data, we don’t need to enter all the information on the sampling locations everytime, but rather just enter the ID variable that belongs to the sampling location and all information can be found in the Dimension Tables.
But each point of the star is not limited to a single Dimension Table. Multiple dimension tables can be connected in a chain. For example: We are collecting data on vegetation types on 3 different locations. The different vegetation types are collected in the data table “FactVegType” (Fact for denoting that it is a Fact data Table, and VegType to indicate what is contained in the data table).
Here you see both Site_ID and Species_ID are an ID code that can be related to Dimension Tables “DimSpecies” and “DimSite” (Dim since it’s a dimensional datatable and Species or Site to indicate what type of information it contains). The first columns in these tables are Species_ID for DimSpecies and Site_ID for DimSite. The other columns contain other information about these species and locations.
Note that in the table FactVegType only the column Site_ID occurs, but not the column Xcoord. This is because if you know that a row is Site_ID is 001, you can lookup in the table DimSite that the elevation is 201 . So no need to replicate that information in this table. Adding this column will be done in the R script with a join_left() command.
Similarly in the FactVegType table you do not need to add a columns with the full scientific name of the species, as this can be looked up from the table DIMSpecies when you know the Species_ID. Through a data validation in Google Sheets you can ensure only these specific Site_ID’s and Species_ID’s can be filled out in FactVegType to ensure the integrity of your dataset as well as prevent spelling errors. After this, you can simple do left_join from dplyr in R and you have all the information you need assembles together in one table. But you do not manage the data as one table. In this case the following would produce a dataframe containing all variables from the above three example tables:
Alldata <- FactVegType |>
dplyr::left_join(DimSite,by='Site_ID') |>
dplyr::left_join(DimSpecies,by='Species_ID')
It contains all records of FactVegType, and adds the additional information known about plots from DimSite and the additional information known about species from DimSpecies.
Working with species codes
In some cases it is not handy while entering the data, to work with species numbers, as you may more easily remember species codes. In this case, you may enter the data as Species_code (with a regular validation to potential codes) and adding the the Species_ID 'automatically' through a xlookup formula already in Google Sheets in a column after the column containing the species code. Of course we can also do this later in a join
Documenting Metadata
Finally, to ensure that all data tables can be understood clearly, we also have information about the data tables and the variables used in these: your meta data. The datatables containing this meta data, for example the data dictionary with explanations of all variables found within the dataset, we call “Metvariables”. Here, Met to indicate it is MetaData, and Variables to indicate what type of metadata this datatable contains.
Example
See this full example of a star-schedule database with Dim and Fact tables