Data storage solutions

Data storage in Excel: easy, flexible, but with low data accuracy, and poor version control

Many ecologists use Excel spreadsheets to store data because they are easy to create and intuitive to use. Data entry is quick, thanks to the simple copy-paste functionality across columns and rows, and related tables can be organized into multiple sheets within a single workbook. While Excel offers great flexibility, this also poses challenges, particularly in two key areas of data management: data accuracy control and version control.

First, Excel lacks built-in safeguards to ensure low data accuracy control. For instance, it does not enforce consistency in data types within a column (e.g., text, dates, or numbers) and does not automatically validate entries against predefined codes or categories. This increases the risk of errors during data entry and organization.

Second, using Excel as the central dataset in a collaborative project can lead to poor version control. Often, each team member maintains their own copy of the dataset, which becomes problematic if the data are still evolving—when new data are added or errors are corrected. This increases the risk that collaborators may unknowingly be working with different versions of the dataset, leading to inconsistencies and potential miscommunication.

Local or server-based database software: difficult, inflexible but high data accuracy and good version control

These downsides of Excel of low data accuracy and poor version control are typically addressed by using instead relational database software for data storage. Local databases, such as Microsoft Access, FileMaker Pro, or SQLite can solve the data accurracy issue. They offer easy setup and flexibility but still carry the risk of each collaborator maintaining their own copy, so not solving version control issues. This version control problem can be mitigated by using server-based database management systems like MySQL, PostgreSQL, or Microsoft SQL Server, which create a centralized version control system (CVCS) for data. CVCS ensures that all collaborators access the same, up-to-date dataset, making it a technically sound solution widely adopted by large organizations for managing extensive databases.

However, these server-based database systems come with significant drawbacks. They are more complex to set up and less intuitive for new users (such as students), offer less flexibility, and require ongoing maintenance, including a central server and a dedicated database administrator.

Best of both worlds: using Google Sheets as a Relational Database

The above solutions suggest that there is a nescessary trade-off between good ease-of-use, flexibility versus good data accuracy and version control. But instead we want a bit best of both worlds. We use Google Sheets as a relational database as such an intermediate solution. It has the ease-of-use and flexibility of a spreadsheet solution, also various aspects of accuracy control and version management control of server-based database systems.

See this example database illustrating the use of Google Sheets as a Relational Database

Page updated

Report abuse