Many aspects of research data management are embedded in the other pages on this site. If you Google "research data management best practices," you will get a host of sites that incorporate principles from data storage, data security, and open science. Washington State University and Ohio State University offer helpful summaries of many effective data management practices and strategies.
However, data management practices alone do not address the critical relationship between data and code. Code that you write in statistical software is usually the engine that turns raw data files into research results through a workflow, i.e., an organized series of reproducible steps.
The university offers a number of systems for data storage, but how should you organize and think about the workflow and the relationships between the files you store there? How do you get from your raw data set(s) to a result in a reproducible way?
Keep in mind that code and data are separate. Code files are the instructions you give to the software telling it what to do with the data. They are just plain text files that can be read by any text editor, though your IDE will format them and highlight keywords according to the programming language you use. You will write code to perform operations on the data, commonly including:
reading in or importing the raw data files to a format native to your chosen statistical software package
cleaning, transforming, and merging imported data sets to create derived data files
conducting statistical investigations on analysis data files, which are the "final" derived files, i.e., the end result of cleaning, transforming, and merging
outputting statistical results and visualization files (writing code to do this improves accuracy and saves time over manual data entry into a spreadsheet for table or figure creation)
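The steps above can be sketched as a minimal pipeline. This is a sketch, not a prescription: it uses Python's standard library (csv, statistics) rather than any particular statistical package, and the file names, column names, and helper functions are invented for illustration.

```python
import csv
import statistics

def import_raw(path):
    """Read a raw CSV file into an in-memory data set (a list of dicts)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def derive(rows):
    """Clean and transform: drop rows with a missing score, cast types.
    The 'score' column is a hypothetical example variable."""
    out = []
    for r in rows:
        if r["score"] != "":
            out.append({"id": r["id"], "score": float(r["score"])})
    return out

def analyze(rows):
    """Compute a simple summary statistic on the analysis data set."""
    scores = [r["score"] for r in rows]
    return {"n": len(scores), "mean": statistics.mean(scores)}

def write_results(result, path):
    """Write results to disk by code instead of retyping them by hand."""
    with open(path, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(result.keys())
        w.writerow(result.values())
```

A real project would replace each stage with the equivalent steps in your chosen statistical software, but the shape (import, derive, analyze, output) stays the same.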
Plan your folder organization. This is an art, not a science, but a good place to start is to have a folder for the project on your chosen storage solution. Within that, have separate sub-folders for data sets, code files, and results. Keeping code files in their own folder allows you to easily set up version control at any point. Depending on the complexity of your project, you may need additional sub-folders (e.g., raw, imported, and derived sub-folders if you are working with a large number of data sets).
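As a sketch of one possible starting layout, a short Python helper can create the sub-folders described above. The folder names here are illustrative assumptions, not a required structure; adapt them to your project.

```python
from pathlib import Path

def set_up_project(root):
    """Create a starter folder skeleton for a new analysis project.
    Sub-folder names are examples only; rename to suit your project."""
    root = Path(root)
    for sub in ["data/raw", "data/imported", "data/derived",
                "code", "results"]:
        (root / sub).mkdir(parents=True, exist_ok=True)
    # Return the created folders so the layout can be inspected.
    return sorted(p.relative_to(root).as_posix()
                  for p in root.rglob("*") if p.is_dir())
```

Keeping `code` as its own folder, as suggested above, makes it trivial to run `git init` there later.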
Know when to write a data set to disk. When you run code on a data set, the statistical package loads the data into the computer's memory. You don't have to write a copy to your storage location every time you perform an operation such as creating a variable, sorting, or merging. In fact, if you do, it will likely slow processing and clutter your folder organization with too many files. There is no single answer on when to write out a data file, but some key points at which you might want to do so include: imported raw data; the last derived data set generated by the end of a code file; and an analysis data set that is ready for statistical processing.
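One way to picture this, assuming a Python workflow with invented column names, is to keep every cleaning, transforming, and sorting step in memory and write to disk only at a milestone, such as the final derived data set produced by a code file:

```python
import csv

def process(rows):
    """Run several in-memory steps; nothing touches the disk here."""
    rows = [r for r in rows if r.get("age") not in ("", None)]  # clean
    for r in rows:                                              # transform
        r["age"] = int(r["age"])
    rows.sort(key=lambda r: r["age"])                           # sort
    return rows

def checkpoint(rows, path):
    """Write a data set to disk only at a meaningful milestone,
    e.g. the last derived set generated by this code file."""
    with open(path, "w", newline="") as f:
        w = csv.DictWriter(f, fieldnames=rows[0].keys())
        w.writeheader()
        w.writerows(rows)
```

Writing after every intermediate step would triple the file count here for no gain; one checkpoint at the end preserves the result while the code preserves the recipe.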
Preserve and manage your code. Keeping code files well organized and documented with comments about what each section does is integral to being able to reproduce intermediate and final data sets and results. In addition, you might need to make a change somewhere in your code based on investigation of initial distributions, model results, an error you discover, or in response to reviewer comments. As long as you have the code and know how it works, you can always rerun it starting with the raw data if necessary. Follow best practices for research software development.
Name things well. Naming is more than half the battle in both data management and coding. Give everything (folders, files, data sets, and variables) descriptive but not overly verbose names so that you, collaborators, and future users can more easily understand and navigate the project.
Managing research data effectively is crucial for ensuring its integrity, accessibility, and longevity. Here are some best practices in academic research data management:
1. Data Management Plan (DMP):
Create a comprehensive DMP outlining how you will handle data throughout the research lifecycle.
Include details on data collection methods, file formats, metadata standards, storage, backup, and sharing plans.
2. Organize Data:
Develop a consistent and intuitive folder structure for organizing data files.
Use descriptive file names and folder labels to make it easy to understand the contents.
3. Document Metadata:
Record detailed metadata for each dataset, including information about the data's origin, collection methods, variables, and any transformations applied.
Metadata should be standardized and include relevant information for reproducibility.
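One lightweight way to record such metadata, sketched in Python with illustrative fields rather than any formal metadata standard, is a JSON "sidecar" file stored next to each data set:

```python
import json
from datetime import date

def write_metadata(data_path, meta_path, *, origin, method, variables,
                   transformations):
    """Save a metadata sidecar file describing a data set.
    The field names here are examples, not a formal standard."""
    meta = {
        "data_file": data_path,
        "recorded": date.today().isoformat(),
        "origin": origin,
        "collection_method": method,
        "variables": variables,
        "transformations": transformations,
    }
    with open(meta_path, "w") as f:
        json.dump(meta, f, indent=2)
    return meta
```

For shared or deposited data, prefer whatever standardized metadata schema your repository or discipline requires over an ad hoc format like this.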
4. Version Control:
Implement version control systems (e.g., Git) to track changes made to data files and analysis scripts.
Maintain clear records of changes and updates to ensure reproducibility and transparency.
5. Backup and Storage:
Regularly backup data to prevent loss due to hardware failure or other unforeseen circumstances.
Use secure and reliable storage solutions, considering factors such as data sensitivity and access requirements.
6. Data Security and Confidentiality:
Implement appropriate security measures to protect sensitive data from unauthorized access or disclosure.
Follow institutional guidelines and data protection regulations when handling confidential or personally identifiable information.
7. Data Preservation:
Plan for long-term data preservation to ensure data remains accessible and usable over time.
Consider depositing data in trusted repositories or archives that provide persistent identifiers and adhere to data management standards.
8. Data Sharing and Accessibility:
Share data openly whenever possible to promote transparency and collaboration.
Use repositories or data platforms that support open access and provide proper attribution for shared datasets.
9. Data Ethics and Compliance:
Adhere to ethical guidelines and legal requirements related to data collection, usage, and sharing.
Obtain necessary permissions and informed consent when working with human subjects or sensitive data.
10. Training and Documentation:
Provide training and support for researchers involved in data management.
Develop documentation and guidelines outlining best practices and procedures for handling research data.
By following these best practices, researchers can enhance the quality, integrity, and impact of their research outcomes while ensuring compliance with ethical and regulatory standards.