These micro case studies break down the process of data/code sharing into its component steps and show how different researchers at the University of Sheffield have approached them.
Are you interested in contributing a micro case study to this page? Please contact rdm@sheffield.ac.uk and a member of our team will be in touch.
When sharing psychological data, it is critical to ensure the anonymity of participants. The first step is to remove any variables that could identify participants. In particular, combinations of demographic variables, such as age, gender, and ethnicity, can be problematic, especially for people at intersections of protected characteristics. I therefore restrict access to demographic variables to reasonable requests from other academics and omit these data from the set I share publicly.

Next, participants should not be able to identify their own data once it is shared online. For data collection, I typically use self-generated codes (e.g., a combination of characters and numbers based on the participant’s name and phone number) or a pre-defined code (e.g., a number between 1000 and 9999), either of which participants may still remember. I therefore replace those codes with newly generated random numbers. In R, an easy way to do this is to create a data frame with two columns, the old code and the new one, and then merge it with the data frame containing the data (e.g., using inner_join() from the dplyr package), joining by the old code. Equivalent functions exist in other scripting languages, and the same step can be carried out manually in a spreadsheet. After double-checking that the old codes do not appear in file names or other shared materials, the new, anonymised data set is ready to share!
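As a minimal sketch of the recoding step described above (the file and column names here are hypothetical, and the real lookup table must of course never be shared):

```r
library(dplyr)

# Hypothetical raw data keyed by a participant-generated code
raw <- read.csv("raw_responses.csv")   # contains a column 'old_code'

# Lookup table mapping each old code to a new random ID
# (keep this table, and any random seed, strictly private)
codes  <- unique(raw$old_code)
lookup <- data.frame(
  old_code = codes,
  new_id   = sample(10000:99999, length(codes))
)

# Join on the old code, then drop it so only the new ID remains
anonymised <- raw |>
  inner_join(lookup, by = "old_code") |>
  select(-old_code)

write.csv(anonymised, "anonymised_responses.csv", row.names = FALSE)
```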
When preparing survey data from the Migrants Essential Workers project for the ORDA repository, we created a technical report which not only contained information on survey design, sampling and fieldwork procedures, but also detailed information about the resulting datasets.
We published two data files: one containing only the valid responses of participants who answered at least 60% of the core survey questions (which we called ‘Respondents’), and a second containing all survey participants (called ‘Clickers’, as some participants only clicked through the first few questions). Additionally, most responses came from participants recruited via a dedicated Facebook advertising campaign targeting Polish migrants in the UK. As a result, some variables were created passively, as embedded data in the survey URL, rather than from questions asked of participants.
Our technical report documented the differences between the two data files, how the passive data were generated, and how a few new variables were created to improve data usability. For example, variables generated through passive data collection recorded where our respondents came from (Facebook ads, the project website, or help from partner organisations), which Facebook ad they clicked on to access the survey, and the language (Polish or English) of the questionnaire they completed.
Depositing ‘Tattoos in the Digital Panopticon Database, 1793-1925’ prompted us to evaluate our data from the perspective of a scholar coming to it for the first time. We reviewed our assumptions and internal conventions, which led us to drop or re-engineer several columns. We collaborated on the documentation using Google Docs, and in the process of writing discovered and remedied some inconsistencies, in particular with respect to some of the more nuanced column definitions.
There is approximately 4 TB of data behind our dataset. We felt that archiving this amount of data as the results of a single research paper would be infeasible, as well as potentially unhelpful. Pruning the dataset down to just 16 MB involved careful consideration of what to include, along with writing a detailed README.
Choosing what to include was influenced by the analysis performed for the accompanying journal article. The article focuses on spatially and temporally averaged parameters, so we decided the dataset should too, aiming to include enough data for the dataset to be validated against the existing literature and for our article’s conclusions to be re-derived. We reasoned that, as the CFD tools we used were both commercial and deterministic, a detailed methodological description would allow the larger dataset to be reproduced. This is not ideal, so we mitigated it by including a wider range of averaged parameters than were used in the article, in particular intermediate averaged parameters. The dataset also includes detailed specifications of the model geometry, and the README not only explains the data formatting but also lists the specific CFD modelling options used.
Part of moving to open data is publishing previously unpublished research results, as demonstrated by our recent dataset, which combines thousands of data files collected over a 25-year period. While the data were collected within broadly the same experimental framework, technological advances and the involvement of multiple researchers resulted in inconsistent file formats and organisation. We clearly needed to reorganise the data, and this process determined the file structure of the dataset.
I first identified what data should exist based on previous reports, which became a table of independent variables. Hunting through old CDs and ZIP archives, I painstakingly filled out a list of the files that should match these conditions. Once that was done, I wrote a script to read in the data and convert it to a consistent spreadsheet format. Having at some point heard the principle that a dataset should ideally be usable without its README, I decided my script would write out files in a folder structure organised by increasing importance of the variables. This ended up about four folders deep, with the final variable included in the filename. The filenames also included numeric configuration and file identifiers referring back to a main overview spreadsheet, making them easier to read programmatically.
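A minimal sketch of this kind of conversion script, written in R (the variable names, folder ordering and CSV output here are hypothetical, used purely for illustration):

```r
# Hypothetical overview table: one row per source file, with its
# independent variables and numeric configuration/file identifiers
overview <- read.csv("overview.csv")
# columns: file_id, config_id, temperature, pressure, sample, source_path

for (i in seq_len(nrow(overview))) {
  row <- overview[i, ]

  # Read the original file (formats varied in reality; CSV is assumed here)
  data <- read.csv(row$source_path)

  # Folder structure ordered by the independent variables
  out_dir <- file.path("dataset", row$temperature, row$pressure, row$sample)
  dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

  # Filename encodes the configuration and file identifiers from the overview
  out_name <- sprintf("config%03d_file%04d.csv", row$config_id, row$file_id)
  write.csv(data, file.path(out_dir, out_name), row.names = FALSE)
}
```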
During the preparation of this dataset for deposit, we enjoyed a lively debate about how best to deposit CSV (Comma-Separated Values) data. CSV has been used for decades without a formally defined standard, and strict compatibility between applications using CSV remains a problem. This led us to discover RFC 4180 (https://datatracker.ietf.org/doc/html/rfc4180), an attempt to strictly define CSV for better interoperability. Adopting RFC 4180 for this deposit, and writing some filter and converter programs for it, has alleviated a number of longstanding interoperability issues in our institute.
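As an illustration of the kind of filter we mean (a generic sketch in R following RFC 4180's conventions, not our actual converter programs):

```r
# Write a data frame as RFC 4180-style CSV: CRLF record separators,
# double-quoted fields, and embedded quotes escaped by doubling
write_rfc4180 <- function(df, path) {
  write.table(
    df, path,
    sep = ",",
    eol = "\r\n",        # RFC 4180 specifies CRLF line endings
    qmethod = "double",  # escape quotes by doubling them
    quote = TRUE,        # quote character fields
    row.names = FALSE,
    na = ""
  )
}

# Example: normalise an existing CSV into the stricter form
records <- read.csv("deposit.csv", check.names = FALSE, colClasses = "character")
write_rfc4180(records, "deposit_rfc4180.csv")
```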
This software, published on ORDA, was developed as part of the work on visualisations for the data visualisation hub (Dataviz.Shef). The software item contains both the code for the R Shiny app and the processed data needed for the app to function. The raw data originated from a dataset that is also published on ORDA, and appropriate attribution was given. The software was originally hosted on GitHub and imported into ORDA seamlessly using the GitHub integration tool provided by Figshare. It is worth noting that the item published on ORDA can be synchronised with the GitHub repository by creating a new release.
The software was licensed under CC BY 4.0, which means it can be shared and adapted freely with appropriate attribution. The ORDA record can be found here.
FLAMEGPU is an open-source framework for developing large-scale GPU-accelerated complex system simulations. Code development has been version controlled via a GitHub repository. To enable the software to be cited, we set up Zenodo DOIs for the repository. We did this by logging into Zenodo with a GitHub account (with control of the repository), navigating to the GitHub integration page and enabling support for the repo. Once configured, Zenodo automatically generates a unique DOI every time a new release is created on GitHub. Additionally, Zenodo provides a “concept” DOI which always redirects to the latest release’s DOI. Zenodo also generates a badge that we have included in our repository’s README to clearly display the latest DOI. Likewise, we maintain a CITATION.cff file which lists the official authors of the software and their ORCIDs. Whilst the CITATION.cff requires manual updating with each release, in combination with Zenodo it provides a low-maintenance strategy for publishing our software and allowing others to cite it.
An important aspect of sharing code or datasets is ensuring that other users will understand what you have shared and how to make use of it. Well-commented code can help a lot with the “how”, but the “what” is often best communicated through the metadata: the title, description, keywords and images in the ORDA submission.
I usually approach the description as if it were the abstract of a paper. In the case of dispersion.m, the script carries out signal processing of experimental measurements, so I first described why this processing is required and the physical phenomena it accounts for. I then moved on to how the code operates, with an ordered list of the main processes that references the key variables and methods used. I keep this at a high level, as more detail is included in the code itself and in the referenced papers, should the user decide the code is what they are looking for.
The title "dispersion.m - A MatLab script for phase angle and amplitude correction of pressure bar signals" describes both the name of the main script and a summary of what the code can be used for. I also usually provide an image that shows an example of the output or demonstrates the processing that takes place. This provides an immediate visual representation of the code and acts as a thumbnail image for the submission.
When depositing recent datasets to ORDA, I included README files to enable other researchers to understand each dataset sufficiently to make use of it.
My research is industry-facing, which can limit how much raw data can be shared. When writing a README file, I make sure to be clear about what is and isn’t included in the deposit. In my recent ORDA deposits, the READMEs also clarify what data was collected, the file naming conventions used, the units of measurement, and which aspects of the associated publications the data underpins. This hopefully makes the dataset as useful as it can be to other researchers who may be interested in it, and might also lead to collaboration in the future. Being consistent about processes such as coding and anonymising data during the research itself also helped when compiling my READMEs, as it provided a consistent set of practices to outline and avoided the need to carry out these activities retrospectively.
In my research, it’s sometimes not possible to share a raw dataset in its entirety. This can be for a number of reasons, including the size of the dataset, the industry-facing nature of the research, and the data collection rules in place at the time the research was conducted.
In recent instances where it has not been possible to share all of the raw data for a project, I have chosen to share a sample of the data that supports the analyses presented in associated publications (for example, in the dataset Unintended impacts of chlorine on drinking water quality and biofilms). I have also made use of other data repositories where these are more suitable, and then provided a link in the ORDA record (e.g. using the NCBI database for storing DNA sequencing information). This means that the claims made in publications are fully evidenced by the data, making the research transparent and accountable, as well as giving a sense of the dataset as a whole.
I have been working with South Yorkshire Police (SYP) to understand the effect of light and lighting on crime rates. The method we developed for this research requires knowing the specific location and time of a crime. Accessing these details for crime data is problematic, as it poses anonymity and confidentiality issues, and South Yorkshire Police quite rightly were unable to share such data with me as an outside researcher. We therefore developed an approach that involved creating fictional crime data based on the structure and format of the actual data held by SYP. This fictional data was used to develop a data processing and analysis script in R. I shared this script with SYP, who were then able to run it on their data and provide me with the aggregated, anonymised results. This overcame the confidentiality and data protection issues that direct access to the raw crime data would have raised. It also helped protect against potential criticisms of p-hacking, data fishing and HARKing, as I did not have access to the data before preparing the hypotheses and approach to the analyses. We have submitted this work as a Registered Report to PLOS ONE (currently under review). To support this planned publication, the fictional datasets and the R script will be shared openly.
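As a purely illustrative sketch of this pattern (the column names, coordinates and aggregation below are hypothetical, not the actual SYP data structure or our analysis script):

```r
# Generate fictional crime records mirroring the structure of the real data
set.seed(1)
n <- 1000
fictional <- data.frame(
  crime_id = sprintf("F%04d", 1:n),
  datetime = as.POSIXct("2019-01-01", tz = "UTC") + runif(n, 0, 365 * 24 * 3600),
  easting  = runif(n, 430000, 445000),   # made-up coordinates
  northing = runif(n, 380000, 395000),
  offence  = sample(c("burglary", "vehicle", "criminal damage"), n, replace = TRUE)
)

# Analysis developed against the fictional data; the analysts holding the
# real data run the same function and return only aggregated counts
aggregate_by_hour <- function(crimes) {
  crimes$hour <- as.integer(format(crimes$datetime, "%H"))
  aggregate(crime_id ~ offence + hour, data = crimes, FUN = length)
}

results <- aggregate_by_hour(fictional)
head(results)
```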
Micro-case study to follow