Preparing tidy data for sharing and analysis

The goals of this guide are to enhance the scientific rigor, transparency, facilitate reproducibility, and speed turnaround time for statistical consultation.

Two essential things to send when sharing data

1. Tidy data (ideally without personal identifying information)

2. A code book for the tidy data set

Optional things to send when sharing data

3. Raw data (and soft copies of data collection forms if available)

4. Documentation explaining how raw data are transformed to tidy data, including any restrictions and new variable creation (which can be described with variable definition sheets, VDS) that describe how raw data are related to tidy data.

Details

Tidy data

Tidy data are more than cleaned data, ready for analysis. Tidy data are formatted a particular way. Tidy data basic requirements are:

  1. One column per variable. Each variable is contained in one column.
  2. One row per observation. Each different observation of that variable should be in a different row.
    • Said another way: tidy data are long (not wide). If the same variable is observed twice or more times for each person in a study all of the observations of that variable are stored in one column. This also means that each row should be uniquely identifiable with a subject (person) and time or occasion of measurement variable.
  3. Identifying variables are identified. In this case by "identifying" I mean something like a study identifier number, as opposed to personal health identifying information (name, patient ID number, medical record number, social security number, telephone number, etc.) There must be a set of variables that allow the linking of observations across multiple tables.
  4. Missing values are encoded to "NA" or system missing values.
  5. A codebook is provided. This will include variable meta-data, including:
    • Variable labels. Include a table that lists variable names and full descriptive labels with the specificity and detail that would be included as a label in a figure in a publication table or figure (e.g., include units, direction, such as "Geriatric depression scale [0-15, higher worse]", or "Body mass index [kg/m2]").
    • Value labels. Include a table that lists variable value labels; i.e., the verbal descriptors attached to nominal (e.g., male: 0=female, 1=male) or ordinal (e.g., LowEnergy, 0=no, 1=some, 2=a lot, 3=extremely).

If the tidy data shared as SPSS, SAS or Stata format, then requirement #5 can be met with fully specified format files or formatted data with variable and value labels. But if the data are shared as EXCEL files, CSV files, or R data files, then separate tables with variable and value labels must be provided.

Code book

For almost any data set, the measurements you calculate will need to be described in more detail than you can or should sneak into the spreadsheet. The code book contains this information. At minimum it should contain:

  1. Information about the variables (including units) in the data set not contained in the tidy data
  2. Information about the summary choices you made
  3. Information about the experimental study design you used

If the project has followed our advice and created derived variables using variable definition sheets (VDS), the including these variable definition sheets should be sufficient as the above information should be included in a fully specified VDS.

Format

It does not matter if the data are formatted for a specific software package (SAS, SPSS, Stata, R) or sent in an Excel spreadsheet, or a comma delimited text file (CSV). It is probably better to send a formatted data set with variable and value labels.

Raw data

Include the rawest form of the data possible. This is data as close to the initial recording of observations as can be stored and shared electronically. Data without restrictions, modifications, etc.

Personal identifying information

The only exception is: consider, if possible, omitting personal identifying information (see this advice from the HHS). This includes but is not limited to names, dates, social security numbers, medical record numbers, telephone numbers. If personal identifying information is shared, special permissions will need to be secured.

Example

Here is an example of a tidy data data set prepared in Excel, the way QSP will like it. Notice that there are two worksheets, one called "Data" the other "Dictionary".

https://s3.amazonaws.com/quantsci/Misc/Auto_Stata.xlsx

The Dictionary column names come from a standard REDCap data dictionary (but the order is modified). Just ignore columns that are confusing or nonsensical.

Notice in the Dictionary how value label information is conveyed: The values for the variable "foreign", which has observed values of 0 and 1, is provided "0, Domestic | 1, Foreign". So that is, in quotes, <first value><comma><space><descriptor more than one word is fine><pipe or bar character><second value><comma> . . .

Acknowledgements

The content on this page is almost entirely due to Jeff Leek and colleagues (https://github.com/jtleek/datasharing, and Leek, J. (2015). Tidying the data. In J. Leek (Ed.), The Elements of Data Analytic Style. Baltimore: Leanpub. https://leanpub.com/datastyle). It was adapted by Rich Jones for QSP.

Where these recommendations differ from Jeff Leek's

  • I add the specification that personal identifying information should be omitted from raw data
  • Jeff Leek (JL) want's data in EXCEL workbooks to only use the first spreadsheet, QSP does not make this a requirement.
  • JL wants nominal ordinal data encoded in strings. But I think that can be dangerous if the strings are ambiguously ordered (as can sometimes occur with Likert scale responses, or compound questions). So a whole section from JL's guide "How to code variables" is not represented in this document, and those recommendations are not carried forward here. But I do add the requirement that the label of numeric values be supplied.
  • JL wants meaningful variable names. He suggests that tidy data should include a row at the top of each data table/spreadsheet that contains full row names (e.g., use AgeAtDiagnosis instead of ADx. I suggest instead that a table of variable labels be included (or that variable labels be assigned in the source data set if such is possible. Users of SPSS, SAS, and Stata will be familiar with variable labels. Old school R users suffer from a lack of a capability of incorporating variable labels in base R. I believe that is the source of JL's requirement for tidy data. I don't think a descriptive variable name is a substitute for a label, even in R. Not having labels that would suit for tables and figures can slow down the process of statistical consulting. These should be worked out before data are shared and included with a tidy data set.
  • I also omit JL's section on including the inclusion script.
  • JL has as a requirement that different data types should be organized in separate tables. We do not make this a requirement.
  • I add a requirement for missing data handling in TIDY data

Another great description of tidy data can be found here (http://vita.had.co.nz/papers/tidy-data.pdf)

Rich Jones

11 October 2019