The goals of this guide are to enhance scientific rigor and transparency, facilitate reproducibility, and speed turnaround time for statistical consultation.
Please provide the following:
1. Tidy data (ideally without personal identifying information)
2. A code book for the tidy data set
3. Raw data (and soft copies of data collection forms if available)
4. Documentation explaining how raw data are transformed into tidy data, including any restrictions and new variable creation (which can be described with variable definition sheets, VDS).
Tidy data are more than cleaned data ready for analysis; they are formatted in a particular way. The basic requirements for tidy data are:
If the tidy data are shared in SPSS, SAS, or Stata format, then requirement #5 can be met with fully specified format files or formatted data with variable and value labels. But if the data are shared as Excel files, CSV files, or R data files, then separate tables with variable and value labels must be provided.
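For data shared as CSV, the separate label tables might look something like the sketch below. It uses only Python's standard library. The rows, the column names (`id`, `price`, `foreign`), the label wording, and the table layouts are all illustrative assumptions, not a format required by QSP.

```python
import csv
import io

# Illustrative tidy data: one row per observation, one column per variable.
# The rows below are made up for this sketch.
data_rows = [
    {"id": 1, "price": 4099, "foreign": 0},
    {"id": 2, "price": 5799, "foreign": 1},
]

# Variable labels: one row per variable in the data table (assumed layout).
variable_labels = [
    {"variable": "id", "label": "Record identifier"},
    {"variable": "price", "label": "Price (US dollars)"},
    {"variable": "foreign", "label": "Car origin"},
]

# Value labels: one row per coded value of each categorical variable (assumed layout).
value_labels = [
    {"variable": "foreign", "value": 0, "label": "Domestic"},
    {"variable": "foreign", "value": 1, "label": "Foreign"},
]


def to_csv_text(rows):
    """Serialize a list of dicts to CSV text, using the first row's keys as the header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def unlabeled_variables(data, labels):
    """Return the data columns that have no entry in the variable-label table."""
    labeled = {row["variable"] for row in labels}
    return [name for name in data[0] if name not in labeled]


data_csv = to_csv_text(data_rows)
variable_labels_csv = to_csv_text(variable_labels)
value_labels_csv = to_csv_text(value_labels)
```

A check such as `unlabeled_variables(data_rows, variable_labels)` is one way to confirm, before sending, that every column in the data table has a label.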
For almost any data set, the measurements you calculate will need to be described in more detail than you can or should sneak into the spreadsheet. The code book contains this information. At minimum it should contain:
If the project has followed our advice and created derived variables using variable definition sheets (VDS), then including these variable definition sheets should be sufficient, as the above information should be included in a fully specified VDS.
It does not matter whether the data are formatted for a specific software package (SAS, SPSS, Stata, R) or sent as an Excel spreadsheet or a comma-delimited text file (CSV), although a formatted data set with variable and value labels is probably better.
Include the rawest form of the data possible. This is data as close to the initial recording of observations as can be stored and shared electronically, without restrictions, modifications, and so on.
The only exception: if possible, consider omitting personal identifying information (see this advice from the HHS). This includes but is not limited to names, dates, social security numbers, medical record numbers, and telephone numbers. If personal identifying information is shared, special permissions will need to be secured.
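As a rough illustration of omitting identifiers before sharing, the sketch below drops a set of identifier columns from each record. The column names and the records are hypothetical and made up for this example. Removing columns this way is only a first step and does not by itself guarantee that the data are de-identified (dates and free-text fields, for instance, need separate review).

```python
# Hypothetical column names that hold direct identifiers; adjust to your data.
IDENTIFIER_COLUMNS = {"name", "date_of_birth", "ssn", "mrn", "phone"}


def drop_identifiers(rows, identifier_columns=IDENTIFIER_COLUMNS):
    """Return copies of the rows with the identifier columns removed."""
    return [
        {key: value for key, value in row.items() if key not in identifier_columns}
        for row in rows
    ]


# Made-up raw records for illustration only.
raw_rows = [
    {"study_id": 101, "name": "Jane Doe", "mrn": "000123", "phone": "555-0100", "sbp": 128},
    {"study_id": 102, "name": "John Roe", "mrn": "000456", "phone": "555-0101", "sbp": 141},
]

# The original records are left untouched; only the copies lose the identifiers.
shareable_rows = drop_identifiers(raw_rows)
```

Keeping a non-identifying study ID (here the assumed `study_id` column) lets the shared records be linked back to the source by whoever holds the key, without exposing names or record numbers.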
Here is an example of a tidy data set prepared in Excel, the way QSP likes it. Notice that there are two worksheets, one called "Data" and the other "Dictionary".
https://s3.amazonaws.com/quantsci/Misc/Auto_Stata.xlsx
The Dictionary column names come from a standard REDCap data dictionary (but the order is modified). Just ignore columns that are confusing or nonsensical.
Notice in the Dictionary how value label information is conveyed: the value labels for the variable "foreign", which has observed values of 0 and 1, are provided as "0, Domestic | 1, Foreign". That is, in quotes: <first value><comma><space><descriptor, more than one word is fine><pipe or bar character><second value><comma> . . .
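To make the value label format concrete, here is a minimal sketch of how a string such as "0, Domestic | 1, Foreign" could be parsed into a lookup table. The function name `parse_value_labels` is hypothetical, and the handling of stray pipes and of commas inside descriptors is an assumption of this sketch rather than a documented rule of the format.

```python
def parse_value_labels(spec):
    """Parse a value label string into a dict mapping each value to its descriptor.

    Values are kept as strings exactly as written in the dictionary entry.
    """
    labels = {}
    for item in spec.split("|"):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing or doubled pipe (assumption)
        # Split on the first comma only, so a descriptor may itself contain commas.
        value, separator, descriptor = item.partition(",")
        if not separator:
            raise ValueError(f"Missing comma in value label entry: {item!r}")
        labels[value.strip()] = descriptor.strip()
    return labels


foreign_labels = parse_value_labels("0, Domestic | 1, Foreign")
```

Because the values are kept as strings, a numeric data column would need its values converted to strings (or the keys converted to numbers) before looking up labels.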
The content on this page is almost entirely due to Jeff Leek and colleagues (https://github.com/jtleek/datasharing, and Leek, J. (2015). Tidying the data. In J. Leek (Ed.), The Elements of Data Analytic Style. Baltimore: Leanpub. https://leanpub.com/datastyle). It was adapted by Rich Jones for QSP.
Where these recommendations differ from Jeff Leek's
Another great description of tidy data can be found here (http://vita.had.co.nz/papers/tidy-data.pdf)
Rich Jones
11 October 2019