Collect & Capture

Secondary data

Data should be managed so that any researcher can discover, use and interpret the data after a period of time has passed.

Making use of data in this way falls under the domain of secondary data.

To prepare data for secondary research, researchers should document data appropriately. They should also explain the procedures and fieldwork methods, the objectives and methodology of the research, and explicitly describe the meanings of variables and codes used. Additionally, they should describe any derivation, transformations, de-identification (pseudonymisation/anonymisation) or data cleaning carried out.

They should also ensure that data are held in an organised manner. Documentation is invaluable in enabling secondary users to contextualise data and conduct better, informed re-use of the material. UK Data Service

MANTRA - John MacInnes - Primary data versus secondary data

4 May 2012

MANTRA - John MacInnes - Issues with secondary data

4 May 2012

Research data formats

A file format is a standard way that information is encoded for storage in a computer file. It specifies how bits are used to encode information in a digital storage medium. File formats may be either proprietary or free. Wikipedia

Data files should be clearly named, well organised, structured, quality-assured and version-controlled throughout the research. It is vital to develop suitable procedures before data gathering starts in order to adhere to any conventions, instructions, guidelines or templates that will help to ensure quality and consistency across a data collection. UK Data Service

A file format describes how information is stored within a digital file. Although each file format is unique, different file formats exist for similar types of information (e.g. text can be stored in a plain text file as well as in a word file).

On most computer systems, the format of a file is indicated by the ‘extension’ in the filename (e.g. .txt, .csv). The extension provides an immediate clue about the type of data within a file. For example, we expect that a file with a .jpg extension is an image, whereas a .docx file should contain formatted text.
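As a minimal sketch of how an extension hints at a file's content, the Python snippet below looks up the extension in a small table. The table `EXTENSION_HINTS` and the function `describe_extension` are hypothetical names introduced for illustration, and the extension is only a clue: it does not guarantee what the file actually contains.

```python
from pathlib import Path

# Hypothetical lookup table: extend it with the formats used in your own project.
EXTENSION_HINTS = {
    ".txt": "plain text",
    ".csv": "tabular data (comma-separated values)",
    ".jpg": "image",
    ".docx": "formatted text (word-processor document)",
}


def describe_extension(file_name: str) -> str:
    """Return the kind of data suggested by the file extension."""
    # Path.suffix returns the final extension including the dot, e.g. ".csv".
    suffix = Path(file_name).suffix.lower()
    return EXTENSION_HINTS.get(suffix, "unknown")


print(describe_extension("interview_notes.DOCX"))
print(describe_extension("readings.csv"))
```

The lookup is case-insensitive, so .DOCX and .docx are treated alike; extensions missing from the table are reported as "unknown".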

Simple vs complex formats: e.g. the .txt format is a very simple way of storing text, while a .docx file has more complex properties.

Examples of recommended file formats for different types of data can be found via:

DANS, File formats
UK Data Service, Recommended formats

Choosing the file format

The format of the electronic data files you work with during your research may be determined by the research equipment and computer hardware and software that you have access to. However, for long-term preservation and ease of sharing, best practices may dictate that the files be converted to a different format after your project has ended. Give some thought to this eventuality at the outset. Considerations include:

Will your data be in a format that requires proprietary software to access it?
If you will be depositing your data in a repository at the end of your project, does the repository have specific guidelines or requirements with respect to file format?
What features of your data might be lost or modified in the conversion to another file format?

Stanford University Libraries - Data Management Services provides a useful overview of preferred file formats. From the Stanford resource:

Containers: TAR, GZIP, ZIP
Databases: XML, CSV
Geospatial: SHP, DBF, GeoTIFF, NetCDF
Moving images: MOV, MPEG, AVI, MXF
Sounds: WAVE, AIFF, MP3, MXF
Statistics: ASCII, DTA, POR, SAS, SAV
Still images: TIFF, JPEG 2000, PDF, PNG, GIF, BMP
Tabular data: CSV
Text: XML, PDF/A, HTML, ASCII, UTF-8
Web archive: WARC

Additional helpful guidelines for selecting file formats can be found at these websites:

Choosing formats (Cambridge University Libraries - Data Management)
File formats (Cornell University - Research Data Management Service Group)
Library of Congress File Formats (Library of Congress)

The WHY questions

Research data are very much about when they are used as well as what they constitute and the purpose for which they are to be used. UK Data Service

Data Capture

Why document data?

Now where did I put that file?

Finding and reusing your data will be easier, both for you and for other researchers, if you give a little thought early in the process to how you will name your data files and what file formats you will use to store your data. If you are planning to archive or share your data, you will also want to consider best practices for describing your data.

A crucial part of ensuring that research data can be shared and reused by a wide range of researchers for a variety of purposes is by taking care that those data are accessible, understandable and (re)usable.

This requires clear and detailed data description and annotation. Besides the information that is needed to reuse the data, data also need to be accompanied by information for citing and discovering the data.

Documentation is the comprehensive description of the data and the contextual information that future researchers need to understand and use the data.

Documentation deposited alongside data files should enable users, with no prior knowledge of the research project and data collected, to understand exactly how the research was carried out and what the data mean, in order to (re)use the data correctly in their respective projects and for their respective purposes.

Original researchers wishing to return to their data some time later, or new users wanting to use data, need sufficient contextual and explanatory information to make sense of those data.

Research data should always be accompanied with documentation because it:

Enables you to understand/interpret data later
Makes data independently understandable, i.e. reusable
Makes results independently reproducible, starting from raw data
Helps avoid incorrect use/misinterpretation

As such, documentation is an essential step in making your data FAIR.

File Naming

A File Naming Convention (FNC) is a framework for naming your files in a way that describes what they contain and how they relate to other files.

A file naming convention (FNC) can help you stay organized by making it easy to identify the file(s) that contain the information that you are looking for just from its title and by grouping files that contain similar information close together. A good FNC can also help others better understand and navigate through your work. Purdue University

File name elements

Researchers are advised to decide on a naming convention for files at the start of the research project.

File names can be constructed using the following elements:

Project acronym
Content description
Date
Location
Creator name/initials
Status information (i.e. draft or final)

When combining these elements, follow these guidelines:

Give files a meaningful name. A file name might include a combination of elements, such as type of equipment used, date, and researcher's surname.
Decide on the best order for elements in a file name; it will affect how the files are sorted.
Keep names a reasonable length; some applications won't work well with long file names. A maximum of 25 characters is a good rule of thumb.
To separate elements in a file name, consider using underscores (_) or hyphens (-). Avoid using blank spaces in a file name. Use periods only to separate the file name from the file type extension (.txt, .jpg, etc.)
If including date as part of the file name, use the standard format yyyymmdd to ensure that files sort in chronological order.
If your file name will include a numerical component, such as a subject number or version number, use leading zeros (001, 002, etc.) so that files sort in sequential order.
Avoid special characters like ~ ! @ # $ % ^ & * ( ) ` ; < > ? , [ ] { } ‘ “

Example: CONS_INT1_12-03-2019.rtf.
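The naming guidance above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function `build_file_name`, the element order (project, content, date, version) and the sample values are hypothetical and not part of any cited convention.

```python
from datetime import date


def build_file_name(project: str, content: str, created: date,
                    version: int, extension: str) -> str:
    """Join name elements with underscores, using yyyymmdd and a zero-padded version."""
    # Assumed element order: project acronym, content description, date, version.
    parts = [project, content, created.strftime("%Y%m%d"), f"v{version:03d}"]
    for part in parts:
        # Spaces are discouraged; periods are reserved for the file extension.
        if " " in part or "." in part:
            raise ValueError(f"avoid spaces and periods in name elements: {part!r}")
    return "_".join(parts) + "." + extension.lstrip(".")


# Hypothetical sample values, chosen only to show the sort order.
names = [
    build_file_name("PROJ", "INT1", date(2019, 3, 12), 10, "rtf"),
    build_file_name("PROJ", "INT1", date(2019, 3, 12), 2, "rtf"),
    build_file_name("PROJ", "INT1", date(2018, 11, 5), 1, "rtf"),
]
for name in sorted(names):
    print(name)
```

Because the date is written as yyyymmdd and the version number carries leading zeros, a plain alphabetical sort lists the files in chronological and then sequential order.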

Version Control

Versioning refers to saving new copies of your files when you make changes so that you can go back and retrieve specific versions of your files later. Saving multiple versions makes it possible to decide at a later time that you prefer an earlier version. You can then immediately revert back to that version instead of having to retrace your steps to recreate it. University of Pittsburgh

Version control is a good research practice in the collaborative research environment. UK Data Service

When you work with different versions of a file, it can be a challenge to locate the 'correct' version or to know how versions differ from each other. If not done well, it can even be difficult to know which file preceded the other.

The matter is complicated further when files are kept in multiple locations and multiple users edit these files. To avoid confusion and safeguard against accidental loss, a versioning system can be put in place.

Example:

HealthTest-2008-04-06.docx
HealthTest-v02.docx

In its most basic form, versioning relies on a sequential numbering system. Within a given version number category (major, minor), these numbers are generally assigned in increasing order and correspond to changes in the data. The US Geological Survey recommends the following structure:

DataFileName_1.0 = original document

DataFileName_1.1 = original document with minor revisions

DataFileName_2.0 = document with substantial revisions
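The major/minor numbering shown above can be sketched in Python. The function `next_version` is a hypothetical helper, and it assumes the name ends in `_<major>.<minor>` with no file extension, exactly as in the three examples; it is not an official US Geological Survey tool.

```python
def next_version(name: str, major: bool = False) -> str:
    """Return the next version label for a name ending in '_<major>.<minor>'."""
    # Split off the trailing version, e.g. "DataFileName_1.0" -> "DataFileName", "1.0".
    base, _, version = name.rpartition("_")
    major_no, minor_no = (int(number) for number in version.split("."))
    if major:
        # Substantial revision: increase the major number and reset the minor number.
        return f"{base}_{major_no + 1}.0"
    # Minor revision: increase only the minor number.
    return f"{base}_{major_no}.{minor_no + 1}"


print(next_version("DataFileName_1.0"))
print(next_version("DataFileName_1.1", major=True))
```

A name that does not end in this pattern raises a ValueError, which makes unexpected file names visible rather than silently renumbering them.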

MANTRA - Richard Rodger - Organising data

13 May 2014

MANTRA - Richard Rodger - Organising data (short)

1 May 2020

MANTRA - Stephen Lawrie - File transformation

30 May 2014

MANTRA - Jeff Haywood - Importance of good file management in research

30 November 2011

MANTRA - Lynn Jamieson - Documenting research data

4 May 2012

MANTRA - Lynn Jamieson - Importance of documenting data in research

4 May 2012

John MacInnes - Tips on Documentation

3 June 2014

MANTRA - John MacInnes - Data documentation in secondary data analysis

4 May 2012

Metadata

What is metadata?

Metadata is "data that provides information about other data", but not the content of the data, such as the text of a message or the image itself. There are many distinct types of metadata, including: Descriptive metadata – the descriptive information about a resource. It is used for discovery and identification. Wikipedia

Why is metadata so important?

Librarians and publishers assign metadata to data in accordance with international standards, including ISO standards.

The data descriptors include:

Name of the data set
Description of the digital context
Brief description of the data
Any existing data or third-party sources that will be used
Content, type and coverage

ISO 19115 for geographic information

Metadata standards & schemes

Dublin Core

Data Documentation Initiative (DDI)

Statistical Data and Metadata eXchange (SDMX)

Metadata Encoding and Transmission Standard (METS)

General International Standard Archival Description (ISAD(G))

DataCite metadata schema for the publication and citation of digital datasets with a persistent identifier.
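To give a feel for what a structured metadata record looks like, here is a minimal Python sketch that writes a few Dublin Core elements as XML. The element names (title, creator, date, format, rights) and the namespace come from the Dublin Core element set; the wrapper element `metadata`, the function `dublin_core_record` and all sample values are hypothetical. A repository will usually prescribe its own required fields and layout.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields: dict) -> str:
    """Serialise a mapping of Dublin Core element names to values as XML."""
    root = ET.Element("metadata")  # hypothetical wrapper element
    for element, value in fields.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{element}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample values for illustration only.
record = dublin_core_record({
    "title": "Example interview transcripts",
    "creator": "Example, Researcher",
    "date": "2019-03-12",
    "format": "text/csv",
    "rights": "CC BY 4.0",
})
print(record)
```

Keeping the description in named, standardised elements rather than free text is what allows catalogues and search services to index and cite the data set.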

How will you ensure data quality?

Librarians, publishers and data scientists make use of internationally recognised metadata practices, which help to ensure that the data are described well enough to:

be read and interpreted in the future
consistently follow file naming conventions
follow version history and dates
list related data sets
validate the data
perform basic quality assurance and quality control on the data throughout the research project
identify values that are estimated
double-check data that are entered by hand (preferably entered by more than one person)
use quality level flags to indicate potential problems
check the format of the data is consistent across the data set
identify all stakeholders
identify data instruments, spatial descriptors
identify how to cite the data/data set
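Two of the checks listed above, double-checking hand-entered data and checking that formats are consistent, can be sketched in Python. The function names and the yyyy-mm-dd date pattern are assumptions made for this example; adapt them to the conventions of your own data set.

```python
import re

# Assumed convention for this sketch: dates are recorded as yyyy-mm-dd.
DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def double_entry_mismatches(first_entry, second_entry):
    """Return the row indices where two independent keyings of the same data disagree."""
    if len(first_entry) != len(second_entry):
        raise ValueError("both entries must cover the same number of rows")
    return [index for index, (a, b) in enumerate(zip(first_entry, second_entry))
            if a != b]


def inconsistent_dates(values):
    """Return the values that do not follow the assumed yyyy-mm-dd pattern."""
    return [value for value in values if not DATE_PATTERN.match(value)]


# Hypothetical values keyed in by two different people.
print(double_entry_mismatches(["12", "7", "30"], ["12", "9", "30"]))
print(inconsistent_dates(["2019-03-12", "12/03/2019", "2019-04-01"]))
```

Rows flagged by either check are candidates for review against the original source, and could be marked with a quality level flag rather than silently corrected.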

Metadata represents data about data. Metadata enriches the data with information that makes it easier to find, use and manage. For instance, HTML tags define layout for human readers. Semantic metadata helps computers to interpret data by adding references to concepts in a knowledge graph.

Metadata are an important subset of core data documentation

Collating and recording metadata is important for the purposes of cataloguing, citing, discovering and retrieving data collections. Metadata are a subset of core data documentation providing standardised, structured information.

Metadata are intended for reading by machines, and help to explain the purpose, origin, time references, geographic location, creator, access conditions and terms of use of a data collection. Without this essential documentation, collections become of limited value simply because researchers and reusers will not be able to search for or cite the data collection. UK Data Service

Data Sharing

Do your chosen formats and software enable sharing and long-term access to the data?

Collaborative research

Collaborative research brings additional data management challenges for providing shared storage, access and the transfer of research data across the various partners or institutions. UK Data Service

Accessible data & authentication

Typical requirements for researchers working in a collaborative environment include:

Storage and the sharing of documents, plus data files.
The ability to organise documents and data files into folders.
An access control system, which allows authentication and authorisation to be easily managed.
Version control of documents and data files.
File locking to prevent users from simultaneously working on the same file.
Ideally, a discussion platform utilising a forum or wiki format.

Choose the Creative Commons license which is right for you!

Why consider data copyright?

When data are shared or archived, the original copyright owner retains the copyright. UK Data Service

A data archive cannot archive data unless all rights holders are identified and give their permission for the data to be shared. Secondary users need to obtain copyright clearance before data can be reproduced. However, exceptions exist under the fair dealing concept. UK Data Service

Research Data ownership

SPARC

SPARC aims to help its members and the broader community regain and maintain community ownership over data and data infrastructure.

Creative Commons is a nonprofit organization that helps overcome legal obstacles to the sharing of knowledge and creativity to address the world’s pressing challenges.

In the traditional publication process, authors transfer the copyright in their work to the publisher when the article is published.

However, when authors publish their work via the Open Access process, they retain the copyright of that work. It is important that authors assign a Creative Commons license to determine how their work may be used and shared.

The WHEN questions

Research data are very much about when they are used as well as what they constitute and the purpose for which they are to be used. UK Data Service

Research Lifecycle

Data is collected, captured, managed, stored and preserved throughout the research process.

The HOW questions

What standards or methodologies will be used?

How data is managed depends on the types of data involved, how data is collected and stored, and how it is used throughout the research lifecycle.

Observational

Observational data is

captured in real time
usually unique
irreplaceable
e.g. brain images, survey data

Survey Data

Qualitative data

Derived or compiled

Derived or compiled data results from processing or combining 'raw' data. It is often reproducible, but expensive to recreate, e.g. compiled databases, text mining, aggregate census data. UK Data Service

Reference or canonical

Reference or canonical data is a (static or organic) conglomeration or collection of smaller (peer reviewed) datasets, most probably published and curated, e.g. gene databanks, crystallographic databases. UK Data Service

Experimental data

Experimental data is

data from experimental results, e.g. from lab equipment
often reproducible, but can be expensive e.g. chromatograms, microassays

Geographic Information System (GIS)

Mapping & longitudinal

Geodata

Geography and data

Computational social science

Simulation

Simulation data is data generated from test models, where the model and metadata may be more important than the output data from the model, e.g. economic or climate models.