Collect & Capture
Research data are very much about when they are used as well as what they constitute and the purpose for which they are to be used. UK Data Service
The WHAT questions
New to using data? See the UK Data Service for assistance.
Types of data
Research data exist in many different forms: textual, numerical, databases, geospatial, images, audio-visual recordings and data generated by machines or instruments. Digital data exist in specific file formats, which are coded so that a software programme can read and interpret these data.
Using standard and interchangeable or open lossless data formats helps ensure longer-term usability of data. For long-term preservation, digital data are converted to such formats. UK Data Service
Research data can be classified in different ways, for example based on their:
Content: numerical, textual, audiovisual, multimedia, models, computer programmes
Format: spreadsheets, databases, images, maps, audio files, (un)structured text, original literary, dramatic, musical or artistic works, sound recordings, films, broadcasts
Mode of data collection: experimental, observational, simulation, derived/compiled from other sources
Digital (born-digital or digitized) or non-digital nature (e.g. paper surveys, notes…)
Primary (generated by the researcher for a particular research purpose or project) or secondary nature (originally created by someone else for another purpose)
Situational data: images or video that already exist but, when used in research, become situational data for that researcher. Situational data can also be created by one set of researchers for one purpose and used by another set of researchers at a later date for a completely different research agenda
Raw or processed nature
Secondary data
Data should be managed so that any researcher can discover, use and interpret the data after a period of time has passed.
Making use of data in this way falls under the domain of secondary data.
To prepare data for secondary research, researchers should document data appropriately. They should also explain the procedures and fieldwork methods, the objectives and methodology of the research, and explicitly describe the meanings of variables and codes used. Additionally, they should describe any derivation, transformations, de-identification (pseudonymisation/anonymisation) or data cleaning carried out.
They should also ensure that data are held in an organised manner. Documentation is invaluable in enabling secondary users to contextualise data and conduct better, informed re-use of the material. UK Data Service
MANTRA - John MacInnes - Primary data versus secondary data
4 May 2012
MANTRA - John MacInnes - Issues with secondary data
4 May 2012
Research data formats
A file format is a standard way that information is encoded for storage in a computer file. It specifies how bits are used to encode information in a digital storage medium. File formats may be either proprietary or free. Wikipedia
Data files should be clearly named, well organised and structured, and quality- and version-controlled throughout the research. It is vital to develop suitable procedures before data gathering starts in order to adhere to any conventions, instructions, guidelines or templates that will help to ensure quality and consistency across a data collection. UK Data Service
A file format describes how information is stored within a digital file. Although each file format is unique, different file formats exist for similar types of information (e.g. text can be stored in a plain text file as well as in a Word file).
On most computer systems, the format of a file is indicated by the ‘extension’ in the filename (e.g. .txt, .csv). The extension provides an immediate clue about the type of data within a file. For example, we expect that a file with a .jpg extension is an image, whereas a .docx should contain formatted text.
Simple vs complex formats: e.g. the .txt format is a very simple way of storing text, while a .docx file has more complex properties.
Examples of recommended file formats for different types of data can be found via:
DANS, File formats
UK Data Service, Recommended formats
Choosing the file format
The format of the electronic data files you work with during your research may be determined by the research equipment and computer hardware and software that you have access to. However, for long-term preservation and ease of sharing, best practices may dictate that the files be converted to a different format after your project has ended. Give some thought to this eventuality at the outset. Considerations include:
Will your data be in a format that requires proprietary software to access it?
If you will be depositing your data in a repository at the end of your project, does the repository have specific guidelines or requirements with respect to file format?
What features of your data might be lost or modified in the conversion to another file format?
Stanford University Libraries - Data Management Services provides a useful overview of preferred file formats. From the Stanford resource:
Containers: TAR, GZIP, ZIP
Databases: XML, CSV
Geospatial: SHP, DBF, GeoTIFF, NetCDF
Moving images: MOV, MPEG, AVI, MXF
Sounds: WAVE, AIFF, MP3, MXF
Statistics: ASCII, DTA, POR, SAS, SAV
Still images: TIFF, JPEG 2000, PDF, PNG, GIF, BMP
Tabular data: CSV
Text: XML, PDF/A, HTML, ASCII, UTF-8
Web archive: WARC
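As a rough illustration of the considerations above, the following Python sketch lists the files in a project whose extensions fall outside a preferred set, so that conversion can be planned early. The extension set is an illustrative subset derived from the list above, and the mapping from format names to extensions, the function names and the sample file names are all assumptions of this sketch rather than part of the Stanford guidance.

```python
from pathlib import Path

# Illustrative subset of extensions for formats named in the list above.
# The mapping from format name to extension is an assumption of this sketch.
PREFERRED_EXTENSIONS = {
    ".csv", ".xml", ".tif", ".tiff", ".png",
    ".wav", ".pdf", ".txt", ".html", ".warc",
}


def needs_conversion(filename: str) -> bool:
    """Return True if the file's extension is not in the preferred set."""
    return Path(filename).suffix.lower() not in PREFERRED_EXTENSIONS


def review_files(filenames):
    """Return the file names that may need converting before deposit."""
    return [name for name in filenames if needs_conversion(name)]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    project_files = ["survey.csv", "notes.docx", "scan.TIFF", "results.xlsx"]
    for name in review_files(project_files):
        print(f"Consider converting: {name}")
```

An extension is only a clue to the format, so a check like this is a first pass; a repository's own requirements should take precedence.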
Additional helpful guidelines for selecting file formats can be found at these websites:
Choosing formats (Cambridge University Libraries - Data Management)
File formats (Cornell University - Research Data Management Service Group)
Library of Congress File Formats (Library of Congress)
The WHY questions
Data Capture
Why document data?
Now where did I put that file?
Finding and reusing your data will be easier, both for you and for other researchers, if you give a little thought early in the process to how you will name your data files and what file formats you will use to store your data. If you are planning to archive or share your data, you will also want to consider best practices for describing your data.
A crucial part of ensuring that research data can be shared and reused by a wide range of researchers for a variety of purposes is taking care that those data are accessible, understandable and (re)usable.
This requires clear and detailed data description and annotation. Besides the information that is needed to reuse the data, data also need to be accompanied by information for citing and discovering the data.
Documentation is the comprehensive description of the data and the contextual information that future researchers need to understand and use the data.
Documentation deposited alongside data files should enable users, with no prior knowledge of the research project and data collected, to understand exactly how the research was carried out and what the data mean, in order to (re)use the data correctly in their respective projects and for their respective purposes.
Original researchers wishing to return to their data some time later, or new users wanting to use data, need sufficient contextual and explanatory information to make sense of those data.
Research data should always be accompanied by documentation because it:
Enables you to understand/interpret data later
Makes data independently understandable, i.e. reusable
Makes results independently reproducible, starting from raw data
Helps avoid incorrect use/misinterpretation
As such, documentation is an essential step in making your data FAIR.
File Naming
A File Naming Convention (FNC) is a framework for naming your files in a way that describes what they contain and how they relate to other files.
A file naming convention (FNC) can help you stay organized by making it easy to identify the file(s) that contain the information that you are looking for just from its title and by grouping files that contain similar information close together. A good FNC can also help others better understand and navigate through your work. Purdue University
File name elements
Researchers are advised to decide on a naming convention for files at the start of the research project.
File names can be constructed using the following elements:
Project acronym
Content description
Date
Location
Creator name/initials
Status information (e.g. draft or final)
When constructing file names:
Give files a meaningful name. A file name might include a combination of elements, such as type of equipment used, date, and researcher's surname.
Decide on the best order for elements in a file name; it will affect how the files are sorted.
Keep names a reasonable length; some applications won't work well with long file names. A maximum of 25 characters is a good rule of thumb.
To separate elements in a file name, consider using underscores (_) or hyphens (-). Avoid using blank spaces in a file name. Use periods only to separate the file name from the file type extension (.txt, .jpg, etc.).
If including a date as part of the file name, use the standard format yyyymmdd to ensure that files sort in chronological order.
If your file name will include a numerical component, such as a subject number or version number, use leading zeros (001, 002, etc.) so that files sort in sequential order.
Avoid special characters like ~ ! @ # $ % ^ & * ( ) ` ; < > ? , [ ] { } ‘ “
Example: CONS_INT1_12-03-2019.rtf. Note that the date here is written with hyphens as 12-03-2019; following the yyyymmdd recommendation above, it would be written year first, without separators.
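The naming advice in this section can be applied consistently with a small helper. The Python sketch below is one possible approach, not a prescribed tool: the function name, its parameters and the choice to apply the 25-character rule of thumb to the name without its extension are assumptions. The date in the example above is ambiguous, so the sketch assumes it means 12 March 2019.

```python
import re
from datetime import date

MAX_STEM_LENGTH = 25  # rule of thumb from this guide, applied here to the name without extension


def build_file_name(project, content, when, extension, sequence=None):
    """Join name elements with underscores, using a yyyymmdd date.

    All parameter names are hypothetical; adapt the elements to your project.
    """
    parts = [project, content, when.strftime("%Y%m%d")]
    if sequence is not None:
        # Leading zeros keep numbered files in sequential order.
        parts.append(f"{sequence:03d}")
    stem = "_".join(parts)
    # Allow only letters, digits, underscores and hyphens (no spaces or special characters).
    if not re.fullmatch(r"[A-Za-z0-9_-]+", stem):
        raise ValueError(f"Unsupported characters in file name: {stem!r}")
    if len(stem) > MAX_STEM_LENGTH:
        raise ValueError(f"File name is longer than {MAX_STEM_LENGTH} characters: {stem!r}")
    return f"{stem}.{extension}"


if __name__ == "__main__":
    # Assumes the example date means 12 March 2019.
    print(build_file_name("CONS", "INT1", date(2019, 3, 12), "rtf"))
    print(build_file_name("CONS", "INT1", date(2019, 3, 12), "rtf", sequence=7))
```

Putting the date in yyyymmdd form means that an ordinary alphabetical sort of the file names is also a chronological sort.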
Version Control
Versioning refers to saving new copies of your files when you make changes so that you can go back and retrieve specific versions of your files later. Saving multiple versions makes it possible to decide at a later time that you prefer an earlier version. You can then immediately revert back to that version instead of having to retrace your steps to recreate it. University of Pittsburgh
Version control is a good research practice in the collaborative research environment. UK Data Service
When you work with different versions of a file, it can be a challenge to locate the 'correct' version or to know how versions differ from each other. If not done well, it can even be difficult to know which file preceded the other.
The matter is complicated further when files are kept in multiple locations and multiple users edit these files. To avoid confusion and safeguard against accidental loss, a versioning system can be put in place.
Example:
HealthTest-2008-04-06.docx
HealthTest-v02.docx
In its most basic form, versioning relies on a sequential numbering system. Within a given version number category (major, minor), these numbers are generally assigned in increasing order and correspond to changes in the data. The US Geological Survey recommends the following structure:
DataFileName_1.0 = original document
DataFileName_1.1 = original document with minor revisions
DataFileName_2.0 = document with substantial revisions
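The major/minor numbering above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names are hypothetical, versions are assumed to be written as "major.minor", and deciding whether a revision is minor or substantial is left to the researcher.

```python
def next_version(version: str, substantial: bool) -> str:
    """Return the next version number in a major.minor scheme.

    A substantial revision increments the major number and resets the minor
    number to 0; a minor revision increments the minor number only.
    """
    major, minor = (int(part) for part in version.split("."))
    if substantial:
        return f"{major + 1}.0"
    return f"{major}.{minor + 1}"


def versioned_name(base: str, version: str) -> str:
    """Append the version to a base file name, e.g. DataFileName_1.0."""
    return f"{base}_{version}"


if __name__ == "__main__":
    version = "1.0"
    print(versioned_name("DataFileName", version))       # original document
    version = next_version(version, substantial=False)
    print(versioned_name("DataFileName", version))       # minor revisions
    version = next_version(version, substantial=True)
    print(versioned_name("DataFileName", version))       # substantial revisions
```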
MANTRA - Richard Rodger - Organising data
13 May 2014
MANTRA - Richard Rodger - Organising data (short)
1 May 2020
MANTRA - Stephen Lawrie - File transformation
30 May 2014
MANTRA - Jeff Haywood - Importance of good file management in research
30 November 2011
MANTRA - Lynn Jamieson - Documenting research data
4 May 2012
MANTRA - Lynn Jamieson - Importance of documenting data in research
4 May 2012
John MacInnes - Tips on Documentation
3 June 2014
MANTRA - John MacInnes - Data documentation in secondary data analysis
4 May 2012
Metadata
What is metadata?
Metadata is "data that provides information about other data", but not the content of the data, such as the text of a message or the image itself. There are many distinct types of metadata, including: Descriptive metadata – the descriptive information about a resource. It is used for discovery and identification. Wikipedia
Why is metadata so important?
Librarians and publishers assign metadata to data in accordance with international standards, including ISO standards.
The data descriptors include:
Name of the data set
Description of the digital context
Brief description of the data
Any existing data or third-party sources that will be used
Content, type and coverage
Metadata standards & schemes
ISO 19115 for geographic information
Data Documentation Initiative (DDI)
Statistical Data and Metadata eXchange (SDMX)
Metadata Encoding and Transmission Standard (METS)
General International Standard Archival Description (ISAD(G))
DataCite metadata schema for the publication and citation of digital datasets with a persistent identifier.
How will you ensure data quality?
Librarians, publishers and data scientists use internationally recognised metadata practices, which help ensure that the data are described well enough to:
be read and interpreted in the future
consistently follow file naming conventions
follow version history and dates
list related data sets
validate the data
perform basic quality assurance and quality control on the data throughout the research project
identify values that are estimated
double-check data that are entered by hand (preferably entered by more than one person)
use quality level flags to indicate potential problems
check the format of the data is consistent across the data set
identify all stakeholders
identify data instruments, spatial descriptors
identify how to cite the data/data set
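Two of the checks in the list above, double-checking hand-entered data and using quality level flags, can be sketched in Python as follows. This is only an illustration: the function names, the flag labels ("ok", "missing", "suspect"), the plausible range and the sample values are assumptions made for the example, not an established standard.

```python
def compare_double_entry(first, second):
    """Return the record keys whose values differ between two independent entries.

    `first` and `second` map a record identifier to the value each person typed.
    A key present in only one of the two entries is also reported.
    """
    keys = sorted(set(first) | set(second))
    return [key for key in keys if first.get(key) != second.get(key)]


def flag_values(values, low, high):
    """Attach a simple quality flag to each value.

    The labels are hypothetical; use whatever flag scheme your project documents.
    """
    flags = []
    for value in values:
        if value is None:
            flags.append("missing")
        elif not (low <= value <= high):
            flags.append("suspect")
        else:
            flags.append("ok")
    return flags


if __name__ == "__main__":
    # Hypothetical ages typed in by two different people.
    entry_a = {"id01": 34, "id02": 51}
    entry_b = {"id01": 34, "id02": 15}
    print("Check by hand:", compare_double_entry(entry_a, entry_b))
    print(flag_values([34, None, 210], low=0, high=120))
```

Flagging a value rather than silently correcting it keeps the decision visible to later users of the data.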
Metadata represents data about data. Metadata enriches the data with information that makes it easier to find, use and manage. For instance, HTML tags define layout for human readers. Semantic metadata helps computers to interpret data by adding references to concepts in a knowledge graph.
Metadata are an important subset of core data documentation
Collating and recording metadata is important for the purposes of cataloguing, citing, discovering and retrieving data collections. Metadata are a subset of core data documentation providing standardised, structured information.
Metadata are intended for reading by machines, and help to explain the purpose, origin, time references, geographic location, creator, access conditions and terms of use of a data collection. Without this essential documentation, collections become of limited value simply because researchers and reusers will not be able to search for or cite the data collection. UK Data Service
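Because metadata are structured and machine-readable, a record can be checked for completeness before deposit. The Python sketch below uses the elements named in the paragraph above as field names. The field names, the function and every value in the sample record are hypothetical; the sketch does not follow DDI, DataCite or any other formal schema.

```python
import json

# Field names are assumptions based on the elements named in the text above.
REQUIRED_FIELDS = (
    "title",
    "description",
    "purpose",
    "origin",
    "time_period",
    "geographic_location",
    "creator",
    "access_conditions",
    "terms_of_use",
)


def missing_fields(record):
    """Return the required fields that are absent or empty in a metadata record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


# Entirely hypothetical record, for illustration only.
record = {
    "title": "Example interview study",
    "description": "Transcripts of semi-structured interviews",
    "purpose": "Illustrative example",
    "origin": "Collected by the project team",
    "time_period": "2019",
    "geographic_location": "Example region",
    "creator": "Example Researcher",
    "access_conditions": "Registered users only",
    "terms_of_use": "Example licence",
}

if __name__ == "__main__":
    print(json.dumps(record, indent=2))
    print("Missing fields:", missing_fields(record))
```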
Data Sharing
Do your chosen formats and software enable sharing and long-term access to the data?
A crucial part of ensuring that research data can be shared and reused by a wide range of researchers for a variety of purposes is taking care that those data are accessible, understandable and (re)usable.
This requires clear and detailed data description and annotation. Besides the information that is needed to reuse the data, data also need to be accompanied by information for citing and discovering the data. UK Data Service
Collaborative research
Collaborative research brings additional data management challenges for providing shared storage, access and the transfer of research data across the various partners or institutions. UK Data Service
Accessible data & authentication
Typical requirements for researchers working in a collaborative environment include:
Storage and the sharing of documents, plus data files.
The ability to organise documents and data files into folders.
An access control system, which allows authentication and authorisation to be easily managed.
Version control of documents and data files.
File locking to prevent users from simultaneously working on the same file.
Ideally, a discussion platform utilising a forum or wiki format.
Why consider data copyright?
Copyright is essential for data sharing and fair dealing
When data are shared or archived, the original copyright owner retains the copyright. UK Data Service
A data archive cannot archive data unless all rights holders are identified and give their permission for the data to be shared. Secondary users need to obtain copyright clearance before data can be reproduced. However, exceptions exist under the fair dealing concept. UK Data Service
Research Data ownership
Creative Commons is a nonprofit organization that helps overcome legal obstacles to the sharing of knowledge and creativity to address the world’s pressing challenges.
In the traditional publication process, authors transfer the copyright in their work to the publisher when the article is published.
However, when authors publish their work via the Open Access process, they retain the copyright of that work. It is important that authors assign a Creative Commons license to determine how their work may be used and shared.
The WHEN questions
Research Lifecycle
Data is collected, captured, managed, stored and preserved throughout the research process.
The HOW questions
What standards or methodologies will be used?
How data is managed depends on the types of data involved, how data is collected and stored, and how it is used throughout the research lifecycle.
Observational
Observational data is captured in real time and is usually unique and irreplaceable, e.g. brain images, survey data.
Derived or compiled
Derived or compiled data results from processing or combining 'raw' data; it is often reproducible but expensive, e.g. compiled databases, text mining, aggregate census data. UK Data Service
Reference or canonical
Reference or canonical data is a (static or organic) conglomeration or collection of smaller (peer reviewed) datasets, most probably published and curated, e.g. gene databanks, crystallographic databases. UK Data Service
Experimental data
Experimental data is data from experimental results, e.g. from lab equipment. It is often reproducible, but can be expensive, e.g. chromatograms, microassays.
Simulation
Simulation data is data generated from test models, where the model and metadata may be more important than the output data from the model, e.g. economic or climate models.
Qualitative
MANTRA - Lynn Jamieson - Challenges in working with qualitative data
4 May 2012
This unit introduces you to concepts around data, what constitutes research data, and the multiple forms of data that make up the digital world.
After completing this unit you will:
▪ Be able to distinguish between various types of research data.
▪ Recognise the importance of managing research materials.
▪ Be aware of challenges presented by data in society.
▪ Understand the need for data science and data literacy.
The aim of this unit is to introduce you to the concepts of research data organisation, explain why it is important, and what constitutes good data file management.
After completing this unit you will:
▪ Appreciate why research data organisation is important as your project grows.
▪ Understand data file naming, re-naming and versioning conventions.
▪ Be prepared to manage your code and track workflows to make them shareable and reproducible.
▪ See how electronic lab notebooks can support the collaborative research process.
This unit introduces you to the concepts of documentation and metadata.
After completing this unit you will:
▪ Understand why documenting your research data is important, and why documentation is important for future users of the data.
▪ Know why and when to use metadata.
▪ Understand the importance of citing data, and how to do it.
This unit introduces you to the concepts of data file formatting, compression, normalisation, and other kinds of data transformation and why they are useful.
After completing this unit you will:
▪ Understand why research data formatting and transformation is important.
▪ Know how to make decisions about data file formatting, compression, normalisation and other transformations.
▪ Use the information featured in the course to improve your research data management practice.
The UK Data Service provides various training opportunities.
There is a wealth of data available for reuse in research and reports. These free, interactive tutorials are designed for anyone who wants to start using secondary data. They show you how to get started with finding good quality data, understanding it and starting your analyses.
Best practice and training for researchers new to accessing and using data in our collection. Includes advice and tools to correctly cite data; student-specific information on our Dissertation Award for undergraduates; and more.
Survey data, including data from long-running surveys, series and longitudinal studies, are a major part of social science research. Learn how to use survey and longitudinal data through training resources including videos, on-demand webinars and written guides.
Qualitative research gives a voice to the lived experience, offering researchers a deeper insight into a topic or individuals’ experiences. Qualitative data can be combined with quantitative to enhance understanding around a policy or topic in a way that quantitative data by itself often cannot.
New technologies, resources and methods are constantly changing how researchers interact with and use data. This section provides the latest insights, learning materials and practical advice on rapidly developing techniques including modelling, simulation, big data, web-scraping, social media and more.
Learn how to use updateable subnational population information to assess the impact of policies and tackle area-based issues, such as neighbourhood deprivation and poor health. Includes guidance on generating local survey estimates and mapping data from key sources such as the Census and the UK Household Longitudinal Study.
Our international macrodata contain socio-economic time series data aggregated to a country or regional level for a range of countries over a substantial time period.