Project: A project is a set of activities required to produce certain defined outputs, or to accomplish specific goals or objectives within a defined schedule and resource budget. A project exists only for the duration of time required to complete its stated objectives. (Reference: Termium Plus)
Data (short definition): Data are a set of values of subjects with respect to qualitative or quantitative variables representing facts, statistics, or items of information in a formalized manner suitable for communication, reinterpretation, or processing (TBS Policy on Service and Digital, 2019).
Data (long definition): Data are facts, measurements, recordings, records, or observations about the world collected by scientists and others with a minimum of contextual interpretation. Data may be in any format or medium taking the form of writings, notes, numbers, symbols, text, images, films, video, sound recordings, pictorial reproductions, drawings, designs or other graphical representations, procedural manuals, forms, diagrams, work flow charts, equipment descriptions, data files, data processing algorithms/code/scripts, or statistical records (CODATA-IRiDiuM (2018) - International Research Data Management glossary).
The word “data” may be used very broadly to comprise data (in the strict sense) and the ecosystem of digital things that relate to data, including metadata, software and algorithms, as well as physical samples and analogue artefacts - and the digital representations and metadata relating to these things. (CODATA 2019 - Beijing Declaration on Research Data). There are dozens of other definitions of data that may be useful depending on the context.
Other examples of definitions of data:
DAMA-UK: Data are a re-interpretable representation of information in a formalised manner suitable for communication, interpretation or processing.
New Oxford Learner’s Dictionary (2021): Data are facts or information, especially when examined and used to find out things or to make decisions.
American Society for Quality (ASQ): Data are a set of collected facts. There are two basic kinds of numerical data: measured or variable data, and counted or attribute data.
International Organization for Standardization (ISO 11179): Data are a re-interpretable representation of information in a formalized manner suitable for communication, interpretation, or processing.
Statistics Canada: Data are facts, figures, observations, or recordings that can take the form of image, sound, text or physical measurements. Data can be gathered and processed to form conclusions. Data can come from many sources, and they can be split into two groups based on the form they take: structured data and unstructured data.
Scientific data are data that are used by researchers and scientists as primary sources to support technical and regulatory development or scientific enquiry, research, or scholarship, and that are used as evidence in the scientific process and/or are commonly accepted in the scientific community as necessary to validate scientific findings and results. All other digital and non-digital content has the potential to become research data. Examples of scientific data include data arising from: experiments, research and development, ‘citizen science’, surveys, operations, surveillance, monitoring, field analyzers or data-loggers, instruments, laboratory analyses, inventories, modeling and simulation output, processed data, and repurposed data. The scientific nature of the data is demonstrated when the process of creating, maintaining and quality-proofing the data complies with commonly recognized scientific standards.
Although scientific data share many aspects in common with other types of data (e.g., administrative data, financial data, business data), their processing frequently requires more complex software and infrastructure. The data themselves may be:
more complex (e.g., associated accuracy, precision, detection limits, confidence intervals, quality assurance/quality control procedures, etc.);
more tightly controlled;
held to higher standards;
retained for a longer period of time, often indefinitely;
documented more carefully and in greater detail (e.g., description of methods used to obtain measurements, etc.); and,
used as scientific evidence, which requires a higher level of credibility, reliability, validity, and accessibility.
FAIR Data (Findable, Accessible, Interoperable, and Reusable): FAIR principles emphasise machine-actionability, i.e., the capacity of computational systems to find, access, interoperate, and reuse data with no or minimal human intervention (GoFAIR, Wellcome Open Research). However, FAIR data are not guaranteed to be either ethical or reproducible.
FAIRER Data (FAIR + Ethical + Reproducible): While the FAIR principles have become a guiding technical resource for data sharing, legal and socio-ethical considerations are equally important for a FAIR data ecosystem for further uses of data (CINECA 2023). FAIR data should be FAIRER, with ethical and reproducible included as key components (CINECA 2023, G7 OSWG 2023). Data in particular fields may have additional data qualities (e.g., real-time, resilient, high availability, timely, equitable, spatial, timeseries, connected, etc.), but these attributes do not necessarily apply across the board. The FAIRER acronym embodies a universally essential set of data management principles that should apply across all data domains, with Ethical + Reproducible capturing the requirement for transparency.
Ethical Data: Ethical data means that: (a) Data are collected and managed in compliance with relevant government and professional codes of conduct, values and ethics, scientific integrity and responsible conduct of research; (b) Restricted, confidential, and sensitive data are handled appropriately, for example by implementing user authentication and controlled access to the data and/or data anonymization and de-identification; (c) A statement is made as to whether or not Indigenous considerations exist and, where applicable, Indigenous data sovereignty is respected and data are managed in accordance with CARE, OCAP and UNDRIP principles; (d) Data assets are managed in a manner such that data used as input to Big Data or Artificial Intelligence applications can be confirmed to be relevant, accurate, and up-to-date, and can be tested for unintended biases (TBS Directive 2019); (e) Contributor and contact person information is provided.
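As an illustration of item (b), the following is a minimal sketch of one possible de-identification step: replacing direct identifiers with keyed pseudonyms. The function names, the field names, and the choice of HMAC-SHA-256 are assumptions made for this example and are not prescribed by any of the cited policies. Pseudonymization alone may not be enough to anonymize a dataset, since indirect identifiers can still allow re-identification.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, secret_key: bytes) -> str:
    """Return a keyed SHA-256 pseudonym for a direct identifier (hypothetical helper)."""
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def deidentify(record: dict, direct_identifiers: set, secret_key: bytes) -> dict:
    """Return a copy of the record with direct identifiers replaced by pseudonyms."""
    cleaned = {}
    for name, value in record.items():
        if name in direct_identifiers:
            cleaned[name] = pseudonymize(str(value), secret_key)
        else:
            cleaned[name] = value
    return cleaned


# Hypothetical record and key; a real key would be stored under access control.
example_key = b"replace-with-a-secret-key"
example_record = {"participant": "Jane Doe", "age_group": "30-39"}
example_output = deidentify(example_record, {"participant"}, example_key)
```

Because the pseudonym is keyed, the same participant maps to the same value across files, so records can still be linked without exposing the name, while anyone without the key cannot recompute the mapping.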
Reproducible Data: Reproducible data and code means that the final data and code are computationally reproducible within some tolerance interval or defined limits of precision and accuracy, i.e., a 3rd party will be able to verify the data lineage and processing, reanalyze the data and obtain consistent computational results using the same input raw data, computational steps, methods, computer software and code, and conditions of analysis in order to determine if the same result emerges from the reprocessing and reanalysis. “Same result” can mean different things in different contexts: identical measures in a fully deterministic context, the same numeric results but differing in some irrelevant detail, statistically similar results in a non-deterministic context, or validation of a hypothesis. All data and code are made available for 3rd-party verification of reproducibility. Note that reproducibility is a different concept from replicability. In the latter case, the final published data are linked to sufficiently detailed methods and information for a 3rd party to be able to verify the results based on the independent collection of new raw data using similar or different methods but leading to comparable results.
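The idea of reproducibility "within some tolerance interval" can be sketched as a simple check: confirm that the rerun used the same raw input, then compare the rerun results with the published ones within a stated tolerance. This is a minimal sketch under assumptions; the function names, the stand-in analysis, and the default tolerance are hypothetical and would be set by the project in practice.

```python
import hashlib
import math


def input_digest(raw_bytes: bytes) -> str:
    """Return the SHA-256 hex digest used to confirm the same raw input was used."""
    return hashlib.sha256(raw_bytes).hexdigest()


def results_match(published, rerun, rel_tol=1e-9, abs_tol=0.0) -> bool:
    """Compare two sequences of numeric results element by element within a tolerance."""
    if len(published) != len(rerun):
        return False
    return all(
        math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
        for a, b in zip(published, rerun)
    )


def analysis(values):
    """Hypothetical stand-in for the published computational steps: mean and sample variance."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return [mean, variance]


# Hypothetical raw data; a 3rd party reruns the same steps on the same input.
raw_values = [1.0, 2.0, 3.0, 4.0]
published_results = analysis(raw_values)
rerun_results = analysis(list(raw_values))
reproduced = results_match(published_results, rerun_results)
```

In a fully deterministic context the tolerance could be set to zero to require identical values; in a non-deterministic context a statistical comparison would replace the element-wise check.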
DMP: A Data Management Plan (DMP) provides information required by various stakeholders (e.g., funders, managers, data stewards, researchers, scientists, librarians, IT support, etc.) about a specific dataset or about a project and its data for the purpose of costing, project management, data management, data curation, and open science.
maDMP (short definition): A machine-actionable Data Management Plan (maDMP), updated during the entire data lifecycle, provides information about a specific dataset or about a project and its data in a discipline-agnostic, standardized manner that is readable and reusable by both humans and automated systems. maDMPs facilitate collaboration, reporting, compliance, and integration with automated systems.
maDMP (long definition): Machine-actionable Data Management Plans (maDMPs) are an enterprise solution that operationalizes FAIRER (Findable, Accessible, Interoperable, Reusable, Ethical, and Reproducible) data management principles and enables an organization to plan more easily, document costing and funding, track inputs and outputs, provide customized reports, and ensure transparency throughout the data lifecycle. They provide information about contributors, partner agreements, distributions and licensing, storage, technical resources and computing needs, processing workflows, associated code and software, security and privacy, data quality, ethical issues, Indigenous considerations, retention and disposition, approvals, and more. maDMPs are the means for rapidly building reliable, lightweight, scalable, and easily customized automated systems with appropriate access controls while maximizing interoperability.
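To make "readable and reusable by both humans and automated systems" concrete, the following minimal sketch stores a plan as structured data, serializes it to JSON, and lets an automated step query it. All class and field names here are hypothetical and chosen for illustration only; they are not taken from any maDMP standard or from the source definitions.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Dataset:
    """One dataset described by the plan (hypothetical fields)."""
    title: str
    license: str
    retention_years: int
    contains_sensitive_data: bool = False


@dataclass
class DataManagementPlan:
    """A simplified, illustrative plan record; not a standard schema."""
    project_title: str
    contact: str
    last_modified: str  # ISO 8601 date, updated throughout the data lifecycle
    contributors: list = field(default_factory=list)
    datasets: list = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the plan so that other systems can consume it."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


def datasets_needing_access_control(plan_json: str) -> list:
    """Automated check: list titles of datasets flagged as sensitive."""
    plan = json.loads(plan_json)
    return [d["title"] for d in plan["datasets"] if d["contains_sensitive_data"]]


# Hypothetical plan content.
plan = DataManagementPlan(
    project_title="Example lake monitoring project",
    contact="data.steward@example.org",
    last_modified="2024-01-15",
    contributors=["A. Researcher"],
    datasets=[
        Dataset("Water temperature readings", "CC-BY-4.0", 10),
        Dataset("Survey responses", "restricted", 7, True),
    ],
)
flagged = datasets_needing_access_control(plan.to_json())
```

The same JSON that a person can read in a report can also drive an automated step, here a check that identifies which datasets need controlled access.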