Items in this list were selected from the Open Data Handbook, whose content is licensed under a CC Attribution 4.0 International License. Additional definitions in the Open Data Handbook are linked to from some of the descriptions below.
Application Programming Interface. For data, this is usually a way provided by the data publisher for programs or apps to read data directly over the web. The app sends the API a query asking for the specific data it needs, e.g. the time of the next bus leaving a particular stop. This allows the app to use the data without downloading the whole dataset, saving bandwidth and ensuring that the data used is the most up-to-date available.
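As a rough illustration of the kind of query described above, the sketch below builds the URL an app might request to ask for the next departure from one stop. The endpoint and the parameter names (`stop`, `limit`) are hypothetical and not taken from any real API; no request is actually sent.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and parameter names, used only for illustration.
BASE_URL = "https://api.example.org/v1/departures"


def build_query_url(stop_id: str, limit: int = 1) -> str:
    """Return the URL an app would request to ask for the next departures."""
    params = {"stop": stop_id, "limit": limit}
    return f"{BASE_URL}?{urlencode(params)}"


url = build_query_url("STOP-42")
print(url)
```

The app asks only for the stop it cares about, so it never needs to download the full timetable.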
Processing data that includes personal information so that individuals can no longer be identified in the resulting data. Anonymisation enables data to be published without breaching data protection principles. The principal techniques are aggregation and de-identification. Care must be taken to avoid data leakage that would result in individuals’ privacy being compromised. UKAN studies best practice in data anonymisation.
A piece of software (short for ‘application’), especially one designed to run on the web or on mobile phones and similar platforms. Apps can make network connections to large databases and thus be a powerful way of consuming open data, which may be real-time, personalised, and (using a mobile phone’s GPS) location-specific information. Crowdsourcing apps can also be used to build or improve datasets.
Acknowledging the source of data when using or re-publishing it. A data licence permitting the data to be used may include a requirement to attribute the source. Data subject to this restriction may still be considered open data according to the Open Definition.
A collection of data so large that it cannot be stored, transmitted or processed by traditional means. The increasing availability of and need to process such datasets (for example, huge collections of weather or other scientific data) has led to the development of specialised computer technologies, architectures and programming languages.
Data is available in bulk if the entire dataset can be downloaded easily and efficiently to a user’s own system. Conversely, it is non-bulk if one is limited to getting small parts of the dataset at a time, for example if each request returns only a few elements, so that thousands or millions of requests are needed to get the entire dataset. The provision of bulk access is a requirement of open data.
‘Comma-separated values’, a standard format for spreadsheet data. Data is represented in a plain text file, with each data row on a new line and commas separating the values on each row. As a very simple open format it is easy to consume and is widely used for publishing open data.
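A minimal sketch of consuming such a file with Python's standard `csv` module. The sample data and column names are made up for illustration; a real file would typically be opened with `open(path, newline="")` instead of being read from a string.

```python
import csv
import io

# Made-up sample data: one header row, then one data row per line.
CSV_TEXT = "school,pupils\nHilltop Primary,210\nRiverside High,845\n"

# DictReader uses the first row as column labels for every later row.
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
total_pupils = sum(int(row["pupils"]) for row in rows)
print(rows[0]["school"], total_pupils)
```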
Actively involving the public in policy and decision-making. Citizen engagement is a central aim of open government, with the aims of improving decision making and gaining or retaining citizens’ consent and support. Open data is an essential tool for ensuring informed engagement.
Building tools and communities, usually online, that address particular civic or social problems. Examples could be tools that help users meet like-minded people locally based on particular interests, report broken infrastructure to their local council, or collaborate to clear litter from their neighbourhood. Local-level open data is particularly useful for civic hacking projects.
Data stored ‘in the cloud’ is handled by a hosting company, relieving the data owner of the need to manage its physical storage. Instead of being stored on a single machine, it may be stored across or moved between multiple machines in different locations, but the data owner and users do not need to know the details. The hosting company is responsible for keeping it available and accessible via the internet.
The process of automatically reading data in one file format and emitting the same data in a different format, thus making the data accessible to a wider range of applications.
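One possible conversion, sketched with the Python standard library: the same records are read as CSV and emitted as JSON. The function name `csv_to_json` and the sample data are illustrative assumptions, not part of any particular tool.

```python
import csv
import io
import json


def csv_to_json(csv_text: str) -> str:
    """Read CSV text and emit the same records as a JSON array of objects."""
    records = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(records, indent=2)


# Made-up sample data for illustration.
sample = "stop,next_bus\nA1,09:15\nB2,09:22\n"
converted = csv_to_json(sample)
print(converted)
```

Note that CSV carries no type information, so every value arrives in the JSON output as a string unless the converter is told how to interpret each column.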
A legal right over intellectual property (e.g. a book) belonging to the creator of the work. While individual data (facts) cannot be copyrighted, a database will in general be covered by copyright protecting the selection and arrangement of data within it. A copyright holder may use a licence to grant other people rights in the protected material, perhaps subject to specified restrictions.
A non-profit organisation founded in 2001 that promotes re-usable content by publishing a number of standard licences, some of them open (though others include a non-commercial clause), that can be used to release content for re-use, together with clear explanations of their meaning.
Data may be thought of as unprocessed atomic statements of fact. It very often refers to systematic collections of numerical information in tables of numbers such as spreadsheets or databases. When data is structured and presented so as to be useful and relevant for a particular purpose, it becomes information available for human apprehension. See also knowledge.
A system that allows outsiders to be granted access to databases without overloading either system.
Processing a dataset to make it easier to consume. This may involve fixing inconsistencies and errors, removing non-machine-readable elements such as formatting, using standard labels for row and column headings, ensuring that numbers, dates, and other quantities are represented appropriately, conversion to a suitable file format, reconciliation of labels with another dataset being used (see data integration), etc. See data quality.
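A small sketch of a few of these cleaning steps applied to one row: standardising column labels, stripping formatting from a number, and representing a date in ISO 8601 form. The field names, the currency symbol and the day/month/year input order are assumptions made for the example.

```python
from datetime import datetime


def clean_record(raw: dict) -> dict:
    """Normalise one raw row: standard labels, numeric amounts, ISO dates."""
    # Standardise column labels: trim whitespace, lower-case, use underscores.
    row = {
        key.strip().lower().replace(" ", "_"): value.strip()
        for key, value in raw.items()
    }
    # Remove formatting (currency symbol, thousands separators) from the number.
    row["amount"] = float(row["amount"].replace("£", "").replace(",", ""))
    # Represent the date in ISO 8601 form (assumes day/month/year input).
    row["date"] = datetime.strptime(row["date"], "%d/%m/%Y").date().isoformat()
    return row


cleaned = clean_record(
    {" Date ": "03/02/2015", "Amount": "£1,250.50 ", "Supplier": " Acme Ltd"}
)
print(cleaned)
```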
Datasets are created by collecting data in different ways: from manual or automatic measurements (e.g. weather data), surveys (census data), records of decisions (budget data) or ongoing transactions (spending data), aggregation of many records (crime data), mathematical modelling (population projections), etc.
Almost any interesting use of data will combine data from different sources. To do this it is necessary to ensure that the different datasets are compatible: they must use the same names for the same objects, the same units or co-ordinates, etc. If the data quality is good this process of data integration may be straightforward but if not it is likely to be arduous. A key aim of linked data is to make data integration fully or nearly fully automatic. Non-open data is a barrier to data integration, as obtaining the data and establishing the necessary permission to use it is time-consuming and must be done afresh for each dataset.
If personal data has been imperfectly anonymised, it may be possible by piecing it together (perhaps with data available from other sources) to reconstruct the identity of some data subjects together with personal data about them. The personal data, which should not have been published (see data protection), may be said to have ‘leaked’ from the ‘anonymised’ data. Other kinds of confidential data can also be subject to leakage by, for example, poor data security measures. See de-identification.
The policies, procedures, and technical choices used to handle data through its entire lifecycle from data collection to storage, preservation and use. A data management policy should take account of the needs of data quality, availability, data protection, data preservation, etc.
A web platform for publishing data. The aim of a data portal is to provide a data catalogue, making data not only available but discoverable for data users, while offering a convenient publishing workflow for publishing organisations. Typical features are web interfaces for publishing and for searching and browsing the catalogue, machine interfaces (APIs) to enable automatic publishing from other systems, and data preview and visualisation.
Data protection legislation is not about protecting the data, but about protecting the right of citizens to live without fear that information about their private lives might become public. The law protects privacy (such as information about a person’s economic status, health and political position) and other rights such as the right to freedom of movement and assembly. For example, in Finland a travel card system was used to record all instances when the card was shown to the reader machine on different public transport lines. This raised a debate from the perspective of freedom of movement and the travel card data collection was abandoned based on the data protection legislation.
A measure of the usability of data. An ideal dataset is accurate, complete, timely in publication, consistent in its naming of items and its handling of e.g. missing data, directly machine-readable (see data cleaning), conformant to standards of nomenclature in the field, and published with sufficient metadata that users can easily understand, for example, who it is published by and the meaning of the variables in the dataset.
A person who converts data into a usable form so that it can be easily used with automated or semi-automated tools. Data wrangling may include further data cleaning.
(i) Any organised collection of data may be considered a database. In this sense the word is synonymous with dataset.
(ii) A software system for processing and managing data, including features to extend or update, transform and query the data. Examples are the open source PostgreSQL, and the proprietary Microsoft Access.
Database rights
A right to prevent others from extracting and reusing content from a database. Exists mainly in European jurisdictions.
Any organised collection of data. ‘Dataset’ is a flexible term and may refer to an entire database, a spreadsheet or other data file, or a related collection of data resources.
A form of anonymisation where personal records are kept intact but specific identifying information, such as names, are replaced with anonymous identifiers. Compared to aggregation, de-identification carries a greater risk of data leakage: for example, if prison records include a prisoner’s criminal record and medical history, the prisoner could in many cases be identified even without their name by their criminal record, giving unauthorised access to their medical history. In other cases this risk is absent, or the value of the un-aggregated data is so great that it is worth making de-identified data available subject to carefully designed safeguards.
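To make the contrast with aggregation concrete, here is a minimal sketch using made-up records. The helper names `de_identify` and `aggregate`, the field names and the `P001`-style identifiers are all illustrative assumptions. As the entry above warns, the fields left in a de-identified record may still be enough to identify someone.

```python
from collections import Counter
from itertools import count

# Made-up records for illustration only.
records = [
    {"name": "Alice Smith", "ward": "North", "condition": "asthma"},
    {"name": "Bob Jones", "ward": "North", "condition": "diabetes"},
    {"name": "Alice Smith", "ward": "South", "condition": "asthma"},
]


def de_identify(rows):
    """Replace each name with an arbitrary identifier, keeping records intact."""
    ids = {}
    counter = count(1)
    result = []
    for row in rows:
        if row["name"] not in ids:
            ids[row["name"]] = f"P{next(counter):03d}"
        anonymous = {k: v for k, v in row.items() if k != "name"}
        anonymous["person_id"] = ids[row["name"]]
        result.append(anonymous)
    return result


def aggregate(rows):
    """Publish only counts per condition; no individual records survive."""
    return dict(Counter(row["condition"] for row in rows))


deidentified = de_identify(records)
totals = aggregate(records)
print(deidentified)
print(totals)
```

The de-identified output still has one row per original record, which preserves detail but also the leakage risk; the aggregated output keeps only totals.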
An ordinary table or spreadsheet can easily represent two data dimensions: each data point has a row and a column. Plenty of real-world data has more dimensions, however: for example, a dataset of Earth surface temperature varying with position and time (two co-ordinates are required to specify the position on earth, e.g. latitude and longitude, and one to specify the time).
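One common way to fit such data into a two-dimensional table is to give each dimension its own column, so that every row is a single observation. The sketch below does this for a few made-up temperature readings; the function name and column labels are assumptions for the example.

```python
# Made-up readings keyed by (latitude, longitude, year): three dimensions.
temperatures = {
    (51.5, -0.1, 2000): 11.2,
    (51.5, -0.1, 2001): 11.4,
    (60.2, 24.9, 2000): 5.9,
}


def to_long_table(cube):
    """Flatten the data so each dimension becomes a column of a 2-D table."""
    header = ("latitude", "longitude", "year", "temperature_c")
    rows = [(*coords, value) for coords, value in sorted(cube.items())]
    return [header, *rows]


table = to_long_table(temperatures)
for line in table:
    print(line)
```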
It is not enough for open data to be published if potential users cannot find it, or even do not know that it exists. Rather than simply publishing data haphazardly on websites, governments and other large data publishers can help make their datasets discoverable by indexing them in catalogues or data portals.
The description of how a file is represented on a computer disk. The format usually corresponds to the last part of the file name (‘extension’), e.g. a file in CSV format might be called schools-list.csv. The file format refers to the internal format of the file, not how it is displayed to users. E.g. CSV and XLS files are structured very differently on disk, but may look similar or identical when opened in a spreadsheet program such as Excel.
A rating system for open data proposed by Tim Berners-Lee, inventor of the World Wide Web. To score the maximum five stars, data must (1) be available on the Web under an open licence, (2) be in the form of structured data, (3) be in a non-proprietary file format, (4) use URIs as its identifiers (see also RDF), (5) include links to other data sources (see linked data). To score 3 stars, it must satisfy all of (1)-(3), etc.
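The cumulative nature of the scheme can be sketched as a short function: a star counts only if every earlier criterion is also met. The function and parameter names are illustrative assumptions, not part of the rating scheme itself.

```python
def star_rating(open_licence, structured, non_proprietary, uses_uris, linked):
    """Count stars cumulatively: each star requires all earlier criteria."""
    criteria = [open_licence, structured, non_proprietary, uses_uris, linked]
    stars = 0
    for met in criteria:
        if not met:
            break  # A failed criterion blocks every later star.
        stars += 1
    return stars


# An openly licensed CSV file on the web, without URIs or links.
csv_on_web = star_rating(True, True, True, False, False)
print(csv_on_web)
```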
Geographical Information System, any computer system designed to read, display, analyse and manipulate geodata.
A dialect of JSON with specialised features for describing geodata, and hence a popular interchange format for geodata.
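A minimal sketch of what a GeoJSON document looks like when built and read back with Python's `json` module. The stop name and location are made up; note that GeoJSON lists coordinates as longitude first, then latitude.

```python
import json

# A GeoJSON Feature for a made-up bus stop.
# Coordinates are given as [longitude, latitude].
feature = {
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [-0.1276, 51.5072]},
    "properties": {"name": "Example Stop"},
}
collection = {"type": "FeatureCollection", "features": [feature]}

# Serialise to text, then parse it back as any JSON consumer would.
text = json.dumps(collection)
longitude, latitude = json.loads(text)["features"][0]["geometry"]["coordinates"]
print(longitude, latitude)
```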
Any dataset where data points include a location, e.g. as latitude and longitude or another standard encoding. Maps, transport routes, environmental data, cadastral data, and many other kinds of data can be published as geodata.
The work of government involves collecting huge amounts of data, much of which is not confidential (economic data, demographic data, spending data, crime data, transport data, etc). The value of much of this data can be greatly enhanced by releasing it as open data, freeing it for re-use by business, research, civil society, data journalists, etc.
An event, usually over one or two days, where developers, subject experts and others come together to create apps, visualisations and prototypes that aim to address problems in a particular domain, usually making heavy use of data. Hackathons focusing on a particular collection of data are a possible form of community engagement by data publishers. The hackathon is a popular format in the open source community.
A company that stores a customer’s data on its own (the host’s) computers and makes it available over the internet. A hosted service is one that runs and stores data on the service-provider’s computers and is accessed over the network. See also SaaS.
Data in a format that can be conveniently read by a human. Some human-readable formats, such as PDF, are not machine-readable as they are not structured data, i.e. the representation of the data on disk does not represent the actual relationships present in the data.
The name of an object or concept in a database. An identifier may be the object’s actual name (e.g. ‘London’ or ‘W1 1AA’, a London postcode), or a word describing the concept (‘population’), or an arbitrary identifier such as ‘XY123’ that makes sense only in the context of the particular dataset. Careful choice of identifiers using relevant standards can facilitate data integration. See linked data.
A structured collection of data presented in a form that people can understand and process. Information is converted into knowledge when it is contextualised with the rest of a person’s knowledge and world model.
JavaScript Object Notation, a simple but powerful format for data. It can describe complex data structures, is highly machine-readable as well as reasonably human-readable, and is independent of platform and programming language, and is therefore a popular format for data interchange between programs and systems.
Keyhole Markup Language, an XML-based open format for geodata. KML was devised for Keyhole Earth Viewer, later acquired by Google and renamed Google Earth, but has been an international standard of the Open Geospatial Consortium since 2008.
The sum of a person’s - or mankind’s - information about and ability to understand the world. See also data.
A legal instrument by which a copyright holder may grant rights over the protected work. Data and content are open if they are subject to an explicitly-applied licence that conforms to the Open Definition. A range of standard open licences are available, such as the Creative Commons CC-BY licence, which requires only attribution.
If Project X publishes content, and wants to include content from Project Y, it is necessary that Y’s licence permits at least the same range of re-uses as X’s licence. For example, content published under a non-commercial licence cannot be included in Wikipedia, since Wikipedia’s open licence includes rights for commercial re-use which cannot be granted for the non-commercial data, an example of a failure of licences to mix well.
A form of data representation where every identifier is an http://… URI, using standard lists (see vocabulary) of identifiers where possible, and where datasets include links to reference datasets of the same objects. A key aim is to make data integration automatic, even for large datasets. Linked data is usually represented using RDF. See also five stars of open data; triple store.
Data in a data format that can be automatically read and processed by a computer, such as CSV, JSON, XML, etc. Machine-readable data must be structured data. Compare human-readable.
Non-digital material (for example printed or hand-written documents) is by its non-digital nature not machine-readable. But even digital material need not be machine-readable. For example, consider a PDF document containing tables of data. These are definitely digital but are not machine-readable because a computer would struggle to access the tabular information - even though they are very human readable. The equivalent tables in a format such as a spreadsheet would be machine readable.
As another example, scans (photographs) of text are not machine-readable (but are human readable!), but the equivalent text in a format such as a simple ASCII text file can be machine readable and processable.
Note: The appropriate machine readable format may vary by type of data - so, for example, machine readable formats for geographic data may differ from those for tabular data.
If something is visible to many people then, collectively, they are more likely to find errors in it. Publishing open data can therefore be a way to improve its accuracy and data quality, especially where a good interface for reporting errors is provided. See crowdsourcing.
Information about a dataset such as its title and description, method of collection, author or publisher, area and time period covered, licence, date and frequency of release, etc. It is essential to publish data with adequate metadata to aid both discoverability and usability of the data.
Non-governmental organisation. NGOs are voluntary, non-profit organisations focussing on charitable work, community-building, campaigning, research, etc, making up a vital part of civil society.
A restriction, as part of a licence, that content cannot be freely re-used for ‘commercial’ purposes. Content or data subject to a non-commercial restriction is not open, according to the Open Definition. Such a restriction reduces economic value and causes problems with licence mixing, as well as often ruling out more than is intended (for example, it is often unclear whether educational uses are ‘commercial’). The intent of a non-commercial clause may be better captured by a share-alike requirement.
Open Data Readiness Assessment, a framework created by the World Bank for assessing the opportunities, obstacles and next steps to be taken in a country (especially a developing country) considering publishing government data as open data.
Open Database Licence, an attempt to create an open licence for data which covers the ‘database rights’ (see copyright) as well as copyright itself. It does this by imposing contractual obligations on the data re-user. Unfortunately contract law is fundamentally different from copyright law, since copyright is inherent in a work and binds all downstream users of the work, whereas a contract only binds the parties to the contract and has no force on a later re-user of re-published data. The ODbL remains useful nevertheless, and other attempts are being made to create open licences specifically for data.
The principle that access to the published papers and other results of research, especially publicly-funded research, should be freely available to all. This contrasts with the traditional model where research is published in journals which charge subscription fees to readers. Besides benefits similar to the benefits of open data, proponents suggest that it is immoral to withhold potentially life-saving and valuable research from some readers who may be able to use or build on it. Open-access journals now exist and the interest of research funders is giving them some traction, especially in the sciences.
Data is open if it can be freely accessed, used, modified and shared by anyone for any purpose - subject only, at most, to requirements to provide attribution and/or share-alike. Specifically, open data is defined by the Open Definition and requires that the data be (A) legally open, that is, available under an open (data) licence that permits anyone freely to access, reuse and redistribute it; and (B) technically open, that is, available for no more than the cost of reproduction and in machine-readable and bulk form.
Software for which the source code is available under an open licence. Not only can the software be used for free, but users with the necessary technical skills can inspect the source code, modify it and run their own versions of the code, helping to fix bugs, develop new features, etc. Some large open source software projects have thousands of volunteer contributors. The Open Definition was heavily based on the earlier Open Source Definition, which sets out the conditions under which software can be considered open source.
A file format with no restrictions, monetary or otherwise, placed upon its use, and which can be fully processed with at least one free/libre/open-source software tool. Patents are a common source of restrictions that make a format proprietary. Often, but not necessarily, the structure of an open format is set out in agreed standards, overseen and published by a non-commercial expert body. A file in an open format enjoys the guarantee that it can be correctly read by a range of different software programs or used to pass information between them.
Open government, in line with the open movement generally, seeks to make the workings of governments transparent, accountable, and responsive to citizens. It includes the ideals of democracy, due process, citizen participation and open government data. A thorough-going approach to open government would also seek to enable citizen participation in, for example, the drafting and revising of legislation and budget-setting. See OGP.
The open movement seeks to work towards solutions of many of the world’s most pressing problems in a spirit of transparency, collaboration, re-use and free access. It encompasses open data, open government, open development, open science and much more. Participatory processes, sharing of knowledge and outputs and open source software are among its key tools. The specific definition of “open” as applied to data, knowledge and content, is set out by the Open Definition.
Generally understood as technical standards which are free from licensing restrictions. Can also be interpreted to mean standards which are developed in a vendor-neutral manner.
(i) Proprietary software is owned by a company which restricts the ways in which it can be used. Users normally need to pay to use the software, cannot read or modify the source code, and cannot copy the software or re-sell it as part of their own product. Common examples include Microsoft Excel and Adobe Acrobat. Non-proprietary software is usually open source.
(ii) A proprietary file format is one that a company owns and controls. Data in this format may need proprietary software to be read reliably. Unlike an open format, the description of the format may be confidential or unpublished, and can be changed by the company at any time. Proprietary software usually reads and saves data in its own proprietary format. For example, different versions of Microsoft Excel use the proprietary XLS and XLSX formats.
Content to which copyright does not apply, for example because it has expired, is free for any kind of use by anyone and is said to be in the public domain. CC0, one of the licences of Creative Commons, is a ‘public domain dedication’ which attempts so far as possible to renounce all rights in the work and place it in the public domain.
Anyone who distributes and makes available data or other content. Data publishers include government departments and agencies, research establishments, NGOs, media organisations, commercial companies, individuals, etc.
A type of question accepted by a database about the data it holds. A complex query may ask the database to select records according to some criteria, aggregate certain quantities across those records, etc. Many databases accept queries in the specialised language SQL or dialects of it. A web API allows an app to send queries to a database over the web. Compared with downloading and processing the data, this reduces both the computation load on the app and the bandwidth needed.
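The sketch below shows a query that selects records by a criterion and aggregates a quantity across them, using Python's built-in `sqlite3` module and an in-memory database. The table, column names and figures are made up for illustration.

```python
import sqlite3

# Build a small in-memory database with made-up spending records.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spending (department TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO spending VALUES (?, ?)",
    [("Health", 120.0), ("Health", 80.0), ("Transport", 50.0)],
)

# Select records of at least a given amount, then total them per department.
query = """
    SELECT department, SUM(amount) AS total
    FROM spending
    WHERE amount >= ?
    GROUP BY department
    ORDER BY department
"""
totals_by_department = conn.execute(query, (60.0,)).fetchall()
conn.close()
print(totals_by_department)
```

The database does the filtering and summing, so the caller receives only the small result rather than every record.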
The original data, in machine-readable form, underlying any application, visualisation, published research or interpretation, etc.
It is rare that data gathered for a particular purpose does not have other possible uses. Happily, data is an infinite resource (see tragedy of the anti-commons); once gathered, for whatever reason, it can be re-used again and again, in ways that were never envisaged when it was collected, provided only that the data-holder makes it available under an open licence to enable such re-use.
Data (such as the current location of trains on a network) which is being constantly updated, where a query needs to be against the latest version of the data.
Experimental research in the sciences and social sciences produces large quantities of data. Research data management (RDM) is an emerging discipline that seeks best practices in handling this. Traditionally the data was kept by researchers and only final research outputs, such as papers analysing the data, would be published. Open science holds that the data should be published, both to increase verifiability of the work and to enable it to be used in other research. The full spirit of open science collaboration demands data publication early in the project, but research culture will need to change appreciably before this becomes widespread.
Structured Query Language, a standard language used for interrogating many types of database. See query.
Software as a Service, i.e. a software program that runs, not on the user’s machine, but on the machines of a hosting company, which the user accesses over the web. The host takes care of associated data storage, and normally charges for the use of the service or monetises its client base in other ways.
Extracting data from a non-machine-readable source, such as a website or a PDF document, and creating structured data from the result. Screen-scraping a dataset requires dedicated programming and is expensive in programmer time, so is generally done only after all other attempts to get the data in structured form have failed. Legal questions may arise about whether the scraping breaches the source website’s copyright or terms of service.
A computer on the internet, usually managed by a hosting company, that responds to requests from a user, e.g. for web pages, downloaded files or to access features in a SaaS package being run on the server.
A popular file format for geodata, maintained and published by Esri, a manufacturer of GIS software. A Shapefile actually consists of several related files. Though the format is technically proprietary, Esri publishes a full specification standard and Shapefiles can be read by a wide range of software, so function somewhat like an open standard in practice.
A licence that requires users of a work to provide the content under the same or similar conditions as the original.
The files of computer code written by programmers that are used to produce a piece of software. The source code is usually converted or ‘compiled’ into a form that the user’s computer can execute. The user therefore never sees the original source code, unless it is published as open source.
A table of data and calculations that can be processed interactively with a specialised spreadsheet program such as Microsoft Excel or OpenOffice Calc.
A published specification for, e.g., the structure of a particular file format, recommended nomenclature to use in a particular domain, a common set of metadata fields, etc. Conforming to relevant standards greatly increases the value of published data by improving machine readability and easing data integration.
All data has some structure, but ‘structured data’ refers to data where the structural relation between elements is explicit in the way the data is stored on a computer disk. XML and JSON are common formats that allow many types of structure to be represented. The internal representation of, for example, word-processing documents or PDF documents reflects the positioning of entities on the page, not their logical structure, which is correspondingly difficult or impossible to extract automatically.
Tab-separated values (TSV) are a very common form of text file format for sharing tabular data. The format is extremely simple and highly machine-readable.
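A minimal sketch of reading TSV with Python's standard `csv` module by setting the delimiter to a tab. The sample data is made up; it also shows that a value containing a comma needs no special quoting in this format.

```python
import csv
import io

# Made-up sample data: columns are separated by tab characters.
TSV_TEXT = "stop\tname\nA1\tMarket Square, North Side\n"

tsv_rows = list(csv.reader(io.StringIO(TSV_TEXT), delimiter="\t"))
print(tsv_rows)
```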
Governments and other organisations are said to be transparent when their workings and decision-making processes are well-understood, properly documented and open to scrutiny. Transparency is one of the aspects of open government. An increase in transparency is one of the benefits of open data.
Public transport routes, timetables and real time data are valuable but difficult candidates for open data. Even when they are published, data from different transit authorities and companies may not be available in compatible formats, making it difficult for third parties to provide integrated transport information. Many transport authorities distribute public transport data using the General Transit Feed Specification (GTFS), which was originally developed at Google. Work on standardisation and more open data is ongoing in the sector.
A visual representation of data is often the most compelling way of communicating the data, bringing out its key features, correlations and outliers. Though many tools exist, creating a visualisation for a dataset is not an automatic process, but requires careful attention to the meaning of the variables, the relations between them and the stories inherent in the data, to design a visual representation that lets the message of the data shine through.
An API that is designed to work over the Internet.
A proprietary spreadsheet format, the native format of the popular Microsoft Excel spreadsheet package. Older versions use .xls files, while more recent ones use the XML-based .xlsx variant.
Extensible Markup Language, a simple and powerful standard for representing structured data.