Baby steps to data publication

John Kunze, Rachael Hu, Trisha Cruse, Catherine Mitchell,
Stephen Abrams, Kirk Hastings, Lisa Schiff

California Digital Library
9 December 2010

There is a need to establish a new publishing paradigm to cope with the
deluge of data artifacts produced by data-intensive science, many of
which are vital to data re-use and verification of published scientific
conclusions. Due to the limitations of traditional publishing, most of
these artifacts are not disseminated, cited, or preserved. These
latent artifacts consist largely of datasets and data processing
information that together form the foundations of the reasoned analyses
that appear in the published literature. But this traditional record of
science increasingly represents only the tip of the scientific iceberg.

One promising approach to this problem of data invisibility is to wrap
these artifacts in the metaphor of a “data paper”, a somewhat unfamiliar
bundle of scholarly output with a familiar facade. As envisioned, a data
paper minimally consists of a cover sheet and a set of links to archived
artifacts. The cover sheet contains familiar elements such as title,
authors, date, abstract, and persistent identifier (e.g., a DOI or ARK).
These are just enough to expose the data to internet search engines for
basic discovery, to build a basic data citation, to instill confidence in
the identifier’s stability, and to be picked up by indexing services such
as Google Scholar.
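
As a rough illustration, the minimal data paper described above could be
modeled as a small record. This is only a sketch under stated assumptions:
the class and field names (CoverSheet, DataPaper, artifact_links) are
hypothetical, and the sample identifier and URL are placeholders rather
than real assignments. No particular serialization format is implied.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CoverSheet:
    """Familiar descriptive elements of a data paper (hypothetical field names)."""
    title: str
    authors: List[str]
    date: str        # publication date, stored here as plain text
    abstract: str
    identifier: str  # persistent identifier, e.g. a DOI or ARK


@dataclass
class DataPaper:
    """A cover sheet plus a set of links to archived artifacts."""
    cover_sheet: CoverSheet
    artifact_links: List[str] = field(default_factory=list)

    def is_minimal(self) -> bool:
        # Assumed rule of thumb: a usable data paper has a title, at least
        # one author, an identifier, and at least one artifact link.
        cs = self.cover_sheet
        return bool(cs.title and cs.authors and cs.identifier and self.artifact_links)


# Placeholder values for illustration only.
paper = DataPaper(
    cover_sheet=CoverSheet(
        title="Sample stream temperature readings",
        authors=["Doe, J.", "Roe, R."],
        date="2010-12-09",
        abstract="Hourly readings from a hypothetical field site.",
        identifier="ark:/99999/fk4example",
    ),
    artifact_links=["https://example.org/archive/readings.csv"],
)
```

Keeping the cover sheet separate from the artifact links mirrors the
description above: the descriptive elements can be indexed and cited on
their own, while the artifacts stay in whatever archive holds them.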

This simple format represents only the first stage of the evolution of
the data paper. There is room for the format to increase in complexity
with the incorporation of other valuable elements, both general-purpose
and discipline-specific, to enrich discovery, re-use, and archiving. An
exciting potential outcome of this development of the data paper as
publication is the parallel emergence of a new kind of “data journal”.
Like regular journals, data journals would spring up around disciplines
and sub-disciplines as needed, and we could expect that some of them
would also be peer-reviewed. The data journal is envisioned as an
“overlay” journal; an editor would assemble an issue by selecting data
papers from any number of sources and archives, and combining them with
front matter, a table of contents, editorial policies, submission
guidelines, etc.
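
One way to picture the overlay idea is as a function that combines
editor-written front matter with references to data papers that remain in
their source archives. The sketch below is an assumption-laden
illustration: the function name assemble_issue, the dictionary keys, and
the sample identifiers are all hypothetical.

```python
from typing import Dict, List, Tuple


def assemble_issue(
    journal: str,
    front_matter: str,
    selected: List[Tuple[str, str]],
) -> Dict[str, object]:
    """Build an overlay issue from (persistent identifier, title) pairs.

    The data papers themselves are not copied; the issue only references
    them by identifier, which is what makes it an overlay.
    """
    toc = [
        f"{n}. {title} <{pid}>"
        for n, (pid, title) in enumerate(selected, start=1)
    ]
    return {
        "journal": journal,
        "front_matter": front_matter,
        "table_of_contents": toc,
    }


# Placeholder journal name and identifiers, for illustration only.
issue = assemble_issue(
    "Hypothetical Data Journal",
    "Editorial note for this issue.",
    [
        ("ark:/99999/fk4aaa", "Stream temperatures"),
        ("ark:/99999/fk4bbb", "Soil cores"),
    ],
)
```

Editorial policies and submission guidelines could be attached in the same
way as the front matter; they are left out here to keep the sketch short.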

This new data publishing paradigm promises to strengthen the scientific
community practices of data sharing, re-use, and preservation. Scientists
want to do science, get credit for it, communicate about it with their
peers, and improve the measurable outputs by which their funders and
employers evaluate their performance. The elements of the data paper
create a recognizable and standardized form for previously unpublished
data artifacts, making them easier to approach, evaluate, and
automatically index for basic discovery purposes. Those same elements can
easily be repurposed to create familiar-looking citations suitable for
reference in CVs and all manner of publication. Finally, unique
persistent identifiers for data papers and data artifacts greatly
facilitate automatic discovery of a data paper’s impact and re-use.
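
To make the repurposing of cover sheet elements concrete, the following
sketch builds a citation string from them. The function name
format_citation and the citation layout are assumptions chosen for
illustration, not a prescribed citation style, and the sample values are
placeholders.

```python
from typing import List


def format_citation(
    authors: List[str], year: str, title: str, identifier: str
) -> str:
    """Join cover sheet elements into a simple, familiar-looking citation.

    The "Authors (Year). Title. Identifier" layout is an assumed style.
    """
    author_text = "; ".join(authors)
    return f"{author_text} ({year}). {title}. {identifier}"


# Placeholder values for illustration only.
citation = format_citation(
    ["Doe, J.", "Roe, R."],
    "2010",
    "Sample stream temperature readings",
    "ark:/99999/fk4example",
)
```

Because the persistent identifier appears verbatim in the citation, a
service that scans publications for that string could, in principle, trace
where the data paper has been cited or re-used.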