
    Baby steps to data publication

    John Kunze, Rachael Hu, Trisha Cruse, Catherine Mitchell,
    Stephen Abrams, Kirk Hastings, Lisa Schiff

    California Digital Library
    9 December 2010

    A new publishing paradigm is needed to cope with the
    deluge of data artifacts produced by data-intensive science, many of
    which are vital to data re-use and verification of published scientific
    conclusions. Due to the limitations of traditional publishing, most of
    these artifacts are not usually disseminated, cited, or preserved. These
    latent artifacts consist largely of datasets and data processing
    information that together form the foundations of the reasoned analyses
    that appear in the published literature. But this traditional record of
    science increasingly represents only the tip of the scientific iceberg.

    One promising approach to this problem of data invisibility is to wrap
    these artifacts in the metaphor of a “data paper”, a somewhat unfamiliar
    bundle of scholarly output with a familiar facade. As envisioned, a data
    paper minimally consists of a cover sheet and a set of links to archived
    artifacts. The cover sheet contains familiar elements such as title,
    authors, date, abstract, and persistent identifier (e.g., a DOI or ARK).
    These are just enough to expose the data to internet search engines for
    basic discovery, to build a basic data citation, to instill confidence
    in the identifier’s stability, and to be picked up by indexing services
    such as Google Scholar.
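
    As a minimal sketch of the structure described above, a data paper can
    be modeled as a cover sheet plus a list of links to archived artifacts.
    The class names, field names, and sample values below are illustrative
    assumptions rather than part of any proposed standard, and the citation
    layout is only one plausible arrangement of the cover-sheet elements.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class CoverSheet:
    """Minimal cover sheet: the familiar elements named in the text."""
    title: str
    authors: List[str]
    published: date
    abstract: str
    identifier: str  # persistent identifier, e.g. a DOI or ARK


@dataclass
class DataPaper:
    """A cover sheet plus links to the archived artifacts it describes."""
    cover: CoverSheet
    artifact_links: List[str] = field(default_factory=list)

    def citation(self) -> str:
        """Build a basic citation from cover-sheet elements only."""
        c = self.cover
        authors = "; ".join(c.authors)
        return f"{authors} ({c.published.year}). {c.title}. {c.identifier}"


# Hypothetical example values; the identifier and URLs are placeholders.
paper = DataPaper(
    cover=CoverSheet(
        title="Example stream temperature dataset",
        authors=["Doe, J.", "Roe, R."],
        published=date(2010, 12, 9),
        abstract="Hourly stream temperatures from a fictional field site.",
        identifier="ark:/99999/fk4example",
    ),
    artifact_links=[
        "https://archive.example.org/data/temps.csv",
        "https://archive.example.org/data/processing-notes.txt",
    ],
)

print(paper.citation())
```

    Keeping the citation a pure function of the cover sheet reflects the
    point that the same few elements serve both discovery and citation.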

    This simple format represents only the first stage of the evolution of
    the data paper. There is room for the format to increase in complexity
    with the incorporation of other valuable elements, both general-purpose
    and discipline-specific, to enrich discovery, re-use, and archiving. An
    exciting potential outcome of this development of the data paper as
    publication is the parallel emergence of a new kind of “data journal”.
    Like regular journals, data journals would spring up around disciplines
    and sub-disciplines as needed, and we could expect that some of them
    would also be peer-reviewed. The data journal is envisioned as an
    “overlay” journal: an editor would assemble an issue by selecting data
    papers from any number of sources and archives, and combining them with
    front matter, a table of contents, editorial policies, submission
    guidelines, etc.
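
    The overlay idea can be sketched in the same spirit: an issue holds
    only references to data papers that remain in their source archives,
    together with front matter. The names used here (DataPaperRef, Issue,
    the journal title, and the archive names) are hypothetical and serve
    only to illustrate the assembly step.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataPaperRef:
    """Pointer to a data paper that stays in its source archive."""
    title: str
    identifier: str  # persistent identifier of the data paper
    archive: str     # name of the archive that holds it


@dataclass
class Issue:
    """One overlay-journal issue: front matter plus selected references."""
    journal: str
    number: int
    front_matter: str
    papers: List[DataPaperRef] = field(default_factory=list)

    def add(self, ref: DataPaperRef) -> None:
        # Overlay: record a reference only; the paper itself is not copied.
        self.papers.append(ref)

    def table_of_contents(self) -> List[str]:
        """Number the selected papers in the order the editor added them."""
        return [
            f"{i}. {p.title} [{p.archive}] {p.identifier}"
            for i, p in enumerate(self.papers, start=1)
        ]


# Hypothetical issue drawing on two different archives.
issue = Issue(
    journal="Example Data Journal",
    number=1,
    front_matter="Editorial policies and submission guidelines.",
)
issue.add(DataPaperRef("Stream temperatures", "ark:/99999/fk4aaa", "Archive A"))
issue.add(DataPaperRef("Soil moisture survey", "ark:/99999/fk4bbb", "Archive B"))

for line in issue.table_of_contents():
    print(line)
```

    Because the issue stores identifiers rather than copies, the same data
    paper could appear in more than one journal without being duplicated.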

    This new data publishing paradigm promises to strengthen the scientific
    community practices of data sharing, re-use, and preservation. Scientists
    want to do science, get credit for it, communicate about it with their
    peers, and improve the measurable outputs by which their funders and
    employers evaluate their performance. The elements of the data paper
    create a recognizable and standardized form for previously unpublished
    data artifacts, making them easier to approach, evaluate, and
    automatically index for basic discovery purposes. Those same elements can
    easily be repurposed to create familiar-looking citations suitable for
    reference in CVs and all manner of publication. Finally, unique
    persistent identifiers for data papers and data artifacts greatly
    facilitate automatic discovery of a data paper’s impact and re-use.