OSIRIS corpus


The OSIRIS corpus is a set of MEDLINE abstracts manually annotated with human variation mentions. The corpus is distributed under the terms of the Creative Commons Attribution 3.0 Unported License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited (Furlong et al., BMC Bioinformatics 2008, 9:84).

The OSIRIS corpus can be used to assess the performance of both variation entity recognition and variation entity disambiguation to NCBI dbSNP identifiers.

For a detailed description of how the corpus was developed, see:

Furlong LI, Dach H, Hofmann-Apitius M, Sanz F. OSIRISv1.2: a named entity recognition system for sequence variants of genes in biomedical literature. BMC Bioinformatics 2008, 9:84.

What is a variation entity?

We use the term variation to refer to any kind of short-range change in the nucleotide sequence of the genome. SNPs are the most studied type of sequence variation, but the class also includes short insertions or deletions, named variations such as Alu sequences, and other types of variations collected in the dbSNP database. These variations can map to the exonic regions of genes, where they may produce a change at the protein level, or to introns, untranslated regions, or regions between genes. Some variations, such as non-synonymous SNPs, may alter protein function; others may alter processes related to the regulation of gene expression.

From the point of view of a Named Entity Recognition system, a variation entity is defined by the combination of tokens that specify the location of the variation in the sequence and the original and altered alleles. This information can be expressed in terms of the nucleotide sequence or the amino acid sequence. For instance, the term G894T can be interpreted in two ways: as a variation in the protein sequence, in which the glycine residue at position 894 of the protein changes to a threonine residue, or as a variation at the DNA level, in which the guanine at position 894 of the gene sequence changes to a thymine.
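The ambiguity of a mention such as G894T can be illustrated with a short Python sketch. It is not part of OSIRIS or of the corpus tooling. The function name `interpret`, the regular expression, and the restriction to mentions of the form letter-number-letter are assumptions made only for this example. The sketch splits a mention into original allele, position, and altered allele, and lists every reading that is consistent with the one-letter nucleotide and amino acid codes.

```python
import re

# One-letter codes. A, C, G and T are valid in both alphabets,
# which is the source of the ambiguity described above.
NUCLEOTIDES = {"A": "adenine", "C": "cytosine", "G": "guanine", "T": "thymine"}
AMINO_ACIDS = {
    "A": "alanine", "R": "arginine", "N": "asparagine", "D": "aspartic acid",
    "C": "cysteine", "E": "glutamic acid", "Q": "glutamine", "G": "glycine",
    "H": "histidine", "I": "isoleucine", "L": "leucine", "K": "lysine",
    "M": "methionine", "F": "phenylalanine", "P": "proline", "S": "serine",
    "T": "threonine", "W": "tryptophan", "Y": "tyrosine", "V": "valine",
}

# Hypothetical pattern: original allele, position, altered allele (e.g. G894T).
MENTION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def interpret(mention):
    """Return all (level, position, original, altered) readings of a mention."""
    match = MENTION.match(mention)
    if match is None:
        return []
    original, position, altered = match.group(1), int(match.group(2)), match.group(3)
    readings = []
    if original in NUCLEOTIDES and altered in NUCLEOTIDES:
        readings.append(("DNA", position, NUCLEOTIDES[original], NUCLEOTIDES[altered]))
    if original in AMINO_ACIDS and altered in AMINO_ACIDS:
        readings.append(("protein", position, AMINO_ACIDS[original], AMINO_ACIDS[altered]))
    return readings


for reading in interpret("G894T"):
    print(reading)
```

With this sketch, G894T yields both a DNA reading and a protein reading, while a mention whose letters are not nucleotide codes yields only the protein reading. Choosing between the readings requires context from the surrounding text, which is what the normalization to dbSNP identifiers in the corpus captures.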

Corpus statistics

 Number of articles: 105
 Number of articles with NCBI Gene annotations: 102
 Number of articles with NCBI dbSNP annotations: 55
 Number of variations normalized to NCBI dbSNP identifiers: 109
 Number of variations not normalized to NCBI dbSNP identifiers: 155
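A few derived figures may help when reading these counts. The Python sketch below only does arithmetic on the numbers listed above. It assumes that the normalized and non-normalized counts are disjoint and together cover all annotated variations; the variable names and the `share` helper are introduced for this example.

```python
# Counts copied from the corpus statistics above.
articles = 105
articles_with_gene = 102
articles_with_dbsnp = 55
normalized = 109
not_normalized = 155

# Assumption: the two variation counts do not overlap.
total_variations = normalized + not_normalized


def share(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)


print("Total annotated variations:", total_variations)
print("Normalized to dbSNP (%):", share(normalized, total_variations))
print("Articles with dbSNP annotations (%):", share(articles_with_dbsnp, articles))
print("Articles with Gene annotations (%):", share(articles_with_gene, articles))
```

Under that assumption, fewer than half of the annotated variations carry a dbSNP identifier, so an evaluation of disambiguation covers only that subset, whereas an evaluation of entity recognition can use all of them.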

Corpus format

The corpus is distributed in two formats: an XML file and a WordFreak format file. The XML file was edited with the Vex editor within the Eclipse platform. The corpus in WordFreak format contains a finer level of annotation of the variations: the location and alleles are annotated separately.

This format is suitable for machine-learning applications. For an example, see the use of this corpus to develop a Conditional Random Fields-based NER system for variations:

Klinger R, Friedrich CM, Mevissen HT, Fluck J, Hofmann-Apitius M, Furlong LI, Sanz F. Identifying gene-specific variations in biomedical text. J Bioinform Comput Biol. 2007 Dec;5(6):1277-96.



Comments and suggestions: Laura I. Furlong (lfurlong@imim.es)