Manual for the 'DDBJ Read Annotation Pipeline'



Version 1.0.0
2012/6/26
DNA Data Bank of Japan
 
There is currently no restriction on the number of jobs (since 2012/6/25).
Please check the major changes to the DDBJ pipeline on the new NIG supercomputer.

Overview

The DDBJ Read Annotation Pipeline provides high-throughput annotation of raw reads from next-generation sequencers (NGS) that are registered in the DDBJ Read Archive (DRA). The pipeline consists of two processes: a basic process for genome mapping and de novo assembly, and a high-level process for structural and functional annotation, such as single nucleotide polymorphism (SNP) detection and expression tag counting.

 
The pipeline has three distinct features. First, analytical results can be submitted easily to DDBJ databases through a streamlined process: mapping outputs are converted to DRA formats, tag counting outputs are converted to DOR formats, and the results of assembly and annotation are converted to DDBJ-based International Nucleotide Sequence Database (INSD) formats. Second, a web-based graphical user interface enables biologists without high-level bioinformatics expertise to analyse large amounts of raw sequencing data. Third, the use of cluster computing systems and large-memory computers in the DDBJ infrastructure allows for high throughput.
For the basic process, we installed popular mapping and assembly tools, including bwa, velvet and others. For the high-level process, analytical tools for SNP detection have been implemented in the current pipeline system. Other annotation tools will be implemented in a future version.
 
 

Advantages of the proposed pipeline

Unlike many other NGS pipelines, our pipeline focuses on the following two features.
1. Generating analyzed sequences in file formats suitable for DDBJ sequence registration
2. Evaluating each pipeline analysis with statistical parameters

The first feature helps users register analyzed sequences with DDBJ. The second feature gives
users quantitative figures that can be used in research presentations.

Table 1: File format of analyzed results

analytical stage          outputs    file format
basic process (mapping)   alignment  DRA annotation format
basic process (assembly)  contig     DDBJ WGS format

Table 2: Evaluation parameters 

analytical stage    target             evaluation parameters  availability
basic process       original reads     quality score          x
basic process       mapped reads       coverage               x
basic process       mapped reads       depth                  x
basic process       mapped reads       error rate             x
basic process       mapped reads       mapped ratio           x
basic process       assembled contigs  maximum contig size
basic process       assembled contigs  N50 contig size
high-level process  mapped reads       SNP                    x
high-level process  mapped reads       short indel            x
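
To make some of the evaluation parameters in Table 2 concrete, the following is a minimal Python sketch of how maximum contig size, N50 contig size, coverage and depth are commonly computed. It is not the pipeline's own implementation. The function names (n50, coverage_and_depth) and the input layout (a list of contig lengths, a list of per-base read depths) are hypothetical, and the definitions used here are assumptions: coverage is taken as the fraction of reference positions covered by at least one read, and depth as the mean number of reads per reference position.

```python
def n50(contig_lengths):
    """Return the N50 contig size.

    Assumed definition: the largest length L such that contigs of
    length >= L together contain at least half of all assembled bases.
    """
    if not contig_lengths:
        raise ValueError("no contigs given")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down, accumulating bases.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def coverage_and_depth(per_base_depth):
    """Return (coverage, mean depth) for a reference sequence.

    per_base_depth holds, for each reference position, the number of
    mapped reads overlapping that position (hypothetical input layout).
    """
    if not per_base_depth:
        raise ValueError("empty reference")
    covered = sum(1 for d in per_base_depth if d > 0)
    coverage = covered / len(per_base_depth)
    mean_depth = sum(per_base_depth) / len(per_base_depth)
    return coverage, mean_depth


if __name__ == "__main__":
    contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]  # toy contig lengths
    print("maximum contig size:", max(contigs))
    print("N50 contig size:", n50(contigs))
    print("coverage, depth:", coverage_and_depth([0, 2, 4, 0, 2]))
```

With real data, the contig lengths would come from the assembled contigs and the per-base depths from the mapped reads; other definitions of coverage and depth are also in use, so values may differ from those reported by the pipeline.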


Publication

Kaminuma, E., Mashima, J., Kodama, Y., Gojobori, T., Ogasawara, O., Okubo, K., Takagi, T., Nakamura, Y. (2010). DDBJ launches a new archive database with analytical tools for next-generation sequence data. Nucleic Acids Res. 38, D33-D38.