Software

These software tools may be freely used for all non-commercial purposes.

Department of Computer Science, Georgia State University

Atlanta, GA 30303

alexz@gsu.edu http://www.cs.gsu.edu/~cscazz/

CliqueSNV

Quasispecies reconstruction from NGS reads

QUASIM

QUAsispecies SIMulator software for viral population evolution with immune response

IsoEM 1.1.4

The IsoEM package can be used to infer isoform and gene expression levels from high-throughput transcriptome sequencing (RNA-Seq) data

xpathway

A set of tools that compares metabolic pathway activity by analyzing the mapping of contigs assembled from RNA-Seq reads to KEGG pathways. The XPathway analysis of pathway activity is based on expectation maximization and topological properties of pathway graphs.


The different tools that constitute XPathway are:

1. KGMLPathway2Graph: Extraction tool of metabolic network

KGMLPathway2Graph extracts metabolic pathways from a KGML flat-file database. The Readme, examples, and software for KGMLPathway2Graph can be downloaded here.

2. Link Gopher 1.3.3: Mozilla Firefox add-on

Open the KEGG result page “pathway map”.
Use the filter “http://www.kegg.jp/kegg-bin/show_pathway?@” with Link Gopher to copy all green nodes per pathway. These are part of the pathway URLs. Save the output in a file.

3. Java code: to extract all green nodes

The code can be downloaded here along with the Readme and examples.

4. Python code: to compute pathway activity level and significance

The code can be downloaded here along with the Readme and examples.

5. Shell script: to download all KGML files from KEGG using wget. This is a one-time operation since the ko XML files do not change. The code is available here.

Infer Pathway activity level pipeline and Pathway significance pipeline

Use the steps provided in the Readme here Activity_level_and_Significance_Pipeline.

2SNV

Quasispecies reconstruction from long reads

Supplementary materials are available here

To run the tool, the Java Runtime Environment is required (http://java.com/en/download/index.jsp)

The software requires aligned reads in MSA format (FASTA reads padded so that all entries have the same length)
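The padding requirement can be checked before running the tool. The following is a minimal sketch in Python 3, not part of 2SNV; the helper names `read_fasta` and `is_padded_msa` are hypothetical and were chosen for this illustration only.

```python
def read_fasta(lines):
    """Parse FASTA text lines into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def is_padded_msa(records):
    """True if there is at least one record and all sequences share one length."""
    lengths = {len(seq) for _, seq in records}
    return len(lengths) == 1


# Two made-up aligned reads, both 7 columns wide.
example = [">read1", "ACGT--A", ">read2", "AC-TGGA"]
print(is_padded_msa(read_fasta(example)))
```

If the check fails, the reads most likely were not converted to MSA format yet.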

A pairwise-aligned SAM (BAM) file can be converted to MSA with the program b2w (part of the ShoRAH assembler)

The recommended aligner for long SMRT reads is BWA

for example:

bwa mem -k17 -W40 -r10 -A1 -B1 -O1 -E1 -L0 ref.fasta input.fasta > reads.sam

To compress, sort, and index SAM to BAM (BAI), install samtools

samtools view -b reads.sam > reads.bam
samtools sort reads.bam -o reads.sorted.bam -O bam
samtools index reads.sorted.bam

To run b2w it is not necessary to install the whole ShoRAH software. As an alternative, one can download the C code b2w.c from github and compile it with gcc or any other compatible C compiler

b2w -w 2300 -i 0 -x 100000 -o aligned.reads.fas reads.sorted.bam ref.fasta ref_name:0:2300

To run 2SNV, use the jar file

java -jar 2snv-1.0.jar aligned.reads.fas 1000 -t 30 -o haplotypes.fa

2SNV is available at

2SNV releases

For developers source code and instructions

2SNV on github

DATA

Reference flu1PB.fa

Clones in fasta format available clones.fa

Raw sequencing data have been submitted to the NIH Short Read Archive (SRA) under accession number: BioProject PRJNA284802

ScaffMatch

Scaffolding Algorithm Based on Maximum Weight Matching

ScaffMatch v0.9 Igor Mandric - Alex Zelikovsky

email: mandric.igor@gmail.com

1) Software Description:

ScaffMatch is a novel scaffolding tool based on Maximum-Weight Matching that produces high-quality scaffolds from NGS data (reads and contigs). The tool is written in Python 2.7. It also includes a bash script wrapper that calls the aligner in case one needs to first map reads to contigs (instead of providing .sam files).

The arguments accepted by ScaffMatch are:

-w) Working directory -- this is the directory where ScaffMatch files are stored. These are the .sam files produced after mapping reads to contigs and the resulting scaffolds fasta file `scaffolds.fa`;

-c) Contig fasta file;

-m) Flag that takes no value. Use it when .sam files are supplied instead of .fastq read files. Do not use this option if you provide read files;

-1) (Comma separated list of) either .fastq or .sam file(s) corresponding to the first read of the read pair;

-2) (Comma separated list of) either .fastq or .sam file(s) corresponding to the second read of the read pair;

-i) (Comma separated list of) insert size(s) of the library(-ies);

-s) (Comma separated list of) library(-ies) standard deviation(s) of insert size(s);

-t) Bundle threshold. Pairs of contigs supported by fewer read pairs than the value of this argument are discarded. Optional argument, by default it is equal to 5;

-g) Matching heuristics: use `max_weight` for Maximum Weight Matching heuristics with the Insertion step, use `backbone` for Maximum Weight Matching heuristics without the Insertion step, use `greedy` for Greedy Matching heuristics;

-l) Log file - where to store the logs. Optional argument. By default, stdout is used.
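To illustrate the effect of the bundle threshold (-t), here is a minimal Python 3 sketch. It is not ScaffMatch code: the function names and the contig-end labels such as `c1_tail` are hypothetical, and each link is assumed to be one read pair connecting two contig ends.

```python
from collections import Counter


def bundle_sizes(read_pair_links):
    """Count read pairs per unordered pair of contig ends."""
    return Counter(frozenset(link) for link in read_pair_links)


def filter_bundles(sizes, threshold=5):
    """Discard bundles supported by fewer than `threshold` read pairs."""
    return {pair: n for pair, n in sizes.items() if n >= threshold}


# Seven read pairs support c1_tail/c2_head, only three support c2_tail/c3_head.
links = (
    [("c1_tail", "c2_head")] * 6
    + [("c2_head", "c1_tail")]
    + [("c2_tail", "c3_head")] * 3
)
kept = filter_bundles(bundle_sizes(links), threshold=5)
for pair, n in kept.items():
    print(sorted(pair), n)
```

With the default threshold of 5, only the bundle with seven supporting read pairs survives.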

The scaffmatch.py Python script can be used directly when working with .sam files.

2) Requirements:

* Python 2.7.x
* Bash >= 4
* Networkx >= 1.7
* Numpy >= 1.6.2
* Bowtie2

3) Scaffolding algorithm:

3.1) Algorithm Overview:

ScaffMatch algorithm consists of the following main steps:

1. Mapping reads to contigs - optional.
2. Constructing the scaffolding graph.
3. Maximum Weight Matching step - producing the backbone scaffolds.
4. Insertion step - inserting singleton contigs into the backbone.
5. Writing the final scaffolds.fa file.

3.2) Algorithm step by step:

1. We use bowtie2 to map reads to contigs.

2. The scaffolding graph G = (V, E) is constructed as follows: each vertex of the scaffolding graph G corresponds to one of the contig strands and each inter-contig edge corresponds to a bundle of read pairs connecting two strands of different contigs. The weight of an inter-contig edge is equal to the size of the corresponding bundle. Also for each contig we have a dummy edge connecting its two strands.

3. In our interpretation, the Scaffolding Problem is reduced to the problem of finding the Maximum Weight Matching in the scaffolding graph G. We use either the well-known blossom algorithm (implemented in Networkx library) or a greedy O(N * log N) heuristic. After the matching is found, we obtain the so-called backbone scaffolds.

4. After the matching step, singletons are inserted into the backbone. This helps to increase the number of correct contig joins. The usefulness of this step is demonstrated in the corresponding publications of the authors.

5. We write the scaffolds as a .fasta file. The gaps are filled with 'N's.
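The matching step can be sketched with a generic greedy heuristic. The text above does not spell out the greedy variant ScaffMatch uses, so the Python 3 code below assumes the common one: sort inter-contig edges by weight and accept an edge when both of its endpoints are still unmatched. Sorting dominates the cost, which gives the O(N * log N) behaviour. The vertex labels are hypothetical contig ends.

```python
def greedy_matching(weighted_edges):
    """Greedy matching over (u, v, weight) edges.

    Edges are visited in order of decreasing weight; an edge is accepted
    only if neither endpoint is already matched.
    """
    matched = set()
    chosen = []
    for u, v, w in sorted(weighted_edges, key=lambda e: e[2], reverse=True):
        if u != v and u not in matched and v not in matched:
            matched.update((u, v))
            chosen.append((u, v, w))
    return chosen


# Each vertex is one end (strand) of a contig; weights are bundle sizes.
edges = [
    ("c1_tail", "c2_head", 12),
    ("c1_tail", "c3_head", 7),
    ("c2_tail", "c3_head", 9),
]
for u, v, w in greedy_matching(edges):
    print(u, v, w)
```

Here the edge of weight 7 is rejected because `c1_tail` is already matched, so the backbone order is c1, c2, c3. A greedy matching is not guaranteed to reach the maximum total weight; the blossom algorithm in Networkx is the exact alternative mentioned above.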

Download ScaffMatch-0.9.tar.gz

Pooling

Computational framework for next-generation sequencing of heterogeneous viral populations using combinatorial pooling

Source code for combinatorial pooling project:

1) Core pooling project: https://github.com/skumsp/Pooling

2) Maximum Likelihood k-Clustering of viral sequences: https://github.com/night-stalker/KGEM/tree/clustering

3) KEC viral NGS data processing (used as an auxiliary library in the core pooling project): https://github.com/skumsp/ErrorCorrection

The following external libraries are required:

1) Biojava (http://biojava.org/)

2) Commons-math (http://commons.apache.org/proper/commons-math/)

3) Commons-io (http://commons.apache.org/proper/commons-io/)

Experimental pooling NGS data sets used for the framework testing are available here:

http://alan.cs.gsu.edu/~skumsp/Pooling_experiments.rar

VGA

Viral Genome Assembler is a method for accurate assembly of a heterogeneous viral population coupled with a high-fidelity sequencing protocol able to eliminate errors from sequencing data.

kGEM

k-Genotype Expectation Maximization algorithm for Reconstructing a Viral population from Single-Amplicon reads

The kGEM tool finds haplotypes in single-amplicon sequencing data. It requires aligned reads in a special internal format; the auxiliary program B2W can convert reads into this format from either FASTA (unaligned) format or SAM (pairwise alignment) format.

To run kGEM, the Java Runtime Environment is required (http://java.com/en/download/index.jsp)

Download kGEM

After the aligned_reads.fas file is obtained, run KGEM using the following command:

java -jar <path_to_KGEM-v.jar> <path_to_reads>/aligned_reads.fas <k> -o <output_directory>

where <k> is the number of initial haplotypes used for estimation, a positive integer. It should be higher than the actual number of haplotypes in the population; for clustering, <k> can be reduced.

aligned_reads.fas contains the reads obtained in the previous step, and <output_directory> (default: the current directory) will contain two files after the program finishes. The file haplotypes.fa contains the haplotypes in FASTA format with their frequencies in the description. For example,

>read1_0.38

ACTGGAA......

means that this haplotype has a frequency of 38%.

The second file, reads.fa, contains the same haplotypes, but instead of recording frequencies in the description the program copies each haplotype in proportion to its frequency. This file contains the same number of entries as the initial file with reads.
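The frequency can be read back from each description line. The Python 3 sketch below assumes that every header has the form name_frequency, as in the `>read1_0.38` example; the function name `parse_haplotypes` and the second record in the sample are hypothetical.

```python
def parse_haplotypes(lines):
    """Return (name, frequency, sequence) tuples from kGEM-style FASTA lines.

    Assumes each header looks like '>name_frequency'; a header without
    an underscore raises ValueError.
    """
    haplotypes = []
    name, freq, chunks = None, None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank lines between records are ignored
        if line.startswith(">"):
            if name is not None:
                haplotypes.append((name, freq, "".join(chunks)))
            name, freq_text = line[1:].rsplit("_", 1)
            freq, chunks = float(freq_text), []
        else:
            chunks.append(line)
    if name is not None:
        haplotypes.append((name, freq, "".join(chunks)))
    return haplotypes


sample = [">read1_0.38", "", "ACTGGAA", ">read7_0.62", "ACTG-AA"]
for name, freq, seq in parse_haplotypes(sample):
    print(name, freq, seq)
```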

Note: the result files reads.fa and haplotypes.fa may contain dashes '-' that were used for alignment. To get the pure sequences, clean the file in any text editor by replacing all '-' with '', or on Linux machines with the command:

sed -e 's/\-//g' haplotypes.fa > haplotypes_cleaned.fa
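Note that the sed command removes dashes from every line, including the description lines. Where headers may themselves contain a dash, a variant that touches only sequence lines can be sketched in Python 3 as follows; the function name is hypothetical and this is not part of kGEM.

```python
def strip_alignment_dashes(lines):
    """Remove '-' from sequence lines; leave FASTA header lines untouched."""
    cleaned = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            cleaned.append(line)
        else:
            cleaned.append(line.replace("-", ""))
    return cleaned


# The header keeps its dash, the sequence loses its gap characters.
print(strip_alignment_dashes([">hap-1_0.38", "AC-TG--A"]))
```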

Example

Assuming ERIF.jar, KGEM.jar, sample_data.fa, and reference.fa are in the current directory, first run the following command:

java -jar ERIF.jar -g reference.fa -i sample_data.fa -o test_

Alternatively, you can use a SAM file (reads.sam) instead of fasta:

java -jar ERIF.jar -g reference.fa -sam reads.sam -o test_

After that, the output file test_reads.sam_ext.txt will appear in this folder.

Run the next command:

java -jar KGEM-0.3.1.jar test_reads.sam_ext.txt 100

After kGEM completes, two files will appear in the current directory: haplotypes.fa and reads.fa

Linux users can clean dashes from the output with the following command:

sed -e 's/\-//g' haplotypes.fa > haplotypes_cleaned.fa

As a result, the haplotypes with their frequencies will be stored in the haplotypes_cleaned.fa file.

For developers:

The source code is available in the git repository KGEM_on_github.

The programming language is Scala; Maven is required for compilation.

Download and install Maven 2 or 3.
Download the sources from the github repository.
From the folder where the sources are placed, run:
mvn clean package
Note: you can also download and build the jar from the Maven repository directly:

mvn org.apache.maven.plugins:maven-dependency-plugin:2.4:get -DremoteRepositories=https://raw.github.org/night-stalker/KGEM -Dartifact=kgem:kgem:0.3.1

ERIF is currently not available from Maven directly!

For developers using Maven, a kgem repository is also available. To use it inside a Maven project, the following configuration is necessary:

In pom.xml, add to the repositories tag:

<repository>
<id>kgem</id>
<name>KGEM repository</name>
<url>https://github.com/night-stalker/KGEM</url>
</repository>

and to the dependencies tag:

<dependency>
<groupId>kgem</groupId>
<artifactId>kgem</artifactId>
<version>$version</version>
</dependency>

VirA/VirA-MCF

Reconstruct viral quasispecies and estimate their frequencies from amplicon reads.

IsoEM

Infers isoform and gene expression levels from high-throughput transcriptome sequencing (RNA-Seq) data.

KEC

Pyrosequencing error correction algorithm

11/28/2012 A new version of KEC is available. An algorithm that finds the error threshold by fitting a Poisson distribution to the k-counts distribution was added. Special thanks to Bram Vrancken and Alex Artyomenko for their help

02/27/2013 A new version of KEC is available. The user interface was updated and cross-platform functionality was added. Special thanks to Alex Artyomenko

04/12/2013 A new version of KEC is available. An option to use Muscle instead of Clustal for the additional correction procedure was added. Special thanks to Alex Artyomenko

KEC is distributed under the GNU General Public License (http://www.gnu.org/copyleft/gpl.html)

Running instructions for KEC

• Download the Java archive KEC.jar from KEC

• Download the implementation of the adaptive mean shift based clustering algorithm from http://coewww.rutgers.edu/riul/research/code/AMS/fams_pc.zip and create a folder named “fams” in the same folder as ErrorCorrection.jar. Put the executable file “fams.exe” in the folder “fams”

• Download ClustalW2 from http://ftp.ebi.ac.uk/pub/software/clustalw2/ and create a folder named “ClustalW2” in the same folder as ErrorCorrection.jar. Put the executable file “clustalw2.exe” in the folder “ClustalW2”

• Download Muscle from http://www.drive5.com/muscle/ and create a folder named “Muscle” in the same folder as ErrorCorrection.jar. Put the executable file “muscle.exe” in the folder “Muscle”

• Download the archive lib.rar from http://alan.cs.gsu.edu/~skumsp/lib.rar and extract it into the same folder as ErrorCorrection.jar

KEC running parameters:

java -jar ErrorCorrection.jar [-h] [-k k] [-i i] [-cl | -mus] [-l l] [-dg dg] [-dpp dpp] filename

Here

filename is the name of the file containing the reads to be corrected;
k is the size of k-mers. Default: k = 25
i is the number of iterations of the algorithm. Default: i = 3
-cl enables ClustalW for multiple and pairwise sequence alignment in the additional correction procedure. Default: do not align
-mus enables Muscle for multiple and pairwise sequence alignment in the additional correction procedure. Default: do not align
l controls how the error threshold is found. If l = 0, the algorithm based on fitting a Poisson distribution to the k-counts distribution is used. If l > 0, a region of l consecutive zeros in the k-counts distribution is used to find the error threshold. Default: l = 0
dg is the parameter for haplotype postprocessing using multiple alignment (see parameter alpha, Algorithm 2, step 3). Default: dg = 30
dpp is the parameter for haplotype postprocessing using pairwise alignment of neighbor leaves of the neighbor-joining tree (see parameter alpha, Algorithm 2, step 4). Default: dpp = 30
-h prints help
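The l > 0 rule can be illustrated with a small Python 3 sketch. This is one reading of the description above, not the KEC implementation: it builds the k-counts distribution (how many distinct k-mers occur once, twice, and so on) and reports where the first run of l consecutive zeros begins. Whether KEC takes the start or the end of that region as the threshold is not stated here, and the function names are hypothetical.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer occurring in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def zero_run_threshold(counts, l):
    """Return the first count value that starts a run of l consecutive zeros
    in the k-counts distribution, or None if no such run exists."""
    if l <= 0:
        raise ValueError("l must be positive; l = 0 selects the Poisson fit")
    histogram = Counter(counts.values())  # count value -> number of k-mers
    max_count = max(histogram, default=0)
    run = 0
    for value in range(1, max_count + 1):
        run = run + 1 if histogram[value] == 0 else 0
        if run == l:
            return value - l + 1
    return None


# Made-up counts: rare k-mers (1-2 occurrences) and frequent ones (10-11).
toy = Counter({"AAA": 1, "AAC": 1, "ACG": 2, "CGT": 10, "GTT": 11})
print(zero_run_threshold(toy, 3))
```

In the toy data no k-mer occurs 3 to 9 times, so with l = 3 the gap starts at count 3, and k-mers seen fewer than 3 times would be the candidates for errors under this reading.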

Examples:

java -jar ErrorCorrection.jar -k 25 -i 3 -cl -l 25 test.fas

java -jar ErrorCorrection.jar test.fas

java -jar ErrorCorrection.jar -mus -l 1 -dg 15 -dpp 15 test.fas

java -jar ErrorCorrection.jar -h

The output contains several files. The most important are:

1) filename_corrected.fas_corrected.fas - corrected reads

2) filename_corrected.fas_haplotypes.fas - haplotypes found after the first stage of the algorithm (without the alignment stage)

3) filename_corrected.fas_haplotypes.fas_postprocessed.fas_RevComp.fas_PostprocPair.fas_postprocessed.fas_PostprocPair.fas - haplotypes found after the second stage of the algorithm using alignment (available only when alignment is enabled with -cl or -mus)

Data sets

Data sets used in the paper are available at

1) sequencing results (fasta files, sff files)

2) haplotypes found by KEC

3) HVR1 clones used to create data sets (original and reverse complemented sequences)

Running instructions for ET

Will be here soon

References

P. Skums, Z. Dimitrova, D. S. Campo, G. Vaughan, L. Rossi, J. C. Forbi, J. Yokosawa, A. Zelikovsky, Y. Khudyakov, “Efficient error correction for next-generation sequencing of viral amplicons,” BMC Bioinformatics 13 (Suppl 10): S6, 2012.

MaLTA

Transcriptome assembly and quantification from RNA-Seq reads

MaLTA is a method for simultaneous transcriptome assembly and quantification from Ion Torrent RNA-Seq data. The approach explores transcriptome structure and incorporates a maximum likelihood model into the assembly and quantification procedure. A new version of the IsoEM algorithm suitable for Ion Torrent RNA-Seq reads is used to accurately estimate transcript expression levels. Experimental results on both synthetic and real datasets show that Ion Torrent RNA-Seq data can be successfully used for transcriptome analyses, and suggest increased transcriptome assembly and quantification accuracy of the MaLTA-IsoEM solution compared to existing state-of-the-art approaches. For details see our BMC Genomics paper.

The MaLTA software tool is retired and no longer available. Please use the IsoEM2 software instead.

Contact Information:

serghei@cs.ucla.edu

sahar@engr.uconn.edu

adrian.caciula@gmail.com

ion@engr.uconn.edu

alexz@cs.gsu.edu

ViSpA

Viral Spectrum Assembler: assembles viral variants and estimates their frequencies from reads.

Introduction

ViSpA implements novel methods for viral assembly and frequency estimation. The software uses simple error correction, assembly of viral variants based on maximum-bandwidth paths in weighted read graphs, and frequency estimation via Expectation Maximization on all reads. Experiments show that ViSpA is better at quasispecies assembly than the state-of-the-art method ShoRAH.
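For readers unfamiliar with maximum-bandwidth paths: such a path maximizes the smallest edge weight along the way. The Python 3 sketch below is a generic widest-path search (a Dijkstra variant) on a small made-up graph. It is not ViSpA code, and how ViSpA builds and weights its read graph is not described here; the node names and weights are hypothetical.

```python
import heapq


def widest_path(graph, source, target):
    """Maximum-bandwidth path in a directed graph with positive weights.

    graph maps node -> {neighbor: weight}. Returns (bandwidth, path),
    or (0, []) when the target cannot be reached.
    """
    best = {source: float("inf")}  # best bottleneck found so far per node
    parent = {}
    heap = [(-float("inf"), source)]  # max-heap via negated bandwidth
    done = set()
    while heap:
        neg_width, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node == target:
            break
        for nxt, weight in graph.get(node, {}).items():
            width = min(-neg_width, weight)
            if width > best.get(nxt, 0):
                best[nxt] = width
                parent[nxt] = node
                heapq.heappush(heap, (-width, nxt))
    if target not in best:
        return 0, []
    path = [target]
    while path[-1] != source:
        path.append(parent[path[-1]])
    return best[target], path[::-1]


toy_graph = {"s": {"a": 5, "b": 3}, "a": {"t": 2, "b": 4}, "b": {"t": 4}}
print(widest_path(toy_graph, "s", "t"))
```

In the toy graph the route s, a, b, t has bottleneck 4, which beats the direct alternatives s, a, t (bottleneck 2) and s, b, t (bottleneck 3).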

ViSpA source code

The software is written in Java; the EM part is written in Python, and the wrapper script is written in bash.

ViSpA can be downloaded here (vispa02.zip). See the readme.txt file for installation instructions.

Contacts

Alexander Zelikovsky

Department of Computer Science

Georgia State University

34 Peachtree Str., 1443

Atlanta, GA 30303

Phone: (404) 413-5730

Fax: (404) 413-5717

Email: alexz@cs.gsu.edu

Web: http://www.cs.gsu.edu/~cscazz/

Irina Astrovskaya

Department of Computer Science

Georgia State University

34 Peachtree Str., 1415

Atlanta, GA 30303

Email: iraa@cs.gsu.edu

Bassam Tork

Department of Computer Science

Georgia State University

34 Peachtree Str., 1415

Atlanta, GA 30303

Email: btork1@cs.gsu.edu

Related Publications

Astrovskaya, I., Tork, B., Mangul, S., Westbrooks, K., Mandoiu, I., Balfe, P., and Zelikovsky, A., Inferring Viral Spectrum from 454 Pyrosequencing Reads, 1st Annual RECOMB Satellite Workshop on Massively Parallel Sequencing (RECOMBseqCCB2011), BMC Bioinformatics (to appear)

(sup. materials)

Astrovskaya, Irina A., "Inferring Genomic Sequences" (2011). Computer Science Dissertations. Paper 59. http://digitalarchive.gsu.edu/cs_diss/59

Westbrooks, K., Astrovskaya, I., Rendon, D. C., Khudyakov, Y., Berman, P., and Zelikovsky, A., HCV Quasispecies Assembly using Network Flows, Proc. of Fourth International Symposium on Bioinformatics Research and Applications (ISBRA 2008), Lecture Notes in Computer Science vol. 4983, pp. 159-170.