These software tools may be freely used for all non-commercial purposes.
© Copyright, 2020 by Professor Alexander Zelikovsky
Department of Computer Science, Georgia State University
Atlanta, GA 30303
alexz@gsu.edu http://www.cs.gsu.edu/~cscazz/
Quasispecies reconstruction from NGS reads
QUAsispecies SIMulator software for viral population evolution with immune response
The IsoEM package can be used to infer isoform and gene expression levels from high-throughput transcriptome sequencing (RNA-Seq) data
A set of tools that compares metabolic pathway activity by analyzing the mapping of contigs assembled from RNA-Seq reads to KEGG pathways. The XPathway analysis of pathway activity is based on expectation maximization and topological properties of pathway graphs.
XPathway is a set of tools that compares pathway activity by analyzing the mapping of contigs assembled from RNA-Seq reads to KEGG pathways. The XPathway analysis of pathway activity is based on expectation maximization and topological properties of pathway graphs.
The different tools that constitute XPathway are:
1. KGMLPathway2Graph: Tool for extracting metabolic networks
KGMLPathway2Graph extracts metabolic pathways from the KGML flat-file database. Readme, examples and software for KGMLPathway2Graph can be downloaded here.
2. Link Gopher 1.3.3: Mozilla Firefox add-on
Open the KEGG result page “pathway map”.
Use the filter “http://www.kegg.jp/kegg-bin/show_pathway?@” with Link Gopher to copy all green nodes per pathway. These are part of the pathway URLs. Save the output in a file.
3. Java code: To extract all green nodes
The code can be downloaded here along with Readme and examples.
4. Python code: To compute pathway activity level and significance.
The code can be downloaded here along with Readme and examples.
5. Shell script: To download all KGML files from KEGG using wget. This is a one-time operation since the ko XML files do not change. Code is available here
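The one-time download in step 5 can also be sketched in Python. This is only an illustration, not the shell script distributed with XPathway: the URL pattern is an assumption based on the public KEGG REST interface, and the function names and the output folder are hypothetical.

```python
import urllib.request
from pathlib import Path

# Assumption: the public KEGG REST endpoint for KGML files.
# The wget script shipped with XPathway may use a different URL.
KEGG_KGML_URL = "https://rest.kegg.jp/get/{pathway_id}/kgml"


def kgml_url(pathway_id):
    """Build the download URL for one pathway, e.g. 'ko00010'."""
    return KEGG_KGML_URL.format(pathway_id=pathway_id)


def download_kgml(pathway_ids, out_dir="kgml"):
    """Fetch each KGML file once; files already on disk are not fetched again."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for pathway_id in pathway_ids:
        target = out / (pathway_id + ".xml")
        if not target.exists():
            urllib.request.urlretrieve(kgml_url(pathway_id), str(target))
        saved.append(target)
    return saved
```

Calling download_kgml(["ko00010", "ko00020"]) would store ko00010.xml and ko00020.xml in the kgml folder and skip both files on later runs.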
Infer Pathway activity level pipeline and Pathway significance pipeline
Use the steps provided in the Readme here Activity_level_and_Significance_Pipeline.
Quasispecies reconstruction from long reads
Supplementary materials are available here
To run the tool, the Java Runtime Environment is required (http://java.com/en/download/index.jsp)
The software requires aligned reads in MSA format (FASTA reads padded so that all entries have the same length)
A pairwise-aligned SAM (BAM) file can be converted to MSA with the program b2w (part of the ShoRAH assembler)
The recommended aligner for long SMRT reads is BWA
for example:
bwa mem -k17 -W40 -r10 -A1 -B1 -O1 -E1 -L0 input.fasta > output.sam
To compress, sort and index SAM to BAM (BAI), install samtools
samtools view -b reads.sam > reads.bam
samtools sort reads.bam -o reads.sorted.bam -O bam
samtools index reads.sorted.bam
To run b2w it is not necessary to install the whole ShoRAH software; as an alternative, one can download the C code b2w.c from GitHub and compile it with gcc or any other compatible C compiler
b2w -w 2300 -i 0 -x 100000 -o aligned.reads.fas reads.sorted.bam ref.fasta ref_name:0:2300
To run 2SNV, use the jar file
java -jar 2snv-1.0.jar aligned.reads.fas 1000 -t 30 -o haplotypes.fa
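Before running the jar, it can help to confirm that the input really is in the padded MSA format described above. The following is a minimal sketch of such a check; the function names are hypothetical and the check is not part of 2SNV.

```python
def parse_fasta(lines):
    """Return a list of (header, sequence) pairs from FASTA-formatted lines."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def is_padded_msa(records):
    """True when there is at least one record and all sequences share one length."""
    lengths = {len(sequence) for _, sequence in records}
    return len(lengths) == 1
```

For a file on disk, pass the open file handle: with open("aligned.reads.fas") as handle, call is_padded_msa(parse_fasta(handle)).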
2SNV is available at
For developers source code and instructions
DATA
Reference flu1PB.fa
Clones in fasta format available clones.fa
Raw sequencing data have been submitted to the NIH Short Read Archive (SRA) under accession number: BioProject PRJNA284802
Scaffolding Algorithm Based on Maximum Weight Matching
ScaffMatch v0.9 Igor Mandric - Alex Zelikovsky
email: mandric.igor@gmail.com
1) Software Description:
ScaffMatch is a novel scaffolding tool based on Maximum-Weight Matching able to produce high-quality scaffolds from NGS data (reads and contigs). The tool is written in Python 2.7. It also includes a bash script wrapper that calls aligner in case one needs to first map reads to contigs (instead of providing .sam files).
The arguments accepted by ScaffMatch are:
-w) Working directory -- this is the directory where ScaffMatch files are stored. These are the .sam files produced after mapping reads to contigs and the resulting `scaffolds.fa` fasta file;
-c) Contig fasta file;
-m) Flag that takes no value. Use it when .sam files are supplied instead of .fastq read files. Do not use this option if you provide read files;
-1) (Comma separated list of) either .fastq or .sam file(s) corresponding to the first read of the read pair;
-2) (Comma separated list of) either .fastq or .sam file(s) corresponding to the second read of the read pair;
-i) (Comma separated list of) insert size(s) of the library(-ies);
-s) (Comma separated list of) library(-ies) standard deviation(s) of insert size(s);
-t) Bundle threshold. Pairs of contigs supported by fewer read pairs than the value of this argument are discarded. Optional argument, by default it is equal to 5;
-g) Matching heuristics: use `max_weight` for Maximum Weight Matching heuristics with the Insertion step, use `backbone` for Maximum Weight Matching heuristics without the Insertion step, use `greedy` for Greedy Matching heuristics;
-l) Log file - where to store the logs. Optional argument. By default, stdout is used.
One can use the scaffmatch.py Python script directly when using .sam files.
2) Requirements:
* Python 2.7.x
* Bash >= 4
* Networkx >= 1.7
* Numpy >= 1.6.2
* Bowtie2
3) Scaffolding algorithm:
3.1) Algorithm Overview:
ScaffMatch algorithm consists of the following main steps:
1. Mapping reads to contigs - optional.
2. Constructing the scaffolding graph.
3. Maximum Weight Matching step - producing the backbone scaffolds.
4. Insertion step - inserting singleton contigs into the backbone.
5. Writing the final scaffolds.fa file.
3.2) Algorithm step by step:
1. We use bowtie2 to map reads to contigs.
2. The scaffolding graph G = (V, E) is constructed as follows: each vertex of the scaffolding graph G corresponds to one of the contig strands and each inter-contig edge corresponds to a bundle of read pairs connecting two strands of different contigs. The weight of an inter-contig edge is equal to the size of the corresponding bundle. Also for each contig we have a dummy edge connecting its two strands.
3. In our interpretation, the Scaffolding Problem is reduced to the problem of finding the Maximum Weight Matching in the scaffolding graph G. We use either the well-known blossom algorithm (implemented in the Networkx library) or a greedy O(N * log N) heuristic. After the matching is found, we obtain the so-called backbone scaffolds.
4. After the matching step, singletons are inserted into the backbone. This helps to increase the number of correct contig joins. The usefulness of this step is demonstrated in the corresponding publications of the authors.
5. We write the scaffolds as a .fasta file. The gaps are filled with 'N's.
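Steps 2 and 3 can be illustrated with a small sketch of a greedy matching over inter-contig edges. This is not the ScaffMatch source code: the encoding of a strand as a (contig, strand) tuple, the input dictionary and the function name are assumptions made for the example, and ties between equal weights are resolved by input order.

```python
def greedy_matching(bundles, threshold=5):
    """Greedy matching over inter-contig edges.

    bundles maps a pair of strands (u, v) to the number of read pairs
    connecting them; each strand is a hypothetical (contig, strand) tuple.
    """
    # Keep only edges between different contigs whose bundle size
    # reaches the bundle threshold (the -t argument, 5 by default).
    edges = [
        (weight, u, v)
        for (u, v), weight in bundles.items()
        if weight >= threshold and u[0] != v[0]
    ]
    # Sorting dominates the running time: O(N log N) for N edges.
    edges.sort(key=lambda edge: edge[0], reverse=True)

    matched = set()
    matching = []
    for weight, u, v in edges:
        # Accept the heaviest remaining edge whose endpoints are both free.
        if u not in matched and v not in matched:
            matching.append((u, v, weight))
            matched.update((u, v))
    return matching
```

Each accepted edge joins two contig strands; following accepted edges together with the dummy edge between the two strands of every contig gives chains of contigs, which correspond to the backbone scaffolds. For an exact maximum weight matching, the Networkx function max_weight_matching could be used in place of the greedy loop.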
Download ScaffMatch-0.9.tar.gz
Computational framework for next-generation sequencing of heterogeneous viral populations using combinatorial pooling
Source code for combinatorial pooling project:
1) Core pooling project: https://github.com/skumsp/Pooling
2) Maximum Likelihood k-Clustering of viral sequences: https://github.com/night-stalker/KGEM/tree/clustering
3) KEC viral NGS data processing (used as an auxiliary library in the core pooling project): https://github.com/skumsp/ErrorCorrection
The following external libraries are required:
1) Biojava (http://biojava.org/)
2) Commons-math (http://commons.apache.org/proper/commons-math/)
3) Commons-io (http://commons.apache.org/proper/commons-io/)
Experimental pooling NGS data sets used for the framework testing are available here:
Viral Genome Assembler is a method for accurate assembly of a heterogeneous viral population coupled with a high-fidelity sequencing protocol able to eliminate errors from sequencing data.
k-Genotype Expectation Maximization algorithm for Reconstructing a Viral population from Single-Amplicon reads
The kGEM tool finds haplotypes in single-amplicon sequencing data. It requires aligned reads in a special internal format; the auxiliary program B2W can convert reads to this format from either fasta (unaligned) format or SAM (pairwise alignment) format.
To run kGEM, the Java Runtime Environment is required (http://java.com/en/download/index.jsp)
Download kGEM
Once the aligned_reads.fas file has been obtained, run kGEM using the following command:
java -jar <path_to_KGEM-v.jar> <path_to_reads>/aligned_reads.fas <k> -o <output_directory>
where <k> is the number of initial haplotypes used for estimation, a positive integer. It should be higher than the actual number of haplotypes in the population; for clustering, <k> can be reduced.
aligned_reads.fas contains the reads obtained in the previous step, and <output_directory> (default: current) will contain two files after the program finishes. The file haplotypes.fa will contain the haplotypes in fasta format with their frequencies in the description (example:
>read1_0.38
ACTGGAA......
means that this haplotype has a frequency of 38%).
The second file will contain the same haplotypes, but instead of writing frequencies in the description the program copies each haplotype in proportion to its frequency. This file will contain the same number of entries as the initial file with reads.
Note: the result files reads.fa and haplotypes.fa may contain dashes '-' that were used for alignment. To get pure sequences, clean the file in any text editor by replacing all '-' with '', or on Linux machines with the command:
sed -e 's/\-//g' haplotypes.fa > haplotypes_cleaned.fa
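As an alternative to the sed command, the output can be read in Python. The sketch below collects each haplotype with its frequency and removes dashes from sequence lines only, so description lines are left untouched. It assumes the frequency is the text after the last underscore of the description, as in the example ">read1_0.38"; the function name is hypothetical.

```python
def parse_haplotypes(lines):
    """Return (name, frequency, ungapped_sequence) tuples from haplotypes.fa lines."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the frequency follows the last underscore,
            # as in ">read1_0.38". Other layouts raise ValueError.
            name, _, frequency = line[1:].rpartition("_")
            records.append([name, float(frequency), ""])
        elif records:
            # Remove alignment gaps from the sequence, never from the header.
            records[-1][2] += line.replace("-", "")
    return [tuple(record) for record in records]
```

For a file on disk, pass the open handle: with open("haplotypes.fa") as handle, call parse_haplotypes(handle).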
Assume ERIF.jar, KGEM.jar, sample_data.fa and reference.fa are in the current directory. First run the following command:
java -jar ERIF.jar -g reference.fa -i sample_data.fa -o test_
Alternatively, you can use a SAM file (reads.sam) instead of fasta:
java -jar ERIF.jar -g reference.fa -sam reads.sam -o test_
The output file test_reads.sam_ext.txt will then appear in this folder.
Run the next command:
java -jar KGEM-0.3.1.jar test_reads.sam_ext.txt 100
After kGEM completes, two files will appear in the current directory: haplotypes.fa and reads.fa
On Linux, the following command removes the dashes from the output:
sed -e 's/\-//g' haplotypes.fa > haplotypes_cleaned.fa
As a result, the haplotypes with their frequencies will be stored in the haplotypes_cleaned.fa file.
Source code is available in the git repository KGEM_on_github.
The programming language is Scala; Maven is required for compilation.
Download and install Maven 2 or 3
Download sources from github repository
From the folder where the sources are placed, run:
mvn clean package
Note: you can download and build the jar from the Maven repository directly:
mvn org.apache.maven.plugins:maven-dependency-plugin:2.4:get -DremoteRepositories=https://raw.github.org/night-stalker/KGEM -Dartifact=kgem:kgem:0.3.1
ERIF is currently not available from Maven directly!
For developers using Maven, a kgem repository is also available. To use it inside a Maven project, the following configuration is necessary:
In the pom.xml, add to the repositories tag:
<repository>
<id>kgem</id>
<name>KGEM repository</name>
<url>https://github.com/night-stalker/KGEM</url>
</repository>
and to the dependencies tag:
<dependency>
<groupId>kgem</groupId>
<artifactId>kgem</artifactId>
<version>$version</version>
</dependency>
Reconstruct viral quasipecies and estimate their frequencies from amplicon reads.
Infers isoform and gene expression levels from high-throughput transcriptome sequencing (RNA-Seq) data.
Pyrosequencing error correction algorithm
11/28/2012 The new version of KEC is available. The algorithm for error threshold finding based on fitting of Poisson distribution to k-counts distribution was added. Special thanks to Bram Vrancken and Alex Artyomenko for their help
02/27/2013 The new version of KEC is available. The user interface was updated and cross-platform functionality was added. Special thanks to Alex Artyomenko
04/12/2013 The new version of KEC is available. An option to use Muscle instead of Clustal for the additional correction procedure was added. Special thanks to Alex Artyomenko
KEC is distributed under the GNU General Public License (http://www.gnu.org/copyleft/gpl.html)
Running instructions for KEC
• Download the java archive KEC.jar from KEC
• Download the implementation of the adaptive mean shift based clustering algorithm from http://coewww.rutgers.edu/riul/research/code/AMS/fams_pc.zip Create a folder named “fams” in the same folder as ErrorCorrection.jar. Put the executable file “fams.exe” into the folder “fams”
• Download ClustalW2 from http://ftp.ebi.ac.uk/pub/software/clustalw2/ Create a folder named “ClustalW2” in the same folder as ErrorCorrection.jar. Put the executable file named “clustalw2.exe” into the folder “ClustalW2”
or
Download Muscle from http://www.drive5.com/muscle/ Create a folder named “Muscle” in the same folder as ErrorCorrection.jar. Put the executable file named “muscle.exe” into the folder “Muscle”
• Download the archive lib.rar from http://alan.cs.gsu.edu/~skumsp/lib.rar and extract it in the same folder as ErrorCorrection.jar
KEC running parameters:
java -jar ErrorCorrection.jar [-h] [-k k] [-i i] [-cl | -mus] [-l l] [-dg dg] [-dpp dpp] filename
Here
filename is the name of file containing reads to be corrected;
k is the size of k-mers. Default: k=25
i is the number of iterations of the algorithm. Default: i=3
-cl enables ClustalW for multiple and pairwise sequence alignment in the additional correction procedure. Default: do not align
-mus enables Muscle for multiple and pairwise sequence alignment in the additional correction procedure. Default: do not align
l controls how the error threshold is found. If l = 0, then the algorithm based on fitting of Poisson distribution to k-counts distribution is used. If l > 0, then the region of l consecutive zeros in the k-counts distribution is used to find the error threshold. Default: l = 0
dg is the parameter for haplotype postprocessing using multiple alignment (see parameter alpha, Algorithm 2, step 3). Default: dg = 30
dpp is the parameter for postprocessing of haplotypes using pairwise alignment of neighbor leaves of the neighbor joining tree (see parameter alpha, Algorithm 2, step 4). Default: dpp = 30
-h - help
Examples:
java -jar ErrorCorrection.jar -k 25 -i 3 -cl -l 25 test.fas
java -jar ErrorCorrection.jar test.fas
java -jar ErrorCorrection.jar -mus -l 1 -dg 15 -dpp 15 test.fas
java -jar ErrorCorrection.jar -h
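The l > 0 rule from the parameter list can be pictured with a short sketch. It is only an illustration and not the KEC implementation: it assumes that the k-counts distribution records, for every count c, how many distinct k-mers occur exactly c times, and that the threshold is the first count at which a run of l consecutive empty counts begins. The function names are hypothetical.

```python
from collections import Counter


def kmer_counts(reads, k=25):
    """Count how often every k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for start in range(len(read) - k + 1):
            counts[read[start:start + k]] += 1
    return counts


def error_threshold(counts, l):
    """First count t such that no k-mer occurs t, t+1, ..., t+l-1 times.

    Returns None when the distribution has no run of l consecutive zeros.
    """
    if l <= 0:
        raise ValueError("l must be positive for the consecutive-zeros rule")
    occupied = set(counts.values())
    if not occupied:
        return None
    run = 0
    for count in range(1, max(occupied) + 1):
        run = run + 1 if count not in occupied else 0
        if run == l:
            return count - l + 1
    return None
```

Under this reading, k-mers that occur fewer times than the returned threshold would be the candidates for correction; how KEC applies the threshold internally is described in the paper listed under References.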
The output contains several files. The most important are:
1) filename_corrected.fas_corrected.fas – corrected reads
2) filename_corrected.fas_haplotypes.fas - haplotypes found after the first stage of the algorithm (without alignment stage)
3) filename_corrected.fas_haplotypes.fas_postprocessed.fas_RevComp.fas_PostprocPair.fas_postprocessed.fas_PostprocPair.fas - haplotypes found after the second stage of the algorithm using alignment (available only with -a)
Data sets
Data sets used in the paper are available at
1) sequencing results (fasta files, sff files)
2) haplotypes found by KEC
3) HVR1 clones used to create data sets (original and reverse complemented sequences)
Running instructions for ET
Will be here soon
References
P. Skums, Z. Dimitrova, D. S. Campo, G. Vaughan, L. Rossi, J. C. Forbi, J. Yokosawa, A. Zelikovsky, Y. Khudyakov, “Efficient error correction for next-generation sequencing of viral amplicons,” BMC Bioinformatics 13 (Suppl10): S6 2012, publisher url
Transcriptome assembly and quantification from RNA-Seq reads
MaLTA is a method for simultaneous transcriptome assembly and quantification from Ion Torrent RNA-Seq data. Our approach explores transcriptome structure and incorporates a maximum likelihood model into the assembly and quantification procedure. A new version of the IsoEM algorithm suitable for Ion Torrent RNA-Seq reads is used to accurately estimate transcript expression levels. Experimental results on both synthetic and real datasets show that Ion Torrent RNA-Seq data can be successfully used for transcriptome analyses, and suggest increased transcriptome assembly and quantification accuracy of the MaLTA-IsoEM solution compared to existing state-of-the-art approaches. For details see our BMC Genomics paper.
The MaLTA software tool is retired and no longer available. Please use the IsoEM2 software instead.
Contact Information:
Viral Spectrum Assembler implements novel viral assembly and frequency estimation methods. The software uses simple error correction, assembly of viral variants based on maximum-bandwidth paths in weighted read graphs, and frequency estimation via Expectation Maximization on all reads.
Introduction
ViSpA implements novel viral assembly and frequency estimation methods. The software uses simple error correction, assembly of viral variants based on maximum-bandwidth paths in weighted read graphs, and frequency estimation via Expectation Maximization on all reads. Experiments show that ViSpA assembles quasispecies better than the state-of-the-art method ShoRAH.
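For readers unfamiliar with maximum-bandwidth paths, the sketch below shows the general technique on a small weighted graph: among all paths between two vertices, it finds one whose smallest edge weight is as large as possible, using a variant of Dijkstra's algorithm. It is a textbook illustration and not the ViSpA source code; the graph encoding and the function name are assumptions, and edge weights are assumed to be positive.

```python
import heapq


def max_bandwidth_path(graph, source, target):
    """Return (bandwidth, path) for the widest path, or (0, []) if none exists.

    graph maps each node to a dict of neighbor -> positive edge weight.
    """
    best = {source: float("inf")}   # widest bottleneck known for each node
    parent = {source: None}
    heap = [(-best[source], source)]  # max-heap via negated bandwidth
    done = set()
    while heap:
        negative_bandwidth, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node == target:
            break
        for neighbor, weight in graph.get(node, {}).items():
            # The bandwidth of a path is its smallest edge weight.
            bandwidth = min(-negative_bandwidth, weight)
            if bandwidth > best.get(neighbor, 0):
                best[neighbor] = bandwidth
                parent[neighbor] = node
                heapq.heappush(heap, (-bandwidth, neighbor))
    if target not in best:
        return 0, []
    path, node = [], target
    while node is not None:
        path.append(node)
        node = parent[node]
    return best[target], path[::-1]
```

In a read graph, vertices would stand for reads and edge weights for the support of an overlap, so a maximum-bandwidth path favors chains of reads with no weak link.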
ViSpA source code
The software is written in Java; the EM part is written in Python and the wrapper script in Bash.
ViSpA can be downloaded here (vispa02.zip). See the readme.txt file for installation instructions.
Contacts
Alexander Zelikovsky
Department of Computer Science
Georgia State University
34 Peachtree Str., 1443
Atlanta, GA 30303
Phone: (404) 413 5730
Fax: (404) 413-5717
Email: alexz@cs.gsu.edu
web: http://www.cs.gsu.edu/~cscazz/
Irina Astrovskaya
Department of Computer Science
Georgia State University
34 Peachtree Str., 1415
Atlanta, GA 30303
Email: iraa@cs.gsu.edu
Bassam Tork
Department of Computer Science
Georgia State University
34 Peachtree Str., 1415
Atlanta, GA 30303
Email: btork1@cs.gsu.edu
Related Publications
Astrovskaya, I., Tork, B., Mangul, S., Westbrooks, K., Mandoiu, I., Balfe, P., and Zelikovsky, A., Inferring Viral Spectrum from 454 Pyrosequencing Reads, 1st Annual RECOMB Satellite Workshop on Massively Parallel Sequencing (RECOMBseqCCB2011), BMC Bioinformatics (to appear) pdf
Astrovskaya, Irina A., "Inferring Genomic Sequences" (2011). Computer Science Dissertations. Paper 59 (pdf). http://digitalarchive.gsu.edu/cs_diss/59
Westbrooks, K., Astrovskaya, I., Rendon, D. C., Khudyakov, Y., Berman, P., and Zelikovsky, A., HCV Quasispecies Assembly using Network Flows, Proc. of Fourth International Symposium on Bioinformatics Research and Applications (ISBRA 2008), Lecture Notes in Computer Science vol. 4983, pp. 159-170.
Related Presentations
Astrovskaya, I., Westbrooks, K., Tork, B., Mangul, S., Mandoiu, I., Balfe, P., and Zelikovsky, A., Inferring Viral Population from Ultra-Deep Sequencing Data, workshop on Computational Advances for Next Generation Sequencing (CANGS 2011) (invited talk), February 2011.
Astrovskaya, I., Westbrooks, K., and Zelikovsky, A., Reconstruction of HCV Quasispecies Haplotypes from 454 Life Science Reads, ISBRA 2010 (short abstract), May 2010.
Astrovskaya, I., Westbrooks, K., Tork, B., Mangul, S., Mandoiu, I., Balfe, P., and Zelikovsky, A., VISPA: Viral Spectrum Assembling Method, The 1st IEEE International Conference on Computational Advances in Bio and Medical Sciences (ICCABS'11), February 2011, Best Poster Award.
This code may be freely used for all non-commercial purposes.
(c) Copyright, 2011 by Professor Alexander Zelikovsky
Department of Computer Science,
Georgia State University
Atlanta, GA 30303 (404) 413 5730
alexz@cs.gsu.edu http://www.cs.gsu.edu/~cscazz/
SCALABLE PHASING METHOD BASED ON 2-SNP HAPLOTYPES. The new phasing method is based on statistically significant LD and deviation from Hardy-Weinberg equilibrium. On datasets across 69 regions from HapMap, 2SNP is 3-4 orders of magnitude faster than PHASE; it usually outperforms HAPLOTYPER and GERBIL and matches PHASE.
Please contact alexz@cs.gsu.edu regarding the older software tools below:
MetNetAligner: Web service tool for matching metabolic pathways/networks, identifying pathway holes (missing enzymes) and suggesting plausible candidates, and finding network motifs.
DACS: Disease Association Combinatorial Search software searches for statistically significant multi-SNP combinations associated with a disease.
Tagging: Tag Selection based on SNP Multivariate Linear Regression.
Disease Association: This project explores disease susceptibility prediction on genotype/haplotype data.
Trio Phasing: Frequently case/control genotype data represents family trios consisting of two parents and one offspring. The trio phasing project develops a satisfactory method for phasing family trio data.