NGS Survey

Comprehensive Comparison of Cloud-Based NGS Data Analysis and Alignment Tools

Qanita Bani Baker, Mahmoud M. Hammad, Wesam Al-Rashdan, Yaser Jararweh, Mohammad AL-Smadi, Mohammad Al-Zinati

College of Computer and Information Technology

Jordan University of Science and Technology, Irbid, Jordan

Open Source NGS Tools:

· Galaxy [54] [55]: Galaxy is a free web-based platform that combines many popular tools and data sources for biomedical research. It makes bioinformatics analyses accessible to a wide range of users, especially those without advanced programming skills, by letting them set parameters and run tools and workflows from its interface [54] [55]. Every computational analysis automatically captures the tool parameters used, so users can repeat and understand the whole analysis. Transparency is preserved by letting users share analyses over the web and create interactive document pages that describe the entire analysis.

· CloudBurst [56]: CloudBurst is an open-source read-mapping tool for NGS data built on MapReduce. It is reported to be 30 times faster than RMAP (a read-mapping program) and runs the seed-and-extend algorithm in MapReduce so that mapping is processed in parallel.

· Crossbow [57]: Crossbow is an open-source tool based on Hadoop. It combines the speed of Bowtie with the accuracy of SOAPsnp to perform alignment and SNP detection on multiple whole-human datasets per day. The authors of [57] hope that utility computing services will make such analysis widely available to anyone. It uses a fixed-seed-and-extend search algorithm in the alignment phase.

· SeqMapReduce [59]: SeqMapReduce is an open-source tool that uses the cloud to parallelize sequence mapping and runs on the MapReduce model. Its speedup is nearly linear in the number of connected computers. It maps 6 million sequence reads to the human genome in 4.5 minutes on 32 computers. SeqMapReduce is faster than CloudBurst and easy to use for inexperienced users. It uses an algorithm based on the pigeonhole principle and runs on the Amazon EC2 cloud.

· DIYA [60]: DIYA is a free, modular, and configurable pipeline tool for rapid annotation of bacterial genome sequences. It takes DNA contigs as input in two forms (complete genomes or the output of shotgun sequencing) and outputs an annotated sequence in GenBank format. The DIYA developers plan to package DIYA in virtual machines for simple deployment on cloud facilities and lab workstations. It uses the C++, Java, Perl, and Python languages.

· GATK [61] [62]: GATK is an open-source structured programming framework designed to ease the development of efficient and robust analysis tools for DNA sequencer data using MapReduce. It provides a small but rich set of data access patterns that covers the needs of most analysis tools. Separating the analysis calculations from the data management infrastructure allows the framework to be optimized for stability, correctness, and memory and CPU efficiency. It uses a genotyping algorithm and is written in Java.

· Myrna [64]: Myrna is a cloud (Amazon EC2) pipeline for computing differential gene expression in large RNA-seq datasets. Myrna uses Bowtie for short-read alignment and R/Bioconductor for normalization, testing, and quantification.

· Ergatis [65]: Ergatis is a workflow management system that lets users build, run, and monitor pipelines for the computational analysis of genomic data. It includes pre-configured components and pipelines for a set of bioinformatics tasks such as genome comparison and prokaryotic genome annotation. The outputs of most of these components can be loaded into a Chado relational database.

· CloVR [66]: CloVR is a modern open-source tool for push-button automated sequence analysis that uses cloud resources. It runs as a single portable virtual machine that provides several analysis pipelines, or protocols, for whole-genome sequencing, microbial genomics, and metagenomics, such as CLOVR-SEARCH, CLOVR-16S [67], CLOVR-METAGENOMICS [67], and CLOVR-MICROBE [68]. To improve performance, CloVR supports the use of remote cloud resources for processing large sequence datasets. CloVR needs at least 2 GB of RAM and 10 GB of free disk space.

· Cloudaligner [71]: Cloudaligner is an open-source tool built on Hadoop/MapReduce that achieves high performance. Its results are more accurate than those of RMAP, and it is faster than CloudBurst. It has a user-friendly interface and aims to handle the very long reads produced by next-generation sequencers. Cloudaligner omits the reduce phase to improve performance. It uses the seed-and-extend mapping algorithm [29] for sequence alignment.

· RAPSearch2 [72]: RAPSearch2 is a newer, memory-efficient implementation of the RAPSearch algorithm that indexes a protein database with a collision-free hash table. It reduces the memory requirement and implements multi-threading, which lets users further speed up similarity search on multi-core CPUs. It is written in C++.

· Jnomics [73] [74]: Jnomics is an open-source cloud-scale suite for sequence analysis designed to help meet the computational challenges posed by the continuing revolution in parallel DNA technologies. Its features include minimal configuration, extensibility, file-format independence, and parallelization of existing tools. It is written in Java.

· PeakRanger [77]: PeakRanger is an open-source software package for the ChIP-seq technique, which builds on NGS to study the interactions between DNA and proteins. PeakRanger can run in parallel on the Amazon EC2 cloud to achieve high performance on large datasets. It uses a staged algorithm, an FDR-based adaptive thresholding algorithm, and a summit-search algorithm. It is written in C++.

· ArrayExpressHTS [78]: ArrayExpressHTS is an R-based pipeline for preprocessing, expression estimation, and data quality assessment of RNA-Seq datasets. The package lets users obtain a standard Bioconductor ExpressionSet object containing expression levels from raw sequence files with a single R function call. Its primary feature is ease of use, and it runs on the R cloud or on a local node with either public or private data.

· SIMPLEX [79]: SIMPLEX is an open-source automated pipeline that uses the public cloud (Amazon EC2) to analyze exome sequencing data. It can process single-end (SE) and paired-end (PE) data and handles input data encoded in nucleotide space. To support exome-sequence data analysis, particularly in small labs, the pipeline is provided as a fully functional VirtualBox image that requires no extra installation of software or databases. It uses the realignment algorithm of the Genome Analysis Toolkit. It is written in Java.

· Rainbow [83]: Rainbow is an open-source cloud-based software package that helps automate WGS data analysis. Rainbow is an enhancement of Crossbow. Its improvements include handling BAM and FASTQ input files, splitting large sequence files, gathering and tracking the running metrics of data processing, and merging SOAPsnp results from multiple individuals into a single file to facilitate downstream genome-wide association studies. It uses a merge-sort-based algorithm and the Java virtual machine.

· MEGAN [84]: MEGAN is a comprehensive microbiome analysis tool used for taxonomic analysis and for analyzing metatranscriptomic samples. It supports several file formats so that data can be imported easily from several types of mapping and alignment tools. MEGAN aims to provide a versatile tool for analyzing single metagenomes or sets of metagenomes on a computer. It uses the lowest common ancestor algorithm and is written in Java.

· Stormbow [85]: Stormbow is open-source, easy-to-use, and cost-effective software for analyzing RNA-Seq data and reducing RNA-Seq turnaround time. The tool was tested by analyzing 178 RNA-seq samples on Amazon's EC2 and S3 cloud services. At its core it is a wrapper around OSA that hides the complexity of using OSA directly to analyze RNA-Seq data on the cloud. It uses a read-mapping algorithm and is written in Perl, Shell, and R scripts.

· BioPig [87]: BioPig is open-source sequence analysis software that uses Apache Pig and Hadoop to process sequence data at large scale. Its features include high programmability, which reduces development time for parallel applications, automatic scaling with the amount of data, and portability across Hadoop infrastructures without modification. It uses several k-mer-based algorithms and is written in Java and Shell (Bash).

· Eoulsan [88]: Eoulsan is an open-source scalable framework that relies on the Hadoop implementation of the MapReduce algorithm for high-throughput RNA sequencing data analysis. Eoulsan lets users easily set up a cloud cluster and automate the analysis of many samples at once using the various software solutions available. It is written in Java and uses the public cloud (Amazon Web Services (AWS), EC2).

· Atlas2 [89]: Atlas2 is a variant detection package optimized for exome capture data from all NGS platforms (Illumina, Roche 454, and SOLiD). It contains Atlas-SNP2 for calling SNPs and Atlas-Indel2 for calling indels. It is written in Python. It is available as open source and runs on the public cloud (Amazon EC2 and S3) and on a community cloud through the Genboree Workbench.

· TREAT [94]: TREAT is an open-source tool for easy navigation and mining of variants from whole-exome sequencing and targeted resequencing. It integrates in-house developed annotations, variant-hosting genes, host-gene pathways, and visualizations of variants. It uses the public cloud (Amazon EC2).

· Cloud BioLinux [49]: Cloud BioLinux is a publicly available virtual machine that gives scientists on-demand infrastructure for high-performance bioinformatics computing on cloud computing platforms. Users have immediate access to a set of pre-configured command-line and graphical applications. It is written in Python. It uses the public cloud (Amazon EC2).

· HugeSeq [95]: HugeSeq is a complete computational pipeline that fully automates variant detection from genomic sequences, from alignment to annotation, and detects all kinds of genetic variation: SNPs, indels, and larger structural variations. It uses well-established SNP and indel calling algorithms.

· VAT [98]: VAT is an open-source tool for annotating variants from multiple personal genomes at the transcript level and for obtaining summary statistics across genes and individuals. VAT visualizes the effects of different variants, integrates allele frequencies and genotype data of the underlying individuals, and supports comparative analysis between different groups of individuals. VAT can run on the Amazon EC2 cloud as a virtual machine to avoid unnecessary transfers of large datasets and to allow on-demand access. It is written in C and PHP.

· FX [99]: FX is an open-source tool for analyzing RNA-Seq data. FX can run on a local Hadoop system or, without investment in high-performance computing, at minimal cost on Amazon Web Services (Amazon EC2). FX outputs short indels, SNP calls, and gene expression profiles, and is used for genomic variant calling and for estimating gene expression levels. It has high accuracy and a user-friendly interface. It is developed using Java 1.6.0.

· YunBe [100]: YunBe is free, open-source gene set analysis software for the Amazon EC2 cloud and is ready to run on AWS. It can speed up pathway-based biomarker identification through secure and inexpensive distributed computing. It uses kipuMarkers algorithms and is written in Java.

· CloudMan [47] [48]: CloudMan is an open-source platform for collaboration that enables individual researchers to easily customize, deploy, and share their entire cloud analysis environment. CloudMan ensures that all of the low-level infrastructure administration details are abstracted away and automated for the user. It is written in Python. It uses the public cloud (AWS) and private clouds (OpenStack and OpenNebula).

· Hadoop-BAM [102]: Hadoop-BAM is a new open-source library for manipulating aligned NGS data using the Hadoop framework together with the Picard SAM JDK and command-line tools. It acts as an integration layer between analysis tools and BAM files. Hadoop-BAM solves the problems of BAM data access by offering a convenient API for implementing map and reduce functions that operate directly on BAM records.

· SparkSeq [103]: SparkSeq is a flexible open-source tool built to take advantage of Apache Spark, a modern MapReduce framework, for NGS data. It is an elastic, general-purpose, and easily extendable library for genomic cloud computing. It can be used to build genomic analysis pipelines in Scala and run them interactively.

· BioVLAB-MMIA-NGS [105]: BioVLAB-MMIA-NGS is an open-source tool that runs on a powerful cloud and is used to analyze NGS data. It is a new version of BioVLAB-MMIA and runs on a high-performance server named MAHA and on the Amazon cloud. Its features include the use of more precise sequencing data for determining miRNA expression levels and the use of multiple computational models for characterizing miRNAs.

· Contrail [106]: Contrail is an open-source tool based on the Hadoop/MapReduce framework that enables de novo assembly of large genomes. It parallelizes the assembly across a large number of nodes, which also removes memory constraints. It relies on the graph-theoretic framework of de Bruijn graphs (DBG).

· Mercury [107]: Mercury is an elastic, automated, and extendable open-source analysis workflow that provides precise and reproducible genomic results at scales ranging from individuals to large groups. It runs on local hardware and on the AWS public cloud (EC2 and S3) through the DNAnexus platform.

· STORMSeq [109]: STORMSeq is an open-source tool that performs read cleaning, read mapping, variant calling, and annotation of genomic data. It serves as a graphical interface solution for the cloud: it is offered as an interface to Amazon EC2 and S3 and has a user-friendly pipeline for whole-exome sequencing.

· SURPI [110]: SURPI is an open-source computational pipeline for pathogen identification from complex metagenomic NGS data. It can be deployed on standalone servers and in the cloud (Amazon EC2). It uses two aligners, SNAP and RAPSearch, to speed up the analysis. It is written as shell, Python, and Perl scripts.

· SeqPig [111]: SeqPig is an open-source library that uses Apache Pig to process large sequencing datasets. It simplifies data access, manipulation, and analysis, and works with Hadoop and Amazon Elastic MapReduce (EMR) to run data processing in parallel on the Amazon cloud (S3 and EMR). It is written in Java.

· SNP2Structure [113]: SNP2Structure is a structure database resource for mapping nsSNPs to 3D protein structures. Among its features, the portal shows a direct comparison of two related 3D structures, and the protein models contain all interacting molecules from the original PDB structures. Users can therefore identify regions of potential interaction change caused by a protein mutation, and the mutated structures can be saved locally. It uses the JSmol package to display protein structures, which avoids system compatibility problems. It runs on the Amazon cloud.

· Halvade [114]: Halvade is an open-source framework that enables pipelines to run in parallel on multiple computers or on a multi-core infrastructure in a highly efficient way. Halvade is based on the MapReduce model and is used to implement DNA-seq and RNA-seq pipelines. It is written in Java. It uses the public cloud (Amazon Web Services, such as EC2 and S3).

· CLUSTOM-CLOUD [115]: CLUSTOM-CLOUD is open-source software for distributed sequence clustering, described as the first distributed sequence clustering tool, and is based on In-Memory Data Grid (IMDG) technology. An IMDG is a distributed data structure that keeps all data in the main memory of multiple computing devices. This technology improves both the ability of CLUSTOM-CLOUD to process large datasets and its computational scalability compared with previous versions. It is written in Java. It runs on the public cloud (Amazon EC2).

· MG-RAST [117] [118]: MG-RAST is an open-source platform for metagenomic sequence data analysis hosted at Argonne National Lab. MG-RAST takes raw sequence data from registered users and automates a collection of bioinformatics software to process, analyze, and interpret the data before returning the analysis results to the users. Version 4.0 provides an HTML5/JavaScript interface. It uses Shock, the AWE server, and the Amazon EC2 public cloud.

· MC-GenomeKey [119]: MC-GenomeKey is an open-source software package that efficiently runs different analysis workflows for detecting and annotating mutations using cloud resources from multiple commercial cloud providers. It supports several execution scenarios with different levels of sophistication. It is written in Python, C, C++, and Java. It runs on public clouds (Amazon, Google, and Azure) and on a private cloud (OpenStack).
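
Several of the mappers listed above (CloudBurst, Crossbow, Cloudaligner) are described as seed-and-extend aligners, and SeqMapReduce is described as relying on the pigeonhole principle. The sketch below is a minimal single-machine illustration of how the two ideas fit together, not the implementation used by any of these tools: a read that matches with at most k mismatches is split into k + 1 non-overlapping seeds, so at least one seed must match exactly. The function names (build_index, map_read) and the substitution-only mismatch model are assumptions made for illustration.

```python
from collections import defaultdict


def build_index(reference, seed_len):
    """Map every substring of length seed_len to its start positions (hypothetical helper)."""
    index = defaultdict(list)
    for i in range(len(reference) - seed_len + 1):
        index[reference[i:i + seed_len]].append(i)
    return index


def map_read(read, reference, max_mismatches):
    """Return sorted reference positions where read aligns with at most
    max_mismatches substitutions (no indels; illustrative only)."""
    # Pigeonhole principle: k mismatches cannot touch all k + 1 disjoint seeds,
    # so at least one seed occurs exactly in the reference.
    n_seeds = max_mismatches + 1
    seed_len = len(read) // n_seeds
    if seed_len == 0:
        raise ValueError("read is too short for this many mismatches")
    index = build_index(reference, seed_len)
    hits = set()
    for s in range(n_seeds):
        offset = s * seed_len
        seed = read[offset:offset + seed_len]
        # Seed step: exact lookup of the seed in the index.
        for pos in index.get(seed, []):
            start = pos - offset
            if start < 0 or start + len(read) > len(reference):
                continue
            # Extend step: compare the whole read against the candidate window.
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits.add(start)
    return sorted(hits)
```

In a MapReduce setting the seed lookups would typically be distributed across nodes by seed key; here everything runs in one process to keep the idea visible.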

Commercial NGS Tools:

· BaseSpace [40]: BaseSpace is a powerful cloud platform for easily analyzing, sharing, and storing genetic data. Illumina MiSeq and HiSeq instruments produce sequencing data, which are streamed over the Internet to BaseSpace and analyzed by applications. Users can launch their applications in BaseSpace to display and analyze their input files.

· Bina [41]: Bina is a workbench for visualizing and analyzing bioinformatics networks. It offers a service that includes the Bina Box hardware and a cloud service. Bina Box uses the BWA and GATK tools to accelerate NGS data analysis, while the Bina cloud is used to share the analysis results with others.

· DNAnexus [42]: DNAnexus is a cloud-based solution for managing and analyzing NGS data. It can directly upload data produced by specific software. It also has a scalable pipeline for large datasets that runs on the Amazon cloud. The DNAnexus platform manages all cloud resources and tears them down after use.

· LifeScope [44]: LifeScope is a modular bioinformatics tool for secondary and tertiary analysis of data created by LifeScope equipment. LifeScope files can be used with third-party tools to visualize and analyze data. LifeScope gives customers powerful support for data analysis tools on the SOLiD system.

· GeneSifter [43]: GeneSifter is a microarray data management service for gene analysis. The service is designed to support all forms of DNA sequencing. GeneSifter can be accessed from any computer or mobile device, and users pay only for the service they use.

· SevenBridges [45]: Seven Bridges is a leading company in bioinformatics data analysis that supports a range of healthcare research. Seven Bridges offers immediate access to multiple datasets and bioinformatics tools for exploring different genomic data.

NGS Alignment Tools:

· BWA [58]: BWA (Burrows-Wheeler Aligner) is a software package for aligning short sequence reads against large reference genomes. It contains three algorithms: BWA-backtrack, BWA-SW, and BWA-MEM. BWA performs gapped alignment of paired-end and single-end reads and reports alignment quality.

· SAMtools [63]: SAMtools is used to process DNA alignment data and must be compiled from source code. It lets biologists interact with high-throughput genomics data. It can sort and merge alignments, remove polymerase chain reaction (PCR) duplicates, and generate per-position data.

· MAQ [69] [70]: MAQ (Mapping and Assembly with Quality) is a tool for aligning short sequences produced by NGS equipment. It maps the sequences to a reference genome and then calls the consensus. MAQ performs ungapped alignment.

· BLAT [75] [76]: BLAT (BLAST-Like Alignment Tool) is an alignment tool that is much faster than BLAST. It is used to find similar sequences within the same species for a protein or DNA sequence supplied by the researcher. It was created to map millions of sequence tags at high speed.

· BLAST [80] [81] [82]: BLAST (Basic Local Alignment Search Tool) is the most widely used tool for measuring the similarity between sequences and for inferring functional relationships between sequences and genes. It compares protein sequences against reference databases and calculates the statistical significance of the matches.

· MUMmerGPU 2.0 [86]: MUMmerGPU 2.0 is a DNA sequence alignment tool that uses highly parallel graphics processing units (GPUs) to speed up the mapping of NGS data. It includes many improvements over MUMmerGPU 1.0 that raise the performance and capability of the tool.

· MUMmer [90] [91] [92] [93]: MUMmer is software for rapidly aligning whole genomes, whether in complete or incomplete form. It can also map incomplete genomes. MUMmer is based on the suffix tree and is considered one of the fastest and most efficient tools for this task.

· SHRiMP [96] [97]: SHRiMP (Short Read Mapping Package) is a tool for aligning genomic reads against a target genome. It performs alignment in the presence of extensive polymorphism and sequencing errors. SHRiMP supports parallel computation, paired mapping, and miRNA mapping parameters.

· Bowtie [101]: Bowtie is one of the short-read alignment tools most widely used by companies. It aligns pairs of short reads in a memory-efficient way and relies on a Burrows-Wheeler index to keep its memory footprint small. Bowtie has two phases: an ungapped seed-finding phase and a gapped extension phase.

· Bowtie2 [104]: Bowtie 2 is a fast, memory-efficient tool for paired-end, gapped, and local alignment. It combines the strengths of the full-text minute index with the flexibility and speed of hardware-accelerated dynamic programming to achieve a combination of sensitivity, speed, and accuracy.

· SEAL [108]: SEAL is an alignment tool that combines BWA with duplicate read detection and removal. It supports alignment, manipulation, and analysis of short DNA reads. It provides six Hadoop tools that extract reads from sequencer output, split the read data, map reads and remove duplicates, sort read alignments, and compute a table of primary features for all factors.

· TopHat [112]: TopHat is a tool for aligning RNA-Seq data and analyzing the mapping results to identify splice junctions between exons. It can align reads of different lengths and allows variable-length indels with respect to the reference genome. The TopHat pipeline is designed to discover junctions even in genes expressed at low levels.

· HISAT2 [116]: HISAT2 is a fast and sensitive tool for aligning NGS data to a population of human genomes. It combines several FM indexes to improve analysis efficiency. It has options that let users choose customized scoring. Its output files are designed to be suitable for other tools such as SAMtools or GATK.
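
BWA, Bowtie, Bowtie 2, and HISAT2 are all described above as relying on the Burrows-Wheeler transform or FM indexes. The sketch below illustrates the basic idea on a toy scale, under stated assumptions: it builds the transform by naively sorting suffixes, stores full occurrence tables, and performs exact-match backward search only. The function names (bwt, fm_index, find) are hypothetical, and none of the compression, sampling, or gapped-alignment techniques of the real tools are represented.

```python
def bwt(text):
    """Return the Burrows-Wheeler transform and suffix array of text + '$'.
    Assumes '$' does not occur in text; naive sorting, for illustration only."""
    text += "$"
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT character is the one preceding each sorted suffix
    # (text[-1] is '$' for the suffix starting at position 0).
    return "".join(text[i - 1] for i in suffix_array), suffix_array


def fm_index(bwt_str):
    """Build the two FM-index tables: first[c] and occ[c][i]."""
    counts = {}
    for ch in bwt_str:
        counts[ch] = counts.get(ch, 0) + 1
    # first[c]: number of characters in the text that sort before c.
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]
    # occ[c][i]: number of occurrences of c in bwt_str[:i].
    occ = {ch: [0] * (len(bwt_str) + 1) for ch in counts}
    for i, ch in enumerate(bwt_str):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return first, occ


def find(pattern, text):
    """Return sorted start positions of exact matches of pattern in text."""
    bwt_str, suffix_array = bwt(text)
    first, occ = fm_index(bwt_str)
    lo, hi = 0, len(bwt_str)  # half-open range of matching sorted suffixes
    # Backward search: extend the match one character at a time, right to left.
    for ch in reversed(pattern):
        if ch not in first:
            return []
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(suffix_array[lo:hi])
```

Each step of the backward search costs two table lookups regardless of the reference length, which is one reason this family of indexes suits short-read alignment against large genomes.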
