Tools

Databases

Molecular data play a key role in phylogenetic inference. Mammalian systematics provides a clear example, with several previously open evolutionary questions now answered. However, molecular studies have so far relied on only a handful of classic markers and have not attempted to use the information contained in the increasingly large pool of mammalian genome sequences. Identifying and using new, potentially informative markers from this pool can help to further resolve the mammalian phylogenetic tree.

The EnsEMBL database was used to select a set of single-copy orthologous markers from the available mammalian genomes. Exons of reasonable length for further amplification from genomic DNA and sequencing in additional species were then selected. The phylogenetic utility and the evolutionary characteristics of these candidate markers were then evaluated using an in-house bioinformatics pipeline. The resulting OrthoMaM database can be queried through this website. The current OrthoMaM release is based on EnsEMBL v54. It now includes 6447 exons and 12958 CDS candidate markers for up to 33 taxa.

Link to ORTHOMAM database

Software

Until now, the most efficient way to align nucleotide sequences containing open reading frames was to use indirect procedures that align the amino acid translations and then report the inferred gap positions at the codon level. There are two important pitfalls with this approach. Firstly, any premature stop codon prevents the use of this strategy. Secondly, each sequence is translated in the same reading frame from beginning to end, so that a single additional nucleotide leads to both an aberrant translation and an aberrant alignment.
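The second pitfall can be illustrated with a minimal Python sketch. The two toy sequences and the `translate` helper are made up for illustration and are not part of any tool described here; the helper uses the standard genetic code and a single fixed reading frame.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (NCBI table 1).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(seq):
    """Translate seq in frame 0; an incomplete trailing codon is dropped."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


# Hypothetical coding sequence and a copy with one extra nucleotide.
original = "ATGGCCAAGTTTGGA"
shifted = original[:4] + "C" + original[4:]  # single-nucleotide insertion

print(translate(original))  # MAKFG
print(translate(shifted))   # MAQVW: every codon after the insertion changes
```

Because the translation of the shifted sequence diverges right after the insertion, an alignment built from these two protein strings would misplace the downstream residues even though the nucleotide sequences differ by a single base.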

We present an algorithm that has the same space and time complexity as the classical Needleman-Wunsch algorithm while accommodating sequencing errors and other biological deviations from the coding frame. The resulting pairwise coding sequence alignment method was extended to a multiple sequence alignment (MSA) algorithm implemented in a program called MACSE (Multiple Alignment of Coding SEquences accounting for frameshifts and stop codons). MACSE is the first automatic solution to align protein-coding gene datasets containing non-functional sequences (pseudogenes) without disrupting the underlying codon structure. It has also proved useful in detecting undocumented frameshifts in public database sequences and in aligning next-generation sequencing reads/contigs against a reference coding sequence.
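For reference, the classical Needleman-Wunsch recurrence mentioned above can be sketched as follows. This is only the textbook global alignment score with a linear gap penalty, not the MACSE algorithm; the function name and the scoring values (match, mismatch, gap) are arbitrary assumptions chosen for illustration.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score of a and b (classical dynamic programming).

    Time is O(len(a) * len(b)); only two rows of the table are kept,
    so this score-only variant needs O(len(b)) memory.
    """
    prev = [j * gap for j in range(len(b) + 1)]  # row for the empty prefix of a
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against the empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


print(needleman_wunsch_score("GATTACA", "GATTACA"))  # 7
print(needleman_wunsch_score("ACGT", "AGT"))         # 1
```

Recovering the alignment itself, rather than just its score, requires keeping the full table (or traceback pointers), which is where the quadratic space cost of the classical algorithm comes from.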

MACSE is distributed as an open-source executable Java file with freely available source code, and it can also be used via a web interface.

Link to MACSE web server and to MACSE project page.

Marker-assisted selection strongly relies on genetic maps to accelerate breeding programs. High-density maps are now available for numerous species. Dedicated tools are required to compare several high-density maps on the basis of their key characteristics, while pinpointing their differences and similarities.

We developed the Genetic Map Comparator—a web-based application for easy comparison of different maps according to their key statistics and the relative positions of common markers.

For better responsiveness or confidentiality, a local version can be launched on any computer with a recent version of R and the R shiny package.
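To give an idea of what comparing maps by key statistics and common markers can mean, here is a minimal Python sketch. It is not the Genetic Map Comparator code (which runs on R and shiny); the data layout, the function names and the two toy maps are assumptions made for illustration.

```python
def map_summary(gmap):
    """Basic statistics of a map given as {marker: (linkage_group, position_cM)}."""
    groups = {}
    for marker, (group, position) in gmap.items():
        groups.setdefault(group, []).append(position)
    return {
        "markers": len(gmap),
        "linkage_groups": len(groups),
        # Length of a group = distance between its first and last marker.
        "total_length_cM": sum(max(p) - min(p) for p in groups.values()),
    }


def common_marker_positions(map_a, map_b):
    """Positions in both maps of the markers they share, keyed by marker name."""
    return {m: (map_a[m], map_b[m]) for m in sorted(map_a.keys() & map_b.keys())}


# Two hypothetical maps sharing three markers.
map_a = {"m1": ("1", 0.0), "m2": ("1", 12.5), "m3": ("1", 30.0), "m4": ("2", 5.0)}
map_b = {"m1": ("1", 0.0), "m3": ("1", 25.0), "m4": ("2", 8.0), "m5": ("2", 20.0)}

print(map_summary(map_a))
print(map_summary(map_b))
print(common_marker_positions(map_a, map_b))
```

Plotting the paired positions returned by `common_marker_positions` is one simple way to spot markers whose order or spacing differs between two maps.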

Link to the genetic map comparator server and to the associated project page.

With next-generation sequencing, SNP discovery is relatively easy in diploid species but is still hampered in polyploid species by the confusion due to homeology. We developed HomeoSplitter, a fast and effective solution to split original contigs obtained by RNA-seq into two homeologous sequences. It exploits the differential expression of the two homeologous genes in the RNA. We verified that the new sequences are closer to the diploid progenitors of the allopolyploid species than the original contig. By remapping the original reads on these new sequences, we also verified that the number of valuable detected SNPs increased significantly.

HomeoSplitter disentangles homeologous sequences using a maximum likelihood optimization. On a benchmark set of 2,505 clusters containing homologous sequences of urartu, speltoides and durum, HomeoSplitter built sequences closer to the diploid references and increased the number of valuable SNPs from 188 out of the 1,360 SNPs detected when mapping the reads on the de novo durum assembly to 762 out of the 1,620 SNPs detected when mapping on HomeoSplitter contigs.
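As a quick check of what these counts mean in relative terms, the proportions can be computed directly from the four numbers reported above (the variable names are ours):

```python
# Counts reported for the benchmark: valuable SNPs / all detected SNPs.
valuable_before, detected_before = 188, 1360  # reads mapped on the de novo assembly
valuable_after, detected_after = 762, 1620    # reads mapped on HomeoSplitter contigs

share_before = valuable_before / detected_before
share_after = valuable_after / detected_after

print(f"valuable SNPs before: {share_before:.1%}")  # 13.8%
print(f"valuable SNPs after:  {share_after:.1%}")   # 47.0%
print(f"fold increase in valuable SNPs: {valuable_after / valuable_before:.2f}")
```

In other words, the share of detected SNPs considered valuable rises from roughly one in seven to nearly one in two.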

HomeoSplitter provides a practical solution to the complex problem of disentangling homeologous transcripts in allo-tetraploids, which further allows an improved SNP detection.

Link to homeoSplitter Web Site (including program and documentation)

Semantic approaches such as concept-based information retrieval rely on a corpus in which resources are indexed by concepts belonging to a domain ontology. To keep such applications up to date, new entities need to be annotated frequently to enrich the corpus. However, this task is time-consuming and requires a high level of expertise in both the domain and the related ontology. Different strategies have thus been proposed to ease this indexing process, each one taking advantage of the features of the document.

USI (User-oriented Semantic Indexer) is a fast and intuitive method for indexing tasks. We introduce a solution to suggest a conceptual annotation for new entities based on related already indexed documents. Our results, compared to those obtained by previous authors using the MeSH thesaurus and a dataset of biomedical papers, show that the method surpasses text-specific methods in terms of both quality and speed. Evaluations are done via usual metrics and semantic similarity.

Link to USI web server.

Supertree methods combine phylogenies with overlapping sets of taxa into a larger one. Topological conflicts frequently arise among source trees for methodological or biological reasons, such as long branch attraction, lateral gene transfers, gene duplication/loss or deep gene coalescence. When topological conflicts occur among source trees, liberal methods infer supertrees containing the most frequent alternative, while veto methods infer supertrees not contradicting any source tree, i.e. discard all conflicting resolutions. When the source trees host a significant number of topological conflicts or have a small taxon overlap, supertree methods of both kinds can propose poorly resolved, hence uninformative, supertrees.

To overcome this problem, PhySIC_IST infers non-plenary supertrees, i.e. supertrees that do not necessarily contain all the taxa present in the source trees. It discards the taxa whose position differs greatly among source trees or for which insufficient information is provided.

Link to PhySIC_IST web page

SuperTriplets is a triplet-based supertree approach to phylogenomics. It infers supertrees with branch support values.

When using a triplet-based representation of source trees, the matrix representation with parsimony (MRP) method is related to the median tree notion. We here introduce SuperTriplets, a new algorithm specially designed to optimize this alternative formulation of the MRP criterion. The method avoids several practical limitations of the triplet-based binary matrix representation, making it suitable for large datasets. When the correct resolution of every triplet appears more often than the incorrect ones in the source trees, SuperTriplets is guaranteed to reconstruct the correct phylogeny. Both simulations and case studies on mammalian phylogenomics confirm the advantages of this approach. In both cases, SuperTriplets tends to propose less resolved but more reliable supertrees than those inferred using MRP.
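The triplet-based representation can be made concrete with a short Python sketch. It is not the SuperTriplets algorithm: it only decomposes rooted trees into triplets and reports, for each set of three taxa, the resolution that is strictly more frequent than its alternatives. The nested-tuple tree encoding, the function names and the three toy source trees are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def clusters(tree):
    """Return (leaf set, leaf sets of all internal nodes) of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_found = clusters(child)
        leaves |= child_leaves
        found += child_found
    found.append(leaves)
    return leaves, found


def triplets(tree):
    """Resolved rooted triplets of a tree, each as (frozenset({a, b}), c) for ab|c."""
    leaves, found = clusters(tree)
    result = set()
    for trio in combinations(sorted(leaves), 3):
        for c in trio:
            pair = frozenset(trio) - {c}
            # ab|c holds if some clade contains a and b but not c.
            if any(pair <= cluster and c not in cluster for cluster in found):
                result.add((pair, c))
    return result


def majority_resolutions(trees):
    """For each trio of taxa, keep the resolution that strictly outnumbers the others."""
    counts = Counter(t for tree in trees for t in triplets(tree))
    by_trio = {}
    for (pair, c), n in counts.items():
        by_trio.setdefault(pair | {c}, []).append((n, pair, c))
    winners = {}
    for trio, options in by_trio.items():
        options.sort(key=lambda option: option[0], reverse=True)
        if len(options) == 1 or options[0][0] > options[1][0]:
            winners[trio] = (options[0][1], options[0][2])
    return winners


# Three hypothetical source trees that disagree on the position of mouse and dog.
source_trees = [
    ((("human", "chimp"), "mouse"), "dog"),
    ((("human", "chimp"), "dog"), "mouse"),
    (("human", "chimp"), ("mouse", "dog")),
]
for trio, (pair, outgroup) in majority_resolutions(source_trees).items():
    print(sorted(pair), "|", outgroup)
```

On this toy input, only the two triplets grouping human with chimp are retained. The triplets involving both mouse and dog receive one vote per alternative and are left unresolved.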

Link to SuperTriplets web page

OntoFocus restricts an ontology to a subset that makes concept relationships easier to grasp. This is especially useful for understanding the GO annotation of a gene or the over-represented GO terms in a transcriptomic analysis.

Given a reference ontology, a "good" sub-ontology may be defined as the smallest self-explanatory excerpt containing the concepts of interest. Given an initial set of concepts of interest, their relationships generally cannot be made explicit without adding some hyponyms and hyperonyms. This problem is tackled with an algorithmic approach based on the is-a relation and two common operators: least common ancestor (lca) and greatest common descendant (gcd).
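As an illustration of the lca operator on an is-a hierarchy, here is a minimal Python sketch. It is not the OntoFocus implementation; the toy ontology, its concept names and the function names are made up for the example. The gcd operator is symmetric (follow the relation from parents to children) and is not shown.

```python
def ancestors(concept, parents):
    """All concepts reachable from `concept` through is-a links, itself included."""
    seen, stack = {concept}, [concept]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def least_common_ancestors(concepts, parents):
    """Common ancestors that are not proper ancestors of another common ancestor."""
    common = set.intersection(*(ancestors(c, parents) for c in concepts))
    return {
        x for x in common
        if not any(y != x and x in ancestors(y, parents) for y in common)
    }


# Hypothetical is-a hierarchy: each concept maps to its direct parents.
IS_A = {
    "cat": ["feline"],
    "lion": ["feline"],
    "feline": ["carnivore"],
    "dog": ["carnivore"],
    "carnivore": ["mammal"],
    "bat": ["mammal", "flying_animal"],
    "bird": ["flying_animal"],
    "mammal": ["animal"],
    "flying_animal": ["animal"],
}

print(least_common_ancestors(["cat", "lion"], IS_A))  # {'feline'}
print(least_common_ancestors(["cat", "dog"], IS_A))   # {'carnivore'}
print(least_common_ancestors(["bat", "bird"], IS_A))  # {'flying_animal'}
```

Because an ontology is a directed acyclic graph rather than a tree, the function returns a set: two concepts may have several least common ancestors.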

http://www.ontotoolkit.mines-ales.fr/

Web Server

PhyloExplorer is a tool to facilitate assessment and management of phylogenetic tree collections. Given an input collection of rooted trees, PhyloExplorer provides facilities for obtaining statistics describing the collection, correcting invalid taxon names, extracting taxonomically relevant parts of the collection using a dedicated query language, and identifying related trees in the TreeBASE database.

Link to PhyloExplorer server

OBIRS is an ontology-based information retrieval system designed to favor user interaction. It provides a query method and an environment based on aggregation models to assess the relevance of documents annotated with concepts from an ontology. The selected documents are displayed in a semantic map that gives graphical indications of the extent to which they match the user's query. This human-machine interface favors a more interactive and iterative exploration of the data corpus by facilitating the weighting of query concepts and providing a visual explanation of the results.

This web server relies on OBIRS to query (human) genes based on the Gene Ontology (GO) concepts that annotate them. For instance, genes involved in teeth development can be searched using a query made of the following two GO concepts: "structural constituent of tooth enamel" and "odontogenesis". OBIRS will not only retrieve the ENAM gene (indexed by both concepts) but also genes such as AMBN or COL1A1, which are indexed by different (but strongly related) concepts.

http://www.ontotoolkit.mines-ales.fr/ObirsClient

Code Library

A large number of bioinformatics applications in the fields of bio-sequence analysis, molecular evolution and population genetics typically share input/output methods, data storage requirements and data analysis algorithms. Such common features may be conveniently bundled into reusable libraries, which enable the rapid development of new methods and robust applications.

Bio++ is a set of object-oriented libraries written in C++. Available components include classes for data storage and handling (nucleotide/amino-acid/codon sequences, trees, distance matrices, population genetics datasets), various input/output formats, basic sequence manipulation (concatenation, transcription, translation, etc.), phylogenetic analysis (maximum parsimony, Markov models, distance methods, likelihood computation and maximization), population genetics/genomics (diversity statistics, neutrality tests, various multi-locus analyses) and various algorithms for numerical calculus.

Link to BioPP home page