Bioinformatics has now acquired a dramatic importance for the study of numerous diseases like cancers, bovine spongiform encephalopathy, Alzheimer… It is also essential to analyze ecological data showing the effect of environmental change and pollution. Thanks to new effective technologies (high scale sequencing, mass spectrometry, DNA chip…) Bioinformatics has highly diversified itself these last years. The quantity of the generated data grows with exponential speed and makes it all the more crucial for effective analysis algorithms to be developed. I have focused my research mostly on three domains: proteomics, biological sequences analysis and micro-array data analysis. Proteomics The goal of this relatively recent branch of genomics is to study the proteomes of living organisms, that is, the set of proteins that are expressed at a given moment by a given cell, tissue or organism. Unlike to the nucleic acid genome, the dynamic aspects of protein expression and modification imply that a large number of different proteomes can be potentially expressed by a given organism. Each organism expresses number of different proteins across developmental time and space – from a few thousand in prokaryotes to tens of thousands in eukaryotes. This picture is further complicated by the fact that large numbers of protein variants are generated via such processes as transcript processing, post-translating modifications, to name a few. This diversity in turn renders the analysis of the proteomic data complex and makes the creation of fast and robust computing tools essential. Biological Sequence Analysis The purpose of biological sequence analysis is to determine in silico, and as automatically as possible, functional properties of a biological macromolecule directly from its nucleic acid sequence (dealing with genomic sequences) or of amino acid sequence (dealing with proteins). During evolution, DNA sequences are subject to mutations. These mutations can cause different consequences depending on where they occur. A mutation that does not affect the survival of an organism is likely to retained and passed-on to the next generations, whereas a mutation located to a site(s) critical for gene function or regulation may be catastrophic for the survival of the organism and thus may disappear before being passed to future generations. One thus concludes that mutations that play a part in critical biological functions are likely to be preserved through subsequent generations. By looking for similarities between sequences we hope to identify both variant and conserved sites. Microarray data analysis Microarrays are tools used for performing many hybridization experiments in parallel. As noted by Schliep, two main applications are considered for microarrays. First application is measuring the expression levels of thousands of genes simultaneously. Gene expression level is measured based on the amount of mRNA sequences bound or hybridized to their complementary sequences affixed on the surface of the microarray. The complementary sequences are called probes which are typically short DNA strands about 8 to 30 base pairs. The second important application of miocroarrays is the identifi.cation of unknown biological components in a sample. Knowing the sequences affixed on the microarray and considering the hybridization pattern of a sample, one can infer which targets exist in the sample by observing appropriate hybridization reactions. Heuristics optimization for the non-unique probe selection problemIn order to accurately measure the gene expression levels in microarray experiments, it is crucial to design unique, highly specific and sensitive oligonucleotide probes for the identification of biological agents such as genes in a sample. Unique probes are hard to obtain for closely related genes such as the known strains of HIV genes. The non-unique probe selection problem is to select a probe set that is able to uniquely identify targets, in a biological sample, while containing a minimal number of probes. This is a NP-hard problem and our contribution is four fast and very simple non-random greedy heuristics for finding near minimal non-unique probe sets. Our greedy methods substantially outperformed the only two know greedy heuristics in literature for non-unique probe selection problem. We also obtained results that are within 10% to 5% of the performances of the three recent state-of-the-art advanced methods, namely Integer Linear Programming, Optimal Cutting-Plane and Genetic Algorithm approaches. We apply for the first time an Estimation of Distribution Algorithm (EDA) named Bayesian Optimization Algorithm (BOA) to the problem of Non-unique Oligonucleotide Probe Selection, The proposed approach considers integration of BOA and one simple heuristic introduced for the non-unique probe selection problem. The results provided by this approach compare favorably with the state-of-the-art methods in the single target case. While most of the recent research works on this problem has been focusing on the single target case only, we present the application of our method in integration with decoding approach in a multi-objectives optimization framework for solving the problem in the case of multiple targets also taking into account the individual cost of each probe. By comparing the results of three-objective optimization method with the results of one and two-objective optimization methods, we understood that minimizing the design cost cannot be realized properly, unless we define an explicit objective for it. Moreover, our experiments on the decoding objective indicated that by applying the multi-objectives algorithm, we can obtain probe sets that have high ability of target identification. This study has been published in:
Inference of Molecular Phylogenies Thirty-six single genes of 6 plants inferred 18 unique trees using maximum parsimony. Such incongruence is an important challenge. How to reconstruct the congruent tree is still one of the most challenges in molecular phylogenetics. For resolving this problem, a genome-wide EST data mining approach was systematically investigated by retrieving a large size of EST data of 144 shared genes of 6 green plants from GenBank. The results show that the concatenated alignments approach overcame incongruence among single-gene phylogenies and successfully reconstructed the congruent tree of 6 species with 100% jackknife support across each branch when 144 genes was used. Jackknife supports of correct branches increased with number of genes linearly, but the number of wrong branches also increased linearly. For inferring the congruent tree, a minimum of 30 genes were required. This approach may provide potential power in resolving conflict in phylogenies. Since the discovery of the “living fossil” in 1938, the coelacanth (Latimeria chalumnae) has generally been considered to be the closest living relative of the land vertebrates, and this is still the prevailing opinion in most general biology textbooks. However, the origin of tetrapods has not been resolved for decades. Three principal hypotheses (lungfish-tetrapod, coelacanth-tetrapod, or lungfish-coelacanth sister group) have been proposed. We used the Bayesian method under the coalescence model with the latest published program (Bayesian Estimation of Species Trees, or BEST) to perform a phylogenetic analysis for seven relevant taxa and 43 nuclear protein-coding genes with the jackknife method for taxon sub-sampling. The lungfish-coelacanth sister group was consistently reconstructed with the Bayesian method under the coalescence model in 17 out of 21 taxon sets with a Bayesian posterior probability as high as 99%. Lungfish-tetrapod was only inferred from BCLS and BACLS. Neither coelacanth-tetrapod nor lungfish coelacanth-tetrapod was predicted out of all 21 taxon sets. Our results provide strong evidence in favor of accepting the hypothesis that lungfishes and coelacanths form a monophyletic sister that is the closest living relative of tetrapods supported by high Bayesian posterior probabilities of the branch (a lungfish-coelacanth clade) and high taxon jackknife supports. This study has been published in:
Combinatorial optimization strategies for ungapped local multiple alignmentUngapped Local Multiple Alignment (ULMA) is a widely used procedure in bioinformatics. It roughly consists of locating in a given set of nucleotide (DNA) or amino acids (Proteins) sequences, a number of non-overlapping fixed-size factors (also called occurrences), which are likely to have evolved from a common ancestor. This is done by optimizing the level of similarity between the factors. We defined the problem from a pure combinatorial optimization point of view. We showed that this approach is able to reduce the number of objective function evaluations, as well as the total CPU-time, compared to simple multi-start hill-climbing from random seeds. We have also conceived a new objective function to model the similarity level between factors: the Covering Entropy. It has the property to take into account, contrary to any other information content measures, the fact that some amino acid substitutions occur more frequently than others. We have proven, on dedicated artificial and real data, that the Covering Entropy presents significant advantages, either considering the capacity to measure the signal (biological pertinence) or considering its effect on the improvement in efficiency of the exploration. As an example, it allows the reduction by 100 fold the number of points to consider during the exploration. This work has led to the development of a tool available online. This tool has been and is currently being used in several biological projects. There also have been tenths of thousand queries on the tool. This study has been published in:
Modified peptide identification by tandem mass spectrometryProtein identification by tandem mass spectrometry (MS/MS) is key to most proteomics projects and has been widely explored in bioinformatics research. Obtaining good and trusted identification results has important implications for biological and clinical work. Although well matured, automated software identification of proteins from MS/MS data still faces a number of obstacles due to the complexity of the proteome or to procedural issues of mass spectrometry data acquisition. Expected or unexpected modifications of the peptide sequences, polymorphisms, errors in databases, missed or non-specific cleavages, unusual fragmentation patterns and co-eluting peptides are many pitfalls for identification algorithms. A lot of research work has been carried out in recent years; this has given rise to new strategies designed to handle a number of these issues. In this work, we describe a new approach for identifying and characterizing modified peptides using MS/MS data. Our method, Popitam, does not require giving a list of suspected modifications and is therefore classified as an “open-modification search” algorithm. The size of the search space is much greater with such an approach, so we also focused our work on the conception of optimized scoring functions. We defined a set of subscores describing different aspects of similarity between an MS/MS spectrum and a theoretical sequence. Then we used Genetic Programming to explore the different ways of combining these subscores, so that the built scoring function can efficiently identify and discriminate the correct peptide amongst a list of candidate peptide sequences. This work has led to the development of a tool, the first to be able to identify modified proteins without any previous information about the type of modification to consider. It also proposes “scenarios” of possible modifications associated to the sequence of the identified protein. It is available online.This work led to the submission of an international patent n°02743517.1-1225-IB0202731 “peptide and protein identification method”. This study has been published in:
The Molecular ScannerThe development of high-throughput utilities to identify proteins continues to be a major challenge in present proteomic research. In one such utility, the molecular scanner, a technique developed by the Geneva Proteome Center of the “Hôpital Cantonal Universitaire” of the University of Geneva, proteins separated by 2-dimensional polyacrylamide gel electrophoresis (2D-PAGE) are digested, first in the gel and then during transfer onto a collecting membrane. A peptide mass fingerprint (PMF) is then measured by mass spectrometry for every scanned site. Since the spacing between scanned sites is much smaller than the size of the protein spots, these spots cannot be missed and the redundancy in the data can be used to improve the quality of the identifications.Visual examination revealed that masses detected in the mass spectra showed specific spatial patterns. Some of these masses were found almost at every scanned site, whereas other masses showed a localised pattern. In the former case the masses could be attributed to chemical noise, which were not relevant to protein identification and could therefore be discarded. In the latter case, masses stemmed from peptides localised in one or several spots, and these peptide masses contained the information about the proteins present in the sample. It was also observed that the signal intensity of a peptide mass followed more or less the protein concentration, which makes it a useful indicator in finding protein spots and their centres, whereas these spots were not apparent if only the total signal intensity of the spectra (total ion chromatogram (TIC)) was considered. Furthermore, peptide masses could be linked to a spot or group of spots if the maxima of their intensity distributions are close to a spot centre. This allowed the identification of the masses belonging to a protein spot without PMF identification of the protein in a sequence database, which is useful for many purposes. Finally, it could be observed that many false positives of PMF identification had matching peptide masses with different intensity distributions, because these masses may stem from different spots and chemical noise. Including the similarity of these peptide masses into the PMF scoring provided a much more selective identification compared with the straightforward approach, where identification is carried out for each scan point separately without taking into account the spatial correlation in the data. These methods provide a spot detection that is more sensitive than staining, and many spots could be detected and identified that had not yet been annotated. This study has been published in:
|







