9 - 10 July 2018
Centre for Genomic Regulation (CRG)
Centre de Cultura Contemporània de Barcelona (CCCB)
Andromeda 2.0: the first search engine that combines symmetric binomial probability-based scoring and tag-enhanced scoring, leading to improved proteome coverage.
Highly accurate database search engines are essential in shotgun proteomics workflows to identify proteins and their modifications in biological samples. Currently, the scoring methods of most search engines consider only matching peak masses when measuring the quality of a peptide sequence identification. However, additional features are available that can increase the identification rate. With Andromeda 2.0, the search engine incorporated into MaxQuant, a powerful quantitative proteomics software package, we introduce two new scoring methods: the symmetric binomial score, which uses intensities predicted by either a conventional neural network or a deep learning model, and tag-based scoring. These additional scoring methods significantly increase the number of peptide identifications at a fixed false discovery rate (FDR).
We extended the Andromeda search engine to support two new scoring methods. The symmetric binomial score uses predicted intensities for theoretical spectra. To predict the intensities, we integrated two models into MaxQuant: a deep learning neural network model for the most accurate predictions, and a conventional neural network model that can be re-trained to adapt the intensity prediction to specific datasets. We call the method symmetric because the theoretical spectrum, like the experimental spectrum, can be filtered to its most intense peaks. The tag-based score uses the number of matching neighboring peaks in the score calculation. This additional information allows assigning higher weights to consecutively matching peaks and down-weighting matches for which no series neighbors are found.
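The symmetric filtering idea can be illustrated with a minimal Python sketch. It is not the Andromeda 2.0 implementation: the function names, the fixed chance-match probability `p_match`, the mass tolerance `tol`, and the use of a plain binomial tail probability converted to a -10 log10 score are all assumptions made for illustration. Peaks are `(m/z, intensity)` pairs, where the theoretical intensities stand in for model predictions.

```python
import math

def binomial_tail(n, k, p):
    """Probability of k or more successes in n trials with success chance p."""
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

def top_peaks(peaks, top):
    """Keep the `top` most intense (mz, intensity) peaks."""
    return sorted(peaks, key=lambda pk: pk[1], reverse=True)[:top]

def symmetric_binomial_score(theoretical, experimental, top_theo, top_exp,
                             p_match=0.04, tol=0.02):
    """Hypothetical score: filter BOTH spectra by intensity, then count matches.

    p_match (chance of a random match) and tol (m/z tolerance) are
    illustrative assumptions, not values taken from Andromeda 2.0.
    """
    theo = top_peaks(theoretical, top_theo)   # filter by predicted intensity
    exp = top_peaks(experimental, top_exp)    # filter by measured intensity
    matched = sum(
        1 for mz_t, _ in theo
        if any(abs(mz_t - mz_e) <= tol for mz_e, _ in exp)
    )
    tail = binomial_tail(len(theo), matched, p_match)
    score = -10.0 * math.log10(max(tail, 1e-300))  # guard against log10(0)
    return score, matched

# Toy spectra: theoretical intensities play the role of predicted values.
theoretical = [(100.0, 0.9), (200.0, 0.8), (300.0, 0.1), (400.0, 0.05)]
experimental = [(100.01, 50.0), (200.0, 40.0), (350.0, 30.0), (400.0, 1.0)]
score, matched = symmetric_binomial_score(theoretical, experimental,
                                          top_theo=2, top_exp=3)
print(f"matched={matched}, score={score:.2f}")
```

In this toy case, keeping only the two theoretical peaks with the highest predicted intensity drops the low-intensity fragments at 300 and 400 from the trial count, so the score depends only on fragments that are expected to be observed.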
To validate the symmetric binomial score, we analyzed a peptide set from ProteomeTools, a large library of synthetic peptides covering essentially all human gene products. We compared the identified peptide sequences to the ground truth of the synthetic dataset and identified additional correct peptide sequences. Based on these sequences, we demonstrate that predicted intensities for the theoretical spectra can shift the correct peptide sequence from the second-best-scoring peptide spectrum match (PSM) to the top-scoring PSM. We further assessed Andromeda 2.0 by analyzing a complex biological LC-MS/MS dataset of HeLa cell lysate and identified significantly more peptides at 1% FDR when applying the symmetric binomial score.
To predict intensities for the ProteomeTools peptides and the HeLa dataset, we applied the deep learning model, which uses long short-term memory cells. For more specialized datasets, we additionally integrated into MaxQuant a conventional neural network model that uses a sliding window approach. Although the latter model is not as accurate as the deep learning model, it enables faster tuning and inference, making the predictions accessible to various types of data. Like the deep learning model, the conventional model has no restrictions on the peptide length.
To validate the tag-based score, we analyzed a dataset of human leukocyte antigen (HLA) class I peptides acquired with several different fragmentation methods, using unspecific digestion in the Andromeda search. For the score calculation, the weights for peak matches are optimized depending on whether the left and right neighbors in a fragment series have also been found. The weights take values between 0 and 1 and are optimized for each fragmentation method. With the tag-based scoring method, we increased the number of peptide spectrum matches (PSMs) for all fragmentation types at an FDR of 1%.
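The neighbor-dependent weighting can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Andromeda 2.0 code: the function name, the three-level scheme (no, one, or both neighbors matched), and the example weight values are hypothetical, and the abstract does not specify how the weighted count enters the final score.

```python
def tag_weighted_matches(matched, w_none, w_one, w_both):
    """Sum neighbor-dependent weights over one fragment series.

    matched: list of bools, one per consecutive ion of a series
             (for example b1..bn), True if that ion was found.
    w_none, w_one, w_both: hypothetical weights in [0, 1] for a match
             with zero, one, or two matched series neighbors.
    """
    weights = (w_none, w_one, w_both)
    total = 0.0
    for i, hit in enumerate(matched):
        if not hit:
            continue
        left = i > 0 and matched[i - 1]
        right = i + 1 < len(matched) and matched[i + 1]
        total += weights[int(left) + int(right)]
    return total

# Three consecutive matches, a gap, then one isolated match.
series = [True, True, True, False, True]
print(tag_weighted_matches(series, w_none=0.25, w_one=0.5, w_both=1.0))
```

With these example weights, the isolated match at the end contributes far less than the match in the middle of the consecutive run, which mirrors the down-weighting of matches without series neighbors described above. In practice the weights would be tuned separately for each fragmentation method.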