Abstract

Motivation

One of the core problems in the analysis of protein tandem mass spectrometry data is the peptide assignment problem: determining, for each observed spectrum, the peptide sequence that was responsible for generating the spectrum. Two primary classes of methods are used to solve this problem: database search and de novo peptide sequencing. State-of-the-art methods for de novo sequencing use machine learning methods, whereas most database search engines use hand-designed score functions to evaluate the quality of a match between an observed spectrum and a candidate peptide from the database. We hypothesized that machine learning models for de novo sequencing implicitly learn a score function that captures the relationship between peptides and spectra, and thus may be re-purposed as a score function for database search. Because this score function is trained from massive amounts of mass spectrometry data, it could potentially outperform existing, hand-designed database search tools.

Results

To test this hypothesis, we re-engineered Casanovo, which has been shown to provide state-of-the-art de novo sequencing capabilities, to assign scores to given peptide-spectrum pairs. We then evaluated the statistical power of this Casanovo score function, Casanovo-DB, to detect peptides on a benchmark of three mass spectrometry runs from three different species. In addition, we show that re-scoring with the Percolator post-processor benefits Casanovo-DB more than other score functions, further increasing the number of detected peptides.

1 Introduction

Tandem mass spectrometry is a high-throughput protein content analysis method that can be used to investigate a broad range of biological phenomena, including cellular mechanisms, disease progression, protein alterations, and protein-protein interactions. Each tandem mass spectrometry experiment produces as output a collection of spectra, where each spectrum ideally was generated by a distinct peptide sequence. Hence, a central analysis challenge is the peptide assignment problem, wherein each observed spectrum is assigned a corresponding peptide sequence. Correctly matching peptides to their respective spectra allows for accurate identification and quantification of peptides in a sample, which is critical for the aforementioned downstream applications of tandem mass spectrometry.

The standard approach for solving the peptide detection problem is database search. Database search algorithms solve the peptide detection problem by searching an observed MS2 spectrum against a pre-defined database of peptides. The query spectrum is compared, using some scoring function, to a theoretical spectrum for each database peptide within a given m/z tolerance, and the top-scoring peptide-spectrum match (PSM) is returned. Currently, all widely used database search engines use hand-designed score functions to evaluate the similarity between an observed spectrum and each candidate peptide from the database. The first such score function was the XCorr score, implemented in SEQUEST (Eng et al. 1994) and used by popular search engines such as Comet (Eng et al. 2012) and Tide (Diament and Noble 2011), but many others have been designed subsequently [reviewed in Nesvizhskii (2010)]. While differing somewhat in implementation details, all such score functions take the same general approach of measuring the number of shared fragment ions between the observed mass spectrum and an idealized theoretical spectrum for a given peptide. However, these score functions have been shown to be under powered (Keich and Noble 2015), poorly calibrated (Klammer et al. 2009), and unable to fully capture the complex fragmentation patterns governing the relationship between peptides and their fragment ion spectra (Zolg et al. 2018, Polasky et al. 2022). These limitations may lead to systematic under-detection of peptides in proteomics datasets, leaving valuable insights hidden in the data.

Traditional database search is feasible only when working with data with a corresponding, well-characterized proteome, such as human or model organism data. In some settings, including metaproteomics, antibody sequencing, paleoproteomics, and immunopeptidomics, such a database is not available. In these settings, the alternative solution to the peptide assignment problem is de novo peptide sequencing, in which the peptide sequence is inferred directly from the observed spectrum without the use of a peptide database. Recently, deep learning methods have achieved state-of-the-art performance on this de novo sequencing task (Tran et al. 2017, Karunratanakul et al. 2019, Yang et al. 2019, Qiao et al. 2021, Ge et al. 2022, Yilmaz et al. 2022, Eloff et al. 2023, Jin et al. 2023, Klaproth-Andrade et al. 2024, Lee and Kim 2024, Liu et al. 2023, Mao et al. 2023, Xu et al. 2023, Yang et al. 2024). These models are trained in a supervised fashion to predict a peptide sequence given an observed spectrum.

Here, we hypothesize that when trained on the de novo sequencing task, these deep models implicitly learn the relationship between spectra and their generating peptides. Thus, we propose using the confidence score assigned to predictions from a de novo sequencing model as a score function for database search. Because such models are trained from labeled peptide-spectrum pairs first identified via database search, this proposed score function bootstraps from existing hand-derived score functions, but further benefits from the massive quantities of annotated mass spectra available in public data repositories. Here, we show that leveraging this wealth of available data to learn a better score function can increase the power and sensitivity of existing database search algorithms. Specifically, a powerful score function is one that detects many peptides at a given false discovery rate (FDR), as estimated by a standard statistical model known as “target-decoy competition” (Elias and Gygi 2007). Here, we use a pre-trained, state-of-the-art de novo sequencing model, Casanovo (Yilmaz et al. 2022), to assign scores to given peptide-spectrum pairs during database search.

Our results suggest that, at a 1% FDR threshold, the Casanovo-DB score function consistently yields a larger number of detected peptides than existing score functions, including XCorr (Eng et al. 1994), the X! Tandem Hyperscore (Craig and Beavis 2004), and Andromeda (Cox et al. 2011). In particular, we identify between 31% and 40% more peptides than XCorr, 52%–69% more peptides than Hyperscore, and 88%–102% more peptides than Andromeda score on datasets from E.coli, yeast, and human samples.

2 Materials and methods

2.1 Casanovo-DB score function

Our proposed learned score function is derived from the deep learning model Casanovo (Yilmaz et al. 2022). Casanovo is trained on 30 million high confidence PSMs to solve the de novo sequencing problem, in which the amino acid sequence of a peptide is inferred directly from an observed mass spectrum. Casanovo treats the de novo sequencing problem as a sequence-to-sequence translation task, where the spectrum is represented as a sequence of peaks and the output is the predicted sequence of amino acids. The model consists of a transformer architecture popular in the field of natural-language processing (Vaswani et al. 2017), with a transformer encoder learning an in-context representation of the input mass spectrum and a transformer decoder to predict the next amino acid in the peptide sequence given the spectrum representation and the previously predicted amino acids. Inference with Casanovo is auto-regressive: the model predicts a peptide sequence one amino acid at a time, at each step choosing the most likely next amino acid given the current predicted prefix.

Here, we propose to instead use Casanovo to score only the peptides present in the protein database during database search. To score a given peptide spectrum match using Casanovo, rather than letting the model predict each amino acid, we instead input the true sequence of the peptide into the model and record the score assigned to each amino acid in the sequence. Thus, for a given input spectrum and a peptide sequence s1,s2,,sn, we obtain as output from Casanovo a list of predictions x1,x2,,xn, where each xi represents the model’s estimated likelihood that si is the next amino acid in the peptide given the spectrum and the prefix s1,,si1. To combine these per-amino-acid scores into a single PSM score, we take the geometric mean:

We choose to use the geometric mean of amino acid scores, rather than the arithmetic mean used by Casanovo, in order to more severely punish peptides where Casanovo is uncertain for some amino acids in the sequence. Unlike the de novo sequencing setting, in which only the highest scoring peptide for each spectrum is important, here we want the model to also accurately assign low scores to peptides which are not a good match for the given spectra. Thus, we found that penalizing low amino acid confidence scores more severely by using the geometric mean gave better experimental results.

Conceptually, this scoring procedure is analogous to the “teacher forcing” that is used during Casanovo training, and allows us to extract a measure of what the trained Casanovo model estimates is the plausibility that the given spectrum was generated by the candidate peptide. Higher scores indicate a greater confidence that the match is valid.

To run a full database search using the Casanovo-DB score function, the model scores each observed spectrum with respect to all candidate peptides in the database that fall within a user-specified mass range, specified in units of parts-per-million (ppm). This is a standard procedure for performing mass spectrometry database search. Casanovo produces as output an mzTab file (Griss et al. 2014) in which, for each spectrum, the top-scoring candidate peptide is reported along with its score. A preliminary implementation of Casanovo-DB is available under an Apache license at https://github.com/Noble-Lab/casanovo/tree/db_search.

2.2 Existing score functions

We selected three widely used search engines with distinct score functions to compare our method to: Tide search (Diament and Noble 2011), which uses the XCorr score function, SAGE (Lazear 2023), which uses the Hyperscore score function, and MaxQuant (Cox et al. 2011), which uses the Andromeda score function.

XCorr is the oldest score function, proposed in the very first database search algorithm, SEQUEST (Eng et al. 1994), and still used by the Tide and Comet search engines (Eng et al. 2012). XCorr scores a peptide-spectrum match as follows. First, the mass spectrum u is binned along the m/z axis and normalized. Second, a theoretical spectrum v is constructed for the peptide, containing constant intensity peaks for the b-, y-, and a-fragment ions along with neutral losses of ammonia and water. Finally, the similarity of u and v is calculated as

Thus, XCorr is the cross-correlation between the observed and theoretical spectra, giving a measure of their dot product similarity normalized by the background similarity observed between the two vectors at various bin offsets.

The Hyperscore takes a more probabilistic approach to scoring PSMs, comparing the number of observed fragment ions for the given peptide compared to the number expected purely by chance under a hyper-geometric distribution. Specifically, the Hyperscore counts the number of matched b-ions nb and y-ions ny, along with their unit-normalized intensities Ib and Iy, and computes:
The Andromeda score takes a similar approach, but uses the binomial distribution to assign significance to the number of observed fragment ions. The given mass spectrum is pre-processed by keeping only the q most intense peaks per 100 m/z/ window, and a theoretical spectrum u is constructed for the peptide containing the same set of fragment ions as used by XCorr. The Andromeda score is then calculated as
where n is the number of peaks in the theoretical spectra for the peptide and k is the number of matching peaks observed in the spectrum.

2.3 Performance measures

We evaluate each search engine by measuring the number of distinct peptides that are detected from a given dataset at a false discovery rate (FDR) threshold of 1%. FDR is estimated using a standard procedure, known as “target-decoy competition” (TDC) (Elias and Gygi 2007). Here, we use a “double competition” variant of TDC to estimate FDR at the peptide level (Lin et al. 2022) The procedure consists of the following steps:

  1. Digest the proteins in the given database to peptides using a pre-specified set of enzymatic digestion rules. The resulting peptides are referred to as “targets.”

  2. Shuffle each unique target peptide to create a corresponding decoy peptide.

  3. Score each observed spectrum with respect to all of the target and decoy peptides whose masses lie within a specified range of the spectrum’s associated precursor mass.

  4. For each spectrum, retain the single, top-scoring target or decoy peptide. We refer to each spectrum and its associated peptide as a peptide-spectrum match (PSM). This is the first competition, because the target and decoy peptides compete to be assigned to the spectrum.

  5. For each peptide that is matched to at least one spectrum, retain only the top-scoring PSM for that peptide.

  6. For each target peptide that is matched to at least one spectrum, compete it against the corresponding decoy peptide, retaining whichever of the two PSMs has a better score. Any target or decoy that is not matched to any spectrum automatically loses the competition.

  7. Rank the remaining peptides by score, from largest (i.e. best) to smallest. At each position k in the ranked list, estimate the FDR as min(1,Dk+1Tk), where Dk (respectively, Tk) is the number of decoy peptides (respectively, targets) with rank smaller than k.

  8. Select the largest value of k such that Dk+1Tk < α, for some specified random matching rate α. In this work, we used α=0.01.

We use Crema (Lin et al. 2023) to perform decoy-based FDR estimation on the PSM rankings provided by running a database search on the spectra.

As an auxiliary performance measure, we also compute for each score function and dataset the “target match percentage” (TMP) (Bai et al. 2016), which is defined as the proportion of spectra that are assigned to target peptides. Formally, the TMP is TD+T. The TMP is complementary to the number of peptides accepted at 1% FDR, because the TMP does not require that scores be calibrated across spectra. By independently evaluating the ranking for each spectrum, a PSM score function can perform well according to the TMP measure even if the top-scoring scores assigned to two different spectra are scaled differently and cannot be directly compared to one another.

Note that, for two TMP values to be comparable, they need to be computed on the same set of spectra. Due to differences in implementation details, different search engines filter out different spectra based on certain quality control metrics, and thus do not report the top scoring PSM for all input spectra. This makes it impossible to directly compare the TMP for the results. Therefore, we perform a TMP comparison only between Casanovo-DB and XCorr, which assign PSMs to all spectra in the dataset, making the TMP values comparable.

2.4 Benchmarking data

To evaluate the performance of the four score functions, we downloaded a publicly available dataset (ProteomeXchange ID: PXD028735) containing mass spectrometry runs for the organisms Escherichia coli (E.coli), Homo sapiens (human), and Saccharomyces cerevisiae (yeast) which is commonly used as a benchmark for mass spectrometry analysis pipelines (Van Puyvelde et al. 2022). Because these are MS/MS runs that were generated post-2019, we guarantee that there is no train/test leakage of spectra for Casanovo-DB, since Casanovo was trained on mass spectra generated and published prior to 2019 (Yilmaz et al. 2022). Furthermore, because Casanovo was trained from MassIVE-KB, which contains only peptides from human, the E.coli and yeast test sets avoid leakage at the level of peptide sequences. All of the selected data was generated using an Orbitrap instrument, run in data-dependent acquisition mode. The resulting .raw files were converted to .mgf files using MSConvert (Kessner et al. 2008). Peak picking for MS levels 1–2 using the “Vendor” algorithm was used to reduce noise. The human, yeast, and E.coli .fasta files used in all database searches were downloaded from UniProt on 11/6/23, 4:30 PM.

2.4.1 Database search parameters

To ensure a fair comparison for our performance benchmark, we ran each search engine with equivalent parameters wherever possible. All searches used a single, static modification, carbamidomethylation of cysteine (C+57.02146). Variable modifications included methionine oxidation (M+15.994915), asparagine and glutamine deamidation (N+0.984016, Q+0.984016), as well as four N-terminal modifications (X+42.010565, X+43.005814, X-17.026549, X+25.980265). The cleavage rule was trypsin with proline suppression ([RK]|{P}) and we set the allowed maximum missed cleavages to 1. Finally, the precursor ion mass tolerance and fragment ion mass tolerance were ±20 ppm each. Other than the above settings, we chose to leave all other options as their default values in all other search engines. For example, the maximum number of variable modifications per peptide was kept to search engine defaults. Furthermore, to allow reproducibility of our primary results, the annotated MGF files, which indicate which candidate peptides were scored for each observed spectrum, are available at https://zenodo.org/records/10825365.

In addition to running three separate search engines, we repeated the comparison to XCorr and Hyperscore using our own, in-house search engine. This search used the parameters outlined above but ensured that every spectrum is scored with respect to exactly the same set of candidate peptides. One caveat to this experiment is that, due to the way our fragment ion indexing works, our search engine only reports the top 20 scores per spectrum, so Casanovo-DB only scores 20 candidates per spectrum. A priori, Casanovo-DB is very unlikely to re-rank any PSM from >20 to the top-ranked position per spectrum, so we do not expect this limitation to impact the reported results.

2.4.2 Percolator re-scoring

We used Percolator version 3.06.01, as implemented in Crux version 4.1-5c7d0d1-2023-11-14 (Park et al. 2008), to perform semi-supervised PSM re-scoring. The feature list provided to Percolator for all search engines was minimal, consisting only of score, precursor m/z, and charge. All other options were left as defaults. This minimal use of Percolator simply calibrates all score functions with respect to charge and m/z. After re-ranking PSMs, Percolator automatically performs peptide-level FDR control and generates a peptide-level report assigning an FDR estimate to each peptide detection.

3 Results

3.1 Casanovo-DB is more powerful than existing score functions

In our first experiment, we aimed to compare the statistical power of four score functions to detect peptides from mass spectrometry data while controlling the false discovery rate. We carried out this analysis on data from three different species, E.coli, yeast, and human, and we used target-decoy competition to control the FDR. In each case, we find that the learned Casanovo-DB score function detects more peptides across a range of FDR thresholds in all three species (Fig. 1a–c). Specifically, at the standard 1% FDR threshold, we observe an average 88% improvement over the Andromeda score (E.coli: 88%, yeast: 75%, human: 102%), 57% improvement over the Hyperscore (E.coli: 69%, yeast: 50%, human: 52%), and 35% improvement over XCorr (E.coli: 31%, yeast: 35%, human: 40%).

Each figure plots the number of peptides detected as a function of FDR threshold for the (a) E.coli, (b) yeast, and (c) human datasets. In each plot, the series correspond to different score functions, and the 1% FDR threshold is highlighted with a red dashed line. (d) Similar to panel (c), but generated using our in-house search engine.
Figure 1.

Each figure plots the number of peptides detected as a function of FDR threshold for the (a) E.coli, (b) yeast, and (c) human datasets. In each plot, the series correspond to different score functions, and the 1% FDR threshold is highlighted with a red dashed line. (d) Similar to panel (c), but generated using our in-house search engine.

As a secondary validation of this result, we repeated the above experiment but using our own in-house search engine. The search engine implements the XCorr and Hyperscore score functions, and reports PSMs for scoring by Casanovo-DB. This experiment gives us better control over the selection of candidate peptides for each spectrum: rather than trying to match parameters of different search engines, we directly score a common set of PSMs. The results of this experiment (Fig. 1d) align well with those reported above: at the 1% FDR threshold we find that Casanovo-DB yields an average improvement of 33% over XCorr and 20% over the Hyperscore.

We also use a complementary measure, the target match percentage (TMP), to evaluate the quality of each score function. The TMP is the proportion of spectra whose top-scoring peptide is a target. This measure does not take into account how well calibrated a given score is across spectra. Hence, achieving a high TMP score is necessary but not sufficient to achieving good statistical power. We observe that Casanovo-DB again outperforms the next most powerful score function, matching more spectra to target peptides than XCorr in all three species (Table 1). This result implies that at least some of Casanovo-DB’s strong performance in Fig. 1 is due to its ability to rank the generating peptide above other candidate peptides, irrespective of score calibration across spectra. Note that it is not possible to calculate the TMP for Andromeda and Hyperscore because the corresponding search engines include filters that prevent scores from being assigned to some spectra.

Table 1.

Target match percentage (TMP) comparison between Casanovo-DB and Tide for the E.coli, human, and yeast datasets.a

OrganismCasanovo-DBXCorr
E.coli0.7070.679
Human0.8190.776
Yeast0.7600.725
OrganismCasanovo-DBXCorr
E.coli0.7070.679
Human0.8190.776
Yeast0.7600.725
a

Highest TMP on each dataset in bold. Casanovo-DB outperforms Tide for all organisms.

Table 1.

Target match percentage (TMP) comparison between Casanovo-DB and Tide for the E.coli, human, and yeast datasets.a

OrganismCasanovo-DBXCorr
E.coli0.7070.679
Human0.8190.776
Yeast0.7600.725
OrganismCasanovo-DBXCorr
E.coli0.7070.679
Human0.8190.776
Yeast0.7600.725
a

Highest TMP on each dataset in bold. Casanovo-DB outperforms Tide for all organisms.

An important component of Casanovo-DB is the use of the geometric, rather than arithmetic, mean of Casanovo’s per-amino-acid score. In the context of de novo sequencing, we have observed that these two methods for combining scores perform very similarly. However, in the database search setting, switching from arithmetic to geometric mean dramatically improves the performance of Casanovo-DB. For example, at a 1% FDR threshold, the number of peptides detected increases by 21% for the E.coli data, 6% for the yeast data, and 14% for the human data. We hypothesize that this difference arises due to the increased need for calibration in the database search setting. In the de novo setting, Casanovo can search for a relatively high quality peptide for every given spectrum, whereas in the database search setting, Casanovo is sometimes forced to score a spectrum with respect to a small set of candidates, none of which truly generated the spectrum. Hence, the database search setting requires comparing high quality PSMs with low quality PSMs, making calibration more important.

3.2 Percolator re-scoring benefits Casanovo-DB more than other score functions

In practice, database search tools are rarely used on their own. Instead, their output is typically post-processed using a tool such as Percolator (Käll et al. 2007), which uses a semi-supervised machine learning algorithm to re-score PSMs based on additional features of the peptide and spectrum. This re-scoring serves to address the calibration issues exhibited by many score functions by adjusting PSM scores based on confounding factors such as peptide length, precursor m/z, and charge. Because Casanovo-DB scores are derived from a deep learning model that was not specifically optimized to produce well calibrated scores, we hypothesized that Percolator re-scoring would benefit Casanovo-DB search results to a greater extent than other score functions.

To test this hypothesis, we repeated our comparison of score functions, but including a Percolator post-processing step in each case. The results (Fig. 2) support our hypothesis, with Percolator further increasing the number of additional peptides detected by Casanovo-DB at 0.01 FDR. Percolator improves the number of identifications for Casanovo-DB by an average of 9%, compared to improvements of 5% for XCorr, 8% for Hyperscore, and 21% for Andromeda. Andromeda’s strong improvement from Percolator suggests that it may also be poorly calibrated across spectra. Overall, these results further support the idea that Casanovo-DB’s power comes primarily from its ability to rank the generating peptide at the top of the list of candidates. Therefore, when coupled with Percolator, which improves the calibration of those scores across spectra, we achieve even better statistical power.

Each figure plots the number of peptides detected after Percolator post-processing as a function of FDR threshold for the (a) E.coli, (b) yeast, and (c) human datasets. In each plot, the series correspond to different score functions, and the 1% FDR threshold is highlighted with a red dashed line.
Figure 2.

Each figure plots the number of peptides detected after Percolator post-processing as a function of FDR threshold for the (a) E.coli, (b) yeast, and (c) human datasets. In each plot, the series correspond to different score functions, and the 1% FDR threshold is highlighted with a red dashed line.

3.3 Casanovo-DB peptide detections are consistent with existing score functions

Next we compared the set of peptide detections from Casanovo-DB to those obtained by the other three score functions. The existing score functions we benchmark against are widely used by the mass spectrometry community and give trusted results, so any new score function should produce results consistent with existing methods. Reassuringly, we observe a very high overlap between Casanovo-DB and the other score functions, with very few peptides reproducibly detected by other score functions but not by Casanovo-DB (Fig. 3). On the other hand, Casanovo-DB detects a large number of peptides (9442) that are not detected by other score functions at 1% FDR. Further investigation revealed that 6812 (72.1%) of the peptides detected only by Casanovo-DB are successfully detected by at least one other search engine when we relax the FDR threshold to 10%. Overall, these results indicate that using Casanovo-DB as a score function for database search has the potential to yield new discoveries while also reproducing the findings obtained from existing methods.

An upset plot showing the overlap in peptide detections at 1% FDR between Casanovo-DB, Tide, SAGE, and MaxQuant on the human dataset.
Figure 3.

An upset plot showing the overlap in peptide detections at 1% FDR between Casanovo-DB, Tide, SAGE, and MaxQuant on the human dataset.

To better understand the differences among the peptides detected by different methods, we separately analyzed three subsets of peptides: the 54 peptides that were detected by all three competing search methods but not CasanovoDB, the 9442 peptides that were detected by Casanovo but none of the other three methods, and the 13 495 peptides detected by all four methods. For each set, we investigated the peptide length distribution, the rate of variable modifications, and the frequencies of amino acids, overall and at the C-terminus. The results of this analysis (Supplementary Table S1) do not show striking differences among the various subsets of peptides in terms of amino acid distributions, either overall or at the C-terminus. However, we do observe a large shift in average peptide length, with much shorter peptides (9.3 amino acids on average) among the Casanovo-DB-only subset than in either of the other two subsets (15.3 and 18.5 amino acids). This is perhaps not surprising, because Casanovo-DB is able to infer peptide identities based not only on the presence of various types of fragmentation events, but also based on the relative heights of the associated peaks. The trends with respect to post-translational modifications are less clear, with the lowest proportion of modified peptides (11.4%) among the peptides identified by all four methods, the highest (37.1%) among the peptides identified by the other three search engines, and an intermediate proportion (22.3%) among peptides identified only by Casanovo-DB.

Finally, to further understand what new peptides are being detected by Casanovo-DB, we investigated how the performance of each search engine varies by peptide mass. Specifically, we segregated spectra according to their associated precursor m/z values to find those in the top quartile (704–1389 m/z) and bottom quartile (350–467 m/z). We then performed TDC separately on each subset. We find that all of the search engines tend to make mistakes on shorter peptides at a much higher rate, while performing significantly better on longer peptides (Fig. 4). We also performed a similar analysis, but including Percolator post-processing (Supplementary Fig. S1), and observed similar trends. Overall, this analysis explains why, before Percolator re-scoring, Casanovo-DB exhibits worse performance than existing score functions at very small FDR thresholds: there are a handful of very short or highly charged decoys which Casanovo-DB assigns a high PSM score to. This also explains why Percolator re-scoring improves our model by such a large margin, because Percolator is able to re-calibrate Casanovo-DB’s overconfident predictions on low m/z precursors.

Each figure plots the number of peptides detected as a function of FDR threshold for the E.coli dataset, broken down by whether the precursor had an m/z in (a) the bottom quartile range of 350–467 m/z or (b) the top quartile range of 704–1389 m/z. In each plot, the series correspond to different score functions, and the 1% FDR threshold is highlighted with a red dashed line. Casanovo-DB performs much worse on low m/z precursors compared to those with high m/z, exemplifying the calibration problems which are resolved by Percolator.
Figure 4.

Each figure plots the number of peptides detected as a function of FDR threshold for the E.coli dataset, broken down by whether the precursor had an m/z in (a) the bottom quartile range of 350–467 m/z or (b) the top quartile range of 704–1389 m/z. In each plot, the series correspond to different score functions, and the 1% FDR threshold is highlighted with a red dashed line. Casanovo-DB performs much worse on low m/z precursors compared to those with high m/z, exemplifying the calibration problems which are resolved by Percolator.

4 Discussion

Our results show that a trained de novo sequencing model can be directly repurposed to carry out mass spectrometry database search, yielding a greater number of detected peptides while controlling for false discoveries. Tandem mass spectrometry assays are commonly used to understand disease progression, accelerate drug development, and deepen our understanding of various proteomes. Database search is the workhorse method for detecting peptides and proteins from tandem mass spectrometry experiments carried out in hundreds of labs around the world. Therefore, increasing the statistical power of scoring functions in this workflow could lead to important discoveries in downstream analyses.

An outstanding challenge in mass spectrometry database search is detection of short peptides, since a relatively short theoretical spectrum provides limited statistical evidence for a good match to an observed spectrum. Our analysis in Fig. 4 suggests that this challenge remains, even for CasanovoDB, although the mean peptide lengths reported in Supplementary Table S1 support the idea that Casanovo-DB gains some of its statistical power by doing a better job of detecting short peptides. An open question is whether a fine tuning strategy might be developed to further boost our ability to accurately detect short peptides.

The performance of Casanovo-DB is particularly noteworthy because it is, to our knowledge, the first score function to be learned purely from massive amounts of labeled mass spectrometry data. Hand designed score functions like XCorr struggle to capture the complexities of how peptides fragment in a mass spectrometer, and while post-processing tools can address calibration, they do not improve the power of the score function itself. The success of Casanovo-DB is also especially exciting given the fact that the original task that Casanovo was trained on (de novo sequencing) is only tangentially related to the problem of mass spectrometry database search. This suggests that fine-tuning a Casanovo model directly on the task of maximizing peptide detections during database search may further improve detection power.

While there are still a number of steps necessary to make Casanovo-DB usable in practice, in particular addressing the computational cost and hardware requirements inherent to deploying large deep learning models, our results are a compelling first step toward using a fully learned score function to improve the power of database search. With further optimization and fine-tuning, we see the potential to improve standard mass spectrometry analysis pipelines with a learned PSM score function that outperforms traditional hand-designed score functions by a large margin.

Author contributions

J.S. and W.S.N. conceived of the project. V.A. developed the software. V.A., J.S., M.Y., S.O., and W.S.N. planned experiments and analyzed the results. B.W. helped benchmark existing search engines. V.A., J.S., and W.S.N. wrote the manuscript.

Supplementary data

Supplementary data are available at Bioinformatics online.

Conflict of interest

None declared.

Funding

This work was supported in part by National Science Foundation award [2245300].

References

Bai
W
,
Bilmes
JA
,
Noble
WS.
Bipartite matching generalizations for peptide identification in tandem mass spectrometry. In: ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, Seattle, WA, New York, NY, USA: Association for Computing Machinery
2016
,
327
36
.

Cox
J
,
Neuhauser
N
,
Michalski
A
et al.
Andromeda: a peptide search engine integrated into the MaxQuant environment
.
J Proteome Res
2011
;
10
:
1794
805
.

Craig
R
,
Beavis
RC.
Tandem: matching proteins with tandem mass spectra
.
Bioinformatics
2004
;
20
:
1466
7
.

Diament
B
,
Noble
WS.
Faster SEQUEST searching for peptide identification from tandem mass spectra
.
J Proteome Res
2011
;
10
:
3871
9
.

Elias
JE
,
Gygi
SP.
Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry
.
Nat Methods
2007
;
4
:
207
14
.

Eloff
K
,
Kalogeropoulos
K
,
Morell
O
et al. De novo peptide sequencing with InstaNovo: Accurate, database-free peptide identification for large scale proteomics experiments. bioRxiv, https://doi.org/10.1101/2023.08.30.555055,
2023
, preprint: not peer reviewed.

Eng
JK
,
McCormack
AL
,
Yates
JR.
III
.
An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database
.
J Am Soc Mass Spectrom
1994
;
5
:
976
89
.

Eng
JK
,
Jahan
TA
,
Hoopmann
MR.
Comet: an open source tandem mass spectrometry sequence database search tool
.
Proteomics
2012
;
13
:
22
4
.

Ge
C
,
Lu
Y
,
Qu
J
et al. DePS: an improved deep learning model for de novo peptide sequencing. arXiv, arXiv:2203.08820,
2022
, preprint: not peer reviewed.

Griss
J
,
Jones
AR
,
Sachsenberg
T
et al.
The mzTab data exchange format: communicating mass-spectrometry-based proteomics and metabolomics experimental results to a wider audience
.
Mol Cell Proteomics
2014
;
13
:
2765
75
.

Jin
Z
,
Xu
S
,
Zhang
X
et al. ContraNovo: a contrastive learning approach to enhance de novo peptide sequencing. arXiv, arXiv:2312.11584,
2023
, preprint: not peer reviewed.

Käll
L
,
Canterbury
J
,
Weston
J
et al.
A semi-supervised machine learning technique for peptide identification from shotgun proteomics datasets
.
Nat Methods
2007
;
4
:
923
5
.

Karunratanakul
K
,
Tang
H-Y
,
Speicher
DW
et al.
Uncovering thousands of new peptides with sequence-mask-search hybrid de novo peptide sequencing framework
.
Mol Cell Proteomics
2019
;
18
:
2478
91
.

Keich
U
,
Noble
WS.
On the importance of well calibrated scores for identifying shotgun proteomics spectra
.
J Proteome Res
2015
;
14
:
1147
60
.

Kessner
D
,
Chambers
M
,
Burke
R
et al.
Proteowizard: open source software for rapid proteomics tools development
.
Bioinformatics
2008
;
24
:
2534
6
.

Klammer
AA
,
Park
CY
,
Noble
WS.
Statistical calibration of the Sequest XCorr function
.
J Proteome Res
2009
;
8
:
2106
13
.

Klaproth-Andrade
D
,
Hingerl
J
,
Bruns
Y
et al.
Deep learning-driven fragment ion series classification enables highly precise and sensitive de novo peptide sequencing
.
Nat Commun
2024
;
15
:
151
.

Lazear
MR.
Sage: an open-source tool for fast proteomics searching and quantification at scale
.
J Proteome Res
2023
;
22
:
3652
9
.

Lee
S
,
Kim
H.
Bidirectional de novo peptide sequencing using a transformer model
.
PLoS Comput Biol
2024
;
20
:
e1011892
.

Lin
A
,
Short
T
,
Noble
WS
et al.
Improving peptide-level mass spectrometry analysis via double competition
.
J Proteome Res
2022
;
21
:
2412
20
.

Lin
A
,
See
D
,
Fondrie
WE
et al.
Target-decoy false discovery rate estimation using crema
.
Proteomics
2024
;
24
:
e2300084
.

Liu
K
,
Ye
Y
,
Li
S
et al.
Accurate de novo peptide sequencing using fully convolutional neural networks
.
Nat Commun
2023
;
14
:
7974
.

Mao
Z
,
Zhang
R
,
Xin
L
et al.
Mitigating the missing fragmentation problem in de novo peptide sequencing with a two stage graph-based deep learning model
.
Nat Mach Intell
2023
;
5
:
1250
60
.

Nesvizhskii
AI.
A survey of computational methods and error rate estimation procedures for peptide and protein identification in shotgun proteomics
.
J Proteomics
2010
;
73
:
2092
123
.

Park
CY
,
Klammer
AA
,
Käll
L
et al.
Rapid and accurate peptide identification from tandem mass spectra
.
J Proteome Res
2008
;
7
:
3022
7
.

Polasky
DA
,
Geiszler
DJ
,
Yu
F
et al.
MSFragger-labile: A flexible method to improve labile PTM analysis in proteomics
.
Mol Cell Proteomics
2023
;
22
:
100538
.

Qiao
R
,
Tran
NH
,
Xin
L
et al.
Computationally instrument-resolution-independent de novo peptide sequencing for high-resolution devices
.
Nat Mach Intell
2021
;
3
:
420
5
.

Tran
NH
,
Zhang
X
,
Xin
L
et al.
De novo peptide sequencing by deep learning
.
Proc Natl Acad Sci USA
2017
;
114
:
8247
52
.

Van Puyvelde
B
,
Daled
S
,
Willems
S
et al.
A comprehensive LFQ benchmark dataset on modern day acquisition strategies in proteomics
.
Sci Data
2022
;
9
:
126
. https://doi.org/10.1038/s41597-022-01216-6

Vaswani
A
,
Shazeer
N
,
Parmar
N
et al.
Attention is all you need
.
Adv Neural Inf Process Syst
2017
;
30
:
6000
10
.

Xu
X
,
Yang
C
,
He
Q
et al.
PGPointNovo: an efficient neural network-based tool for parallel de novo peptide sequencing
.
Bioinform Adv
2023
;
3
:
vbad057
.

Yang
H
,
Chi
H
,
Zeng
W
et al.
pNovo 3: precise de novo peptide sequencing using a learning-to-rank framework
.
Bioinformatics
2019
;
35
:
i183
90
.

Yang
T
,
Ling
T
,
Sun
B
et al.
Introducing π-HelixNovo for practical large-scale de novo peptide sequencing
.
Briefings in Bioinformatics
2024
;
25
:
bbae021
.

Yilmaz
M
,
Fondrie
WE
,
Bittremieux
W
et al. De novo mass spectrometry peptide sequencing with a transformer model. In: Proceedings of the 39th International Conference on Machine Learning - ICML '22, Baltimore MD, USA,
2022
,
25514
22
.

Zolg
DP
,
Wilhelm
M
,
Schmidt
T
et al.
ProteomeTools: systematic characterization of 21 post-translational protein modifications by liquid chromatography tandem mass spectrometry (LC-MS/MS) using synthetic peptides
.
Mol Cell Proteomics
2018
;
17
:
1850
63
.

Author notes

Varun Ananth and Justin Sanders Equal contribution.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Supplementary data