HiXCorr: a portable high-speed XCorr engine for high-resolution tandem mass spectrometry

Abstract

Summary: Peptide identification is an important problem in proteomics. One of the most popular scoring schemes for peptide identification is X_Corr (cross-correlation). Because calculating X_Corr is computationally intensive, considerable effort has gone into developing fast X_Corr engines. However, the existing X_Corr engines are not well suited to high-resolution MS/MS spectra because they are either slow or require a specific type of CPU. We present a portable high-speed X_Corr engine for high-resolution tandem mass spectrometry based on a novel algorithm for calculating X_Corr. The algorithm makes X_Corr calculation 1.25–49 times faster than previous algorithms for 0.01 Da fragment tolerance. Furthermore, our engine is easily portable to machines with different types of CPU because it is written in C. Hence, our X_Corr engine will expedite peptide identification by high-resolution tandem mass spectrometry.

Availability and implementation: Available at http://isa.hanyang.ac.kr/HiXCorr/HiXCorr.html.

Contact: [email protected]

Supplementary information: Supplementary data are available at Bioinformatics online.

1 Introduction

Proteomics (Wilkins et al., 1997) is the study of proteins, particularly their expression, structures, functions and interactions. Because proteins play important roles in the human body, correct protein (sequence) identification (Steen et al., 2004) is very important. High-throughput protein identification is generally done by cleaving a protein into peptides, acquiring tandem mass (MS/MS) spectra of the peptides and analyzing the spectra to identify the peptide sequences.

SEQUEST (Eng et al., 1994) is one of the most widely used computer programs for peptide identification from MS/MS spectrum analysis. It compares an experimental spectrum with theoretical spectra computationally created from sequences in a peptide database, and finds the theoretical spectrum most similar to the experimental spectrum. To measure the similarity between the theoretical and experimental spectra, SEQUEST uses a sophisticated scoring scheme, X_Corr (cross-correlation).

However, calculating X_Corr can be very slow and consumes most of the running time of SEQUEST, so considerable effort has gone into overcoming this speed issue. The original SEQUEST used the fast Fourier transform algorithm (Cormen et al., 2001) to make the X_Corr calculation faster. Later, Crux (Eng et al., 2008) improved the calculation speed of X_Corr by using a precomputation table, which is also used in modern SEQUEST and TurboSEQUEST. Tide (Diament and Noble, 2011) calculates X_Corr faster still; it was optimized for x86 machines by including x86 assembly code. Later, a portable Tide was developed in C with exact P-value computation capability (Hobert and Noble, 2014). To distinguish these two Tide versions, we will call the earlier version with x86 assembly code Tide-x86 and the later portable version Tide-C. Modern processors have multiple cores and support multithreading. Comet (Eng et al., 2013), an open-source MS/MS search tool based on X_Corr, supports multithreading for X_Corr calculation. Thus, the more processors and cores a machine has, the faster Comet runs.

Nowadays, more and more spectra are being acquired by high-resolution mass spectrometers. For example, Q-Exactive Orbitrap hybrid mass spectrometers (Thermo Scientific, Bremen, Germany) generate large numbers of high-resolution MS/MS spectra whose fragment ion mass accuracy is within 0.01 Da. In addition, ultra-high-resolution spectra whose fragment ion mass accuracy is <0.01 Da are expected to be generated in the near future. For high-resolution MS/MS spectra, calculating X_Corr becomes much slower and consumes most of the running time of a peptide identification program. For example, the X_Corr engines in Tide-x86 and Tide-C run 6.6 and 20 times slower, respectively, when the fragment tolerance is 0.01 Da than when it is 0.1 Da (Fig. 1a and Supplementary Table S1). Comet shows similar behavior as the resolution gets higher (Fig. 1b, Supplementary Table S2 and Supplementary Fig. S1).

Fig. 1. (a) Total running times of Tide-C, Tide-x86 and Tide-Hi; (b) total running times of Comet-Sparse and Comet-Hi. The MS/MS data were generated by the Clinical Proteomic Tumor Analysis Consortium (NCI/NIH) and are explained in detail in the Supplementary Data.

The existing X_Corr engines run slower for high-resolution spectra because they require more memory as the resolution gets higher: they create an O(m/f)-sized mass bin array for X_Corr calculation, where m is the precursor mass and f is the fragment ion mass accuracy. For example, for a low-resolution spectrum whose precursor mass is 1000 Da and fragment tolerance is 1 Da, they create an array whose size is around 1000. However, for a high-resolution spectrum whose precursor mass is 1000 Da and fragment tolerance is 0.01 Da, they create an array whose size is around 100 000. Comet offers a partial solution to this. When it runs with “use_sparse_matrix=1” in the parameter file, it first creates a huge mass bin array and then compresses the array. We will call this configuration Comet-Sparse.
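The growth of the mass bin array can be checked with a few lines of arithmetic. The sketch below is only an illustration of the m/f estimate given above; the helper name `bin_count` is hypothetical and does not come from any of the tools discussed, and real engines may add padding or offsets to the array.

```python
def bin_count(precursor_mass: float, fragment_tol: float) -> int:
    """Approximate length of a dense mass bin array: m / f, rounded.

    Hypothetical helper for illustration; it ignores any padding or
    bin offset that a real search engine may apply.
    """
    return round(precursor_mass / fragment_tol)


# Precursor mass fixed at 1000 Da, fragment tolerance shrinking.
for tol in (1.0, 0.1, 0.01, 0.001):
    print(f"f = {tol:g} Da -> about {bin_count(1000.0, tol):,} bins")
```

Each tenfold improvement in fragment tolerance multiplies the array length by ten, while the number of peaks in the spectrum stays the same.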

2 Results

In this article, we present a portable high-speed X_Corr engine that does not create a mass bin array at all and instead calculates X_Corr directly from the peak list. Thus, it runs in O(p) time, where p is the number of peaks in a spectrum, while all the previous engines are based on X_Corr algorithms running in O(m/f) time, where m is the precursor mass and f is the fragment tolerance (pseudocode is available in the Supplementary Data).
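To make the contrast concrete, the following sketch scores a spectrum from a sorted peak list without allocating a dense array, and checks the result against a dense-array version. It is not the authors' algorithm (that is given in the Supplementary Data); it is one simple way to avoid the bin array, using prefix sums and binary search. The function names, the ±75-bin background window and the normalisation by twice the window are all assumptions made for this illustration.

```python
from bisect import bisect_left, bisect_right
from itertools import accumulate


def sparse_score(peaks, theo_bins, window=75):
    """Score from a peak list only (illustrative, not the HiXCorr method).

    peaks: (bin_index, intensity) pairs sorted by bin_index, unique bins.
    theo_bins: bin indices of the theoretical fragment ions.
    For each theoretical bin b, adds y[b] minus the mean of the
    neighbouring intensities within +/- window bins (assumed scheme).
    """
    bins = [b for b, _ in peaks]
    # prefix[i] = sum of the first i peak intensities
    prefix = [0.0] + list(accumulate(v for _, v in peaks))
    total = 0.0
    for b in theo_bins:
        lo = bisect_left(bins, b - window)
        hi = bisect_right(bins, b + window)
        win_sum = prefix[hi] - prefix[lo]
        i = bisect_left(bins, b)
        centre = peaks[i][1] if i < len(bins) and bins[i] == b else 0.0
        total += centre - (win_sum - centre) / (2 * window)
    return total


def dense_score(peaks, theo_bins, n_bins, window=75):
    """Same score computed through an O(m/f)-sized array, for comparison."""
    y = [0.0] * n_bins
    for b, v in peaks:
        y[b] = v
    total = 0.0
    for b in theo_bins:
        lo, hi = max(0, b - window), min(n_bins, b + window + 1)
        total += y[b] - (sum(y[lo:hi]) - y[b]) / (2 * window)
    return total


# Toy spectrum: four peaks in a 100 000-bin mass range (1000 Da at 0.01 Da).
peaks = [(12_010, 10.0), (12_050, 4.0), (45_731, 7.0), (88_200, 3.0)]
theo = [12_010, 45_700, 70_000]
print(sparse_score(peaks, theo), dense_score(peaks, theo, 100_000))
```

In this sketch the work depends on the number of peaks and theoretical ions (with a logarithmic factor from the binary search) rather than on m/f, so tightening the fragment tolerance does not enlarge any data structure.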

We compared our X_Corr engine with previous engines on a machine with an Intel Core i7-3770K CPU (3.50 GHz) and 32 GB RAM under the CentOS 6.6 operating system and the GNU C compiler 4.4.7. First, we integrated our X_Corr engine into Tide-C and named the result Tide-Hi. We compared Tide-Hi with Tide-C and Tide-x86. Since Tide-x86 does not calculate the exact P-value, we compared them without exact P-value calculation. Figure 1a and Supplementary Table S1 show that Tide-Hi is 49 times faster than Tide-C in X_Corr calculation and 45 times faster in total running time when the fragment tolerance is 0.01 Da. The running time gap between Tide-Hi and Tide-C gets bigger as the resolution gets higher. Tide-Hi is even 1.25 times faster than Tide-x86 in both X_Corr calculation and total running time for 0.01 Da fragment tolerance. (Note that Tide-Hi is written in C, whereas Tide-x86 includes x86 assembly code.) Second, we integrated our X_Corr engine into Comet-Sparse and named the result Comet-Hi. (Comet without the sparse option requires much more memory to run on high-resolution data.) Figure 1b and Supplementary Table S2 show that Comet-Hi runs 2.4 times faster than Comet-Sparse for 0.01 Da fragment tolerance when eight threads are enabled. The gap between Comet-Hi and Comet-Sparse also gets bigger as the resolution gets higher when eight threads are used. Supplementary Figure S1 shows similar patterns for one, two and four threads.

3 Conclusion

We present a portable high-speed X_Corr engine for high-resolution tandem mass spectrometry based on a novel algorithm, which makes X_Corr calculation 1.25–49 times faster than before for 0.01 Da fragment tolerance. When the fragment tolerance is 0.001 Da, our engine runs 1000 times faster than Tide-C’s X_Corr engine, 20 times faster than Comet-Sparse’s and 11 times faster than Tide-x86’s X_Corr engine (Fig. 1 and Supplementary Data). Furthermore, our engine is easily portable to almost every machine because it is written in C. Optimizing our engine for x86 machines by embedding x86 machine code is a possible topic for future research. Since the X_Corr score is widely used in peptide identification, this article may be useful for the community. Finally, we did not trade correctness for efficiency: our X_Corr engine calculates the same X_Corr score as Tide and Comet do (Supplementary Theorem 2).

Funding

This work was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (2012-0006999) and also by the National Research Foundation of Korea [NRF-2012M3A9B9036676, NRF-2014R1A2A1A11054147, NRF-2012M3A9D1054452].

Conflict of Interest: none declared.

References

Cormen T.H. et al. (2001) Introduction to Algorithms, 2nd edn. MIT Press, Cambridge, MA.

Diament B.J., Noble W.S. (2011) Faster SEQUEST searching for peptide identification from tandem mass spectra. J. Proteome Res., 3871–3879.

Eng J.K. et al. (1994) An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom., 976–989.

Eng J.K. et al. (2008) A fast SEQUEST cross correlation algorithm. J. Proteome Res., 4598–4602.

Eng J.K. et al. (2013) Comet: an open-source MS/MS sequence database search tool. J. Proteomics.

Hobert J.J., Noble W.S. (2014) Computing exact p-values for a cross-correlation shotgun proteomics score function. J. Mol. Cell. Proteomics, 2467–2479.

Steen et al. (2004) The ABC’s (XYZ’s) of peptide sequencing. Nat. Rev. Mol. Cell Biol., 699–711.

Wilkins M.R. et al. (1997) Proteome Research: New Frontiers in Functional Genomics, 1st ed. Springer, New York.

Author notes

Associate Editor: Ziv Bar-Joseph
