-
PDF
- Split View
-
Views
-
Cite
Cite
Zhaohui Zhan, Lusheng Wang, Proteoform identification and quantification based on alignment graphs, Bioinformatics, Volume 41, Issue 1, January 2025, btaf007, https://doi.org/10.1093/bioinformatics/btaf007
- Share Icon Share
Abstract
Proteoforms are the different forms of a proteins generated from the genome with various sequence variations, splice isoforms, and post-translational modifications. Proteoforms regulate protein structures and functions. A single protein can have multiple proteoforms due to different modification sites. Proteoform identification is to find proteoforms of a given protein that best fits the input spectrum. Proteoform quantification is to find the corresponding abundances of different proteoforms for a specific protein.
We proposed algorithms for proteoform identification and quantification based on the top-down tandem mass spectrum. In the combination alignments of the HomMTM spectrum and the reference protein, we need to give a correction of the mass for each matched peak within the pre-defined error range. After the correction, we impose that the mass between any two (not necessarily consecutive) matched nodes in the protein is identical to that of the corresponding two matched peaks in the HomMTM spectrum. We design a back-tracking graph to store such kind of information and find a combinatorial path (k paths) with the minimum sum of peak intensity error in this back-tracking graph. The obtained alignment can also show the relative abundance of these proteoforms (paths). Our experimental results demonstrate the algorithm’s capability to identify and quantify proteoform combinations encompassing a greater number of peaks. This advancement holds promise for enhancing the accuracy and comprehensiveness of proteoform quantification, addressing a crucial need in the field of top-down MS-based proteomics.
The software package are available at https://github.com/Zeirdo/TopMGQuant.
1 Introduction
In top-down MS, multiplexed tandem mass (MTM) spectra are ubiquitous. An MTM spectrum arises when two or more proteoforms are analyzed using tandem MS, where these proteoforms remain indistinguishable within the spectrum and can not be well-separated by separation methods due to their similar masses (DiMaggio et al. 2009). The presence of the MTM spectra is not only due to the intrinsic complexity of the proteome, where a single gene can give rise to numerous proteoforms through processes such as alternative splicing, PTMs, and allelic variations but also due to limitations inherent to biological instrumentation (Zhu and Liu 2018). The high-throughput MS, while advantageous for proteomic research, often results in the co-ionization and co-fragmentation of proteoforms. This will also lead to MTM spectra where peaks corresponding to different proteoforms may overlap (Wang et al. 2010).
However, the study of these MTM spectra is critically important in top-down proteomic analysis (Wang et al. 2014). The ability to identify distinct proteoforms in an MTM spectrum and to accurately measure their relative abundances is crucial for a comprehensive understanding of the proteome’s functional landscape (Zhu and Liu 2018). This task is challenging due to the spectral overlap of co-eluting proteoforms, which can mask the presence of less abundant species and confound quantitative assessments. Precise identification and quantification are essential not only for elucidating the biological roles of individual proteoforms but also for identifying potential biomarkers and understanding the molecular underpinnings of diseases (Schaffer et al. 2019). The accurate representation of proteoform abundance is also critical for comparative studies, such as those investigating differential protein expression under various physiological conditions or in response to therapeutic interventions (Cesnik et al. 2018). Therefore, the development of methodologies that can effectively identify and quantify proteoforms from complex MTM spectra is a critical step toward top-down MS in proteomics research.
MTM spectra can be categorized into two types: heterogeneous MTM (HetMTM) spectra and homogeneous MTM (HomMTM) spectra. HetMTM spectra originate from the proteoforms of two or more different proteins, whereas HomMTM spectra are derived from proteoforms of a single protein that exhibit variations in their PTM patterns (Zhu and Liu 2018). The challenge with HomMTM spectra lies in the precise identification for the differences between the proteoforms and quantification for the relative abundance, which requires a detailed understanding of the potential modifications and their mass shifts (Kafader et al. 2020). Here, our research is based on HomMTM spectra. The task for HomMTM spectra analysis is multifaceted. Firstly, it involves identifying the original protein from which the proteoforms are arisen. This step is crucial as it is the guarantee for further characterization of the proteoforms and fundamental interpretation of the subsequent variations observed in the proteoforms. Once the original protein is identified, the next challenge is to determine the specific variant forms of each proteoform present in the HomMTM spectrum. In addition to identifying the proteoforms and their modifications, another critical aspect of the analysis is to quantify the relative abundance of each proteoform within the HomMTM spectrum. Accurate quantification is essential for understanding the biological significance of each variant, as changes in the abundance of specific proteoforms can be indicative of physiological or pathological processes.
Numerous studies have already delved into the computational analysis based on HomMTM spectrum (Wang et al. 2010, 2014). For instance, Zhu et al. proposed a graph-based algorithm for the identification and quantification of proteoforms. They referred to the maximum k-splittable flow (MkSF) problem and transformed the HomMTM spectrum quantification problem into a minimum error k-splittable flow (MEkSF) problem (Zhu and Liu 2018). By constructing a graph that represents all different proteoforms that arise from the same protein, the algorithm can efficiently identify and quantify precise proteoforms with relative abundances in a polynomial time. In these computational analysis algorithms, handling the mass errors of peaks in the input spectrum is a critical issue. Traditional methods use an error tolerance value and approximately regard the masses within this margin of error tolerance value to be equal (Kou et al. 2016). However, this could lead to significant error accumulation if positive (or negative) errors occur consecutively (for details, see the Supplementary Material S1). The problem is so severe that some existing software packages include a step to further refine the alignments (Kou et al. 2017), indicating the necessity for more efficient error handling strategies.
To address this, Zhan and Wang (2024) proposed a new proteoform identification algorithm for error correction of peaks. They introduced the offset of peaks for imposing the masses to be identical all the time. According to this new algorithm, we can obtain more reliable global alignments within a reasonable time and further overall the quality of proteomics analysis. It is also the first time that global optimal error correction alignments can be obtained using real data sets. Therefore, we think it is not only viable but also highly beneficial to extend this algorithm for the quantification of HomMTM spectra. By adopting the error correction algorithm for HomMTM spectra, we can more accurately account for the unique challenges presented in the multiple proteoforms identification and quantification of HomMTM spectra.
Here, we referred to this error correction algorithm and proposed an algorithm for proteoform quantification based on the top-down tandem mass spectrum. In the combination alignments of the HomMTM spectrum and the reference protein, we need to give a correction of the mass for each matched peak within the pre-defined error range. After the correction, we impose that the mass between any two (not necessarily consecutive) matched nodes in the protein is identical to that of the corresponding two matched peaks in the HomMTM spectrum. We design a back-tracking graph to store such kind of information and find a combinatorial path (k paths) with the minimum sum of peak intensity error in this back-tracking graph. The obtained alignment can also show the relative abundance of these proteoforms (paths). Our experimental results demonstrate the algorithm’s capability to identify and quantify proteoform combinations encompassing a greater number of peaks. This advancement holds promise for enhancing the accuracy and comprehensiveness of proteoform quantification, addressing a crucial need in the field of top-down MS-based proteomics.
2 Materials and methods
A protein may have multiple possible proteoforms. For example, phosphorylation (Ph) may occur at the serine (S), threonine (T), or tyrosine (Y) sites and acetylated (Ac) may occur at the lysine (K) sites. An example is given in Fig. 1A, where the original protein segment is TKMKY, there are totally 16 possible proteoforms and we list 7 of them as types 2, 3, …, 8 here. The possible mass shifts of those 7 proteoforms are given in Fig. 1B. We can see that a mass shift may correspond to more than one proteoform, e.g. both type 6 and type 7 have one Ac modification (at different locations).

An example of PTM pattern inference problem. (A) Eight possible proteoforms for the protein segment TKMKY. (B) The mass shifts, corresponding PTM patterns, and mass shift differences for these eight possible proteoforms.
The mass spectrum generated from the proteoforms of the protein contains a set of peaks with lots of noise, and each peak has a corresponding intensity. Figure 2A is a small example, where black peaks are noises. Each proteoform has a set of theoretical peaks indicating the masses of the segments. The theoretical peaks for types 6 and 7 proteoforms are given in Fig. 2B and C, respectively. To identify a proteoform, we need to find/select peaks from the input mass spectrum that best match the theoretical mass spectrum of those proteoforms. Figure 2D and E are the selected peaks for types 6 and 7 proteoforms. To identify a proteoform, we can obtain a sequence of matched peak pairs, e.g. (101, 102) (271, 270), (402, 402), (530, 529), (693, 693) for type 6 proteoform. Such a sequence of matched pairs is also referred to as an alignment. Note that the mass values of peaks from the mass spectrum have errors and the masses of matched peaks are allowed to be slightly different from the theoretical peaks.

(A) An example HomMTM spectrum containing both types 6 and 7 proteoforms in Fig. 1A. (B) The theoretical spectrum for type 6 proteoform. (C) The theoretical spectrum for type 7 proteoform. (D) The alignment of type 6 proteoform, where solid peaks are real peaks from the input mass spectrum and dotted peaks are theoretical. (E) The alignment of type 7 proteoform, where solid peaks are real peaks from the input mass spectrum and dotted peaks are theoretical. Note that, in both (D) and (E), the real peaks have masses slightly different from that of theoretical peaks. The intensities of real peaks are also slightly different from the theoretical ones.
2.1 Overview of the method
Given a HomMTM spectrum H, and its original protein sequence G, the task is to find the multiple proteoforms corresponding to the HomMTM spectrum H and their corresponding abundance. If the original protein sequence G is not known, one can use existing database search methods such as TopMG (Kou et al. 2017) or TopMGFast (Zhan and Wang 2024) to obtain G.
The first sub-task is to identify all possible proteoforms based on H and G. Our idea is to extend the dynamic programming algorithms for alignment of a spectrum mass graph with a protein mass graph, where each matched proteoform is represented by an alignment between some peaks in H and some nodes in G (Kou et al. 2017, Zhan and Wang 2024). The dynamic programming algorithms have an efficient way to compute the largest size alignments by using a matrix. The matrix can also be viewed as a graph, where each cell is viewed as a node. In fact, it is well-known that the matrix contains information for all kinds of alignments and thus all kinds of matched proteoforms. Here we design an algorithm to construct a sub-graph named backtracking graph based on the matrix, which contains all kind of matched proteoforms interesting to us. Once we have the backtracking graph representing all kinds of possible matched proteoforms, we will try to identify k (k = 2 in our paper) paths (alignments) in the backtracking graph to get the k reliable matched proteoforms by using the peak intensity values, and consequently obtain the values of abundance for the k matched proteoforms. The algorithm is given in Section 2.5.
2.2 Identification of proteoforms for a given protein
For a given protein sequence, the concept of proteoform mass graph (PMG for short) was proposed in Kou et al. (2017) to represent all possible proteoforms for the given protein in a concise way. In a PMG of a protein with n amino acids, there are n + 1 nodes organized from left to right in linear order corresponding to the left and right ends of the n amino acids. A black directed edge connecting two neighbor nodes corresponds to an amino acid, with the weight representing the mass of this amino acid. For every potential modification to an amino acid, an additional red edge is added with the modified mass between corresponding nodes. A PMG stores the corresponding normal protein with all its possible proteoforms.
A spectrum is formulated as a spectrum mass graph (SMG for short). In an SMG, each node corresponds to a unique peak, with the associated mass assigned as the weight. A directed edge is added between a pair of nodes, with its weight as the difference in the weights of the two corresponding nodes.
To simplify the further analysis, we transformed all edges in PMG and MMG into integers through scaling and rounding techniques (Liu et al. 2013).
Traditionally, the problem of finding a proteoform of a given protein is formulated as finding an alignment between the two graphs.
2.2.1 Alignment of PMG and SMG
Let x0, x1, …, xn be the n nodes in a PMG G and y0, y1, …, ym be the m nodes in a MMG H. We use mi to represent the mass value of node/peak yi in H. M = mm is the mass value of the proteoform corresponding to the input mass spectrum. Let δ be a user defined number for error tolerance. An alignment of G and H with length s is a sequence of matched pairs , where , and there exists a path between node and node in G with total mass in the range for .
In practice, we impose the condition that and to match the whole mass spectrum to a segment in G.
The above alignment definition imposes that the mass difference between two consecutive matched peaks is roughly the same (with error tolerance value δ) as the mass of the corresponding sub-path in G. This method may suffer from error accumulation, where k consecutive positive/negative errors in the alignment will lead to an error of between a pair of peaks in H. This is certainly evidence that such an alignment is not reliable.
To get more realistic alignments, Zhan and Wang (2024) proposed a model considering peak error correction.
2.2.2 Peak error correction alignment
A length r peak error correction alignment for a PGM G and SMG H is a sequence of triples , where is the error correction value for peak such that , where is the mass for the peak and is the mass of a path between node and node . Note that δ represents the error tolerance value for the mass of a peak. The proteoform identification problem is to find a set of error correction alignments with the maximum alignment size between G and H.
We denote the practical peak intensity of the node/peak yi in H as p(i). The most similar alignment for an alignment obtained from the individual alignment problem is an alignment with the theoretical relative abundance h such that the sum of peak intensity error between h and p(i) is minimized.
2.3 An algorithm for identification of proteoforms
To obtain a set of proteoforms best fitting the SMG, we want to compute the longest length peak error correction alignment between a PMG and an SMG. The following algorithm was proposed by Zhan and Wang (2024).
Let be the maximum size of an error correction alignment between the first i peaks in H and the first j nodes in G such that the corrected peak of yi has the mass value and yi in H matches xj in G. When computing for every and , a dynamic programming algorithm can be used to simplify the process. Here they re-compute the error correction range of ki to be for every peak yi to eliminate redundant calculations when computing , since there may exist an overlap between the error correction ranges of some close peaks.
Let d(s, j) denote the set of distinct path masses from node xs to node xj in G. For each mass , there exists a list of nodes in H that corresponds to peaks with mass values within the range . Define as the sorted list of such nodes. These sorted nodes can be aligned with node xs in G.
2.4 Constructing backtracking graph
In order to find all possible proteoforms that best fit the input mass spectrum, we constructed a graph during the backtracking process. Note that, we are interested in alignments ending with ym, the last peak with mass value M = mm. Thus, we will find a set containing all cells with the largest value among all cells for . If there exist k and such that , We only need to consider one of k and . The reason is that the range of k and is much smaller than the mass value of a possible (modified or unmodified) amino acid and different k values cannot lead to different optimal alignments (see Zhan and Wang 2022). We can do backtracking starting with all the cells in . For each preceding cell used in Equation (1) that leads to the maximum value of , we treat both and as nodes in the backtracking graph and add a directed edge from to . Note that for a fixed , there may exist more than one leading to the maximum value of . The backtracking graph contains all those s. The backtracking process stops when no preceding cell can be found. We then prune the backtracking graph B so that only nodes that are the descendants of cells will be kept in the backtracking graph. The above method will construct a backtracking graph B containing all proteoforms with mass M = mm. Since each node in B has a value, we use the values to decompose the nodes in B into layers, where layer t contains all the nodes with .
2.4.1 Two proteoforms have different total mass
In practice, the input mass spectrum may also contain peaks from different proteoforms of the same protein with total mass different from M. In that case, the total mass values of the different proteoforms should be M − x, where x (positive number) is the difference between two mass shifts. For example, in Fig. 1B, ifM is the mass for the type 4 proteoform, then and the mass value for type 5 proteoform is . Given a protein sequence and its possible modifications, we can easily construct the list of mass shift differences. See Fig. 1B (the rightmost column) as an example.
In this case, we will start with a node in and the other node such that is in the range and is the largest to construct the backtracking graph. Similarly, we organize all the nodes in the backtracking graph layer by layer based on their values.
A small example is given in the Supplementary Material S1.
2.5 Proteoform quantification
We compute peak error correction alignments that best fit the input mass spectrum to identify proteoforms. When a mass spectrum contains multiple proteoforms, we want to find the abundances of each proteoform. The intensities of matched peaks in the corresponding alignments can be used for proteoform quantification. Here, we consider the case, where there are two proteoforms.
Let be the error correction alignment for a proteoform P. Let p be the theoretical peak intensity for the proteoform. The total intensity error for the proteoform is defined as , where is the intensity of peak .
Intuitively, the proteoform quantification problem is to find two alignments A1 and A2 from the backtracking graph and the corresponding theoretical intensities p1 and p2 such that is minimized. The abundance of the proteoform corresponding to Ai is for i = 1 and 2.
Since the two proteoforms may share some common peaks, we have to define the intensity error for a pair of nodes in the backtracking graph. We consider two cases.
Note that p1 and p2 are not known. Let pmax be the maximum peak intensity for all the peaks involved in B. We can guess the values of both p1 and p2 as , where .
The algorithm is described as Algorithm 1.
The algorithm to the combinatorial alignment problem.
1: fordo
2: fordo
3: compute layer by layer.
4: end for
5: end for
The two proteoforms have total masses M and M − x for x > 0, respectively. In this case, x should be a given mass shift difference based on the input protein and its possible modifications. An example is given in Fig. 1B.
In order to obtain two proteoforms under Case 2, we will try to find two alignments corresponding to two paths from the two starting backtracking nodes and back to a node . The two paths may have different numbers of edges due to various reasons, e.g. loss of peaks, etc.
Without loss of generality, we assume that . (The other case is similar.) For any pair of nodes va and vb in the same layer i, for , we still use Equations (3) and (4) to compute and .
3 Results
First, we use simulated datasets to evaluate the prediction quality.
3.1 Prediction quality evaluation
To evaluate the quality of the proteoform quantification algorithm, we generate the simulated top-down HomMTM spectra using Top-down MaSS-Simulator, a package specifically designed to generate top-down spectra, based on selected protein sequences (see https://github.com/KunyiE1/TopDown-MaSS-Simulator).
To get the selected protein sequences, we randomly select 10 proteins from the whole Histone H4 protein database used in Liu et al. (2013) downloaded from UniProt (Accession number: P62805) including 291 protein entries. Then we use these 10 proteins to generate the simulated top-down HomMTM spectra and the size of amino acids contained in each protein is from 65 to 109.
Three mutations were used as variable PTMs: lysine (K) to cysteine (C) (UNIMOD Accession number: 1132), threonine (T) to alanine (A) (UNIMOD Accession number: 659), and valine (V) to glycine (G) (UNIMOD Accession number: 672). These three mutations were also gathered together in a txt format file as input. Note that the total masses of these two different proteoforms may be the same or differ by some mass shifts according to Case 1 and Case 2 described in Methods. In this way, for each of the simulated top-down HomMTM spectra, we know the real corresponding protein as well as the corresponding proteoforms.
3.1.1 Generating mixture top-down HomMTM spectra
To generate the top-down HomMTM spectra using the website https://github.com/KunyiE1/TopDown-MaSS-Simulator, we randomly select one kind of proteoform for the selected protein and convert it to the corresponding sequence file. Then we run the simulation program and get a simulated top-down spectrum file for the proteoform of the protein. To add peaks of the other proteoform of the same protein, we manually calculate the mass values of the theoretical peaks for the other kind of proteoform for the same protein, add some random mass errors within the error tolerance setting, and add these peaks to the simulated top-down spectrum file. Besides, we set the mean value of the peak intensities of one kind of proteoform to be x = 500 and the other kind of proteoform to be x, 2x, and 3x, respectively. The variances of the peak intensities are set to be 50 and 25, respectively. Therefore, for this simulated mixture top-down HomMTM spectra, the theoretical relative abundance of these two proteoforms should be 50% vs 50%, 33.3% vs 66.7%, and 25% vs 75%, respectively.
3.1.2 Identification and quantification results
For each mixture top-down HomMTM spectrum, we apply our method to directly align the spectrum with the corresponding protein. The detailed results are shown in Table 1. We can see that our algorithm can report the two correct proteoforms for all the simulated mixture HomMTM top-down spectra and the relative abundances for most proteoforms are also correct.
Proteoform identification and quantification results for simulated top-down HomMTM spectrum.
Protein . | Size . | # of peak1a . | # of peak2a . | # of mod1b . | # of mod2b . | Abundance . | Running time (ms) . |
---|---|---|---|---|---|---|---|
sp|P0AB83|END3_ECOLI | 65 | 27 | 23 | 0 | 3 | 59% and 41% | 2316.09 |
sp|P0ABU0|MENB_ECOLI | 68 | 18 | 17 | 0 | 3 | 50% and 50% | 1166.22 |
sp|P77690|ARNB_ECOLI | 74 | 16 | 14 | 0 | 3 | 51% and 49% | 1007.79 |
sp|P0A948|RIMJ_ECOLI | 77 | 14 | 23 | 0 | 3 | 41% and 59% | 957.40 |
sp|P37626|YHII_ECOLI | 85 | 23 | 18 | 0 | 3 | 67% and 33% | 1462.98 |
sp|P0DP22|YJIQ_ECOLI | 88 | 9 | 15 | 0 | 3 | 56% and 43% | 785.56 |
sp|P0AF03|MOG_ECOLI | 91 | 18 | 24 | 0 | 3 | 58% and 42% | 1752.53 |
sp|P10441|LPXB_ECOLI | 98 | 19 | 20 | 0 | 3 | 70% and 30% | 1725.71 |
sp|P05458|PTRA_ECOLI | 103 | 13 | 13 | 0 | 3 | 12% and 88% | 1004.92 |
sp|P0AFS1|LSRD_ECOLI | 109 | 17 | 20 | 0 | 3 | 78% and 22% | 1589.71 |
Protein . | Size . | # of peak1a . | # of peak2a . | # of mod1b . | # of mod2b . | Abundance . | Running time (ms) . |
---|---|---|---|---|---|---|---|
sp|P0AB83|END3_ECOLI | 65 | 27 | 23 | 0 | 3 | 59% and 41% | 2316.09 |
sp|P0ABU0|MENB_ECOLI | 68 | 18 | 17 | 0 | 3 | 50% and 50% | 1166.22 |
sp|P77690|ARNB_ECOLI | 74 | 16 | 14 | 0 | 3 | 51% and 49% | 1007.79 |
sp|P0A948|RIMJ_ECOLI | 77 | 14 | 23 | 0 | 3 | 41% and 59% | 957.40 |
sp|P37626|YHII_ECOLI | 85 | 23 | 18 | 0 | 3 | 67% and 33% | 1462.98 |
sp|P0DP22|YJIQ_ECOLI | 88 | 9 | 15 | 0 | 3 | 56% and 43% | 785.56 |
sp|P0AF03|MOG_ECOLI | 91 | 18 | 24 | 0 | 3 | 58% and 42% | 1752.53 |
sp|P10441|LPXB_ECOLI | 98 | 19 | 20 | 0 | 3 | 70% and 30% | 1725.71 |
sp|P05458|PTRA_ECOLI | 103 | 13 | 13 | 0 | 3 | 12% and 88% | 1004.92 |
sp|P0AFS1|LSRD_ECOLI | 109 | 17 | 20 | 0 | 3 | 78% and 22% | 1589.71 |
These two columns indicate the total numbers of matched peaks when aligning the mixture top-down HomMTM spectrum with the first and second proteoforms, respectively.
These two columns indicate the total numbers of modifications in the first and second proteoforms, respectively.
Proteoform identification and quantification results for simulated top-down HomMTM spectrum.
Protein . | Size . | # of peak1a . | # of peak2a . | # of mod1b . | # of mod2b . | Abundance . | Running time (ms) . |
---|---|---|---|---|---|---|---|
sp|P0AB83|END3_ECOLI | 65 | 27 | 23 | 0 | 3 | 59% and 41% | 2316.09 |
sp|P0ABU0|MENB_ECOLI | 68 | 18 | 17 | 0 | 3 | 50% and 50% | 1166.22 |
sp|P77690|ARNB_ECOLI | 74 | 16 | 14 | 0 | 3 | 51% and 49% | 1007.79 |
sp|P0A948|RIMJ_ECOLI | 77 | 14 | 23 | 0 | 3 | 41% and 59% | 957.40 |
sp|P37626|YHII_ECOLI | 85 | 23 | 18 | 0 | 3 | 67% and 33% | 1462.98 |
sp|P0DP22|YJIQ_ECOLI | 88 | 9 | 15 | 0 | 3 | 56% and 43% | 785.56 |
sp|P0AF03|MOG_ECOLI | 91 | 18 | 24 | 0 | 3 | 58% and 42% | 1752.53 |
sp|P10441|LPXB_ECOLI | 98 | 19 | 20 | 0 | 3 | 70% and 30% | 1725.71 |
sp|P05458|PTRA_ECOLI | 103 | 13 | 13 | 0 | 3 | 12% and 88% | 1004.92 |
sp|P0AFS1|LSRD_ECOLI | 109 | 17 | 20 | 0 | 3 | 78% and 22% | 1589.71 |
Protein . | Size . | # of peak1a . | # of peak2a . | # of mod1b . | # of mod2b . | Abundance . | Running time (ms) . |
---|---|---|---|---|---|---|---|
sp|P0AB83|END3_ECOLI | 65 | 27 | 23 | 0 | 3 | 59% and 41% | 2316.09 |
sp|P0ABU0|MENB_ECOLI | 68 | 18 | 17 | 0 | 3 | 50% and 50% | 1166.22 |
sp|P77690|ARNB_ECOLI | 74 | 16 | 14 | 0 | 3 | 51% and 49% | 1007.79 |
sp|P0A948|RIMJ_ECOLI | 77 | 14 | 23 | 0 | 3 | 41% and 59% | 957.40 |
sp|P37626|YHII_ECOLI | 85 | 23 | 18 | 0 | 3 | 67% and 33% | 1462.98 |
sp|P0DP22|YJIQ_ECOLI | 88 | 9 | 15 | 0 | 3 | 56% and 43% | 785.56 |
sp|P0AF03|MOG_ECOLI | 91 | 18 | 24 | 0 | 3 | 58% and 42% | 1752.53 |
sp|P10441|LPXB_ECOLI | 98 | 19 | 20 | 0 | 3 | 70% and 30% | 1725.71 |
sp|P05458|PTRA_ECOLI | 103 | 13 | 13 | 0 | 3 | 12% and 88% | 1004.92 |
sp|P0AFS1|LSRD_ECOLI | 109 | 17 | 20 | 0 | 3 | 78% and 22% | 1589.71 |
These two columns indicate the total numbers of matched peaks when aligning the mixture top-down HomMTM spectrum with the first and second proteoforms, respectively.
These two columns indicate the total numbers of modifications in the first and second proteoforms, respectively.
3.1.3 Evaluation on mixture bottom-up HomMTM spectra
We also evaluate the quality of the proteoform quantification algorithm on the simulated bottom-up HomMTM spectra. For more experiment details and results, see the Supplementary Material S1.
3.2 Case study
Here give a detailed explanation for the case in the third row of Table 1, which is the protein with 74 amino acids. Figure 3 shows the detailed graphical illustration of the quantification results.

The graphical illustration of the quantification results for case 2. The original protein is with 74 amino acids. The two numbers on each nodes represent the peak number and node number in the alignment, respectively.
As shown in Fig. 3, two proteoforms are identified and they are represented as two paths. For the proteform A, there is no reported PTM. For the proteoform B, the first PTM lysine (K) to cysteine (C) reported to occur in the third amino acid, the second PTM lysine (K) to cysteine (C) reported to occur in the 10th amino acid and the last PTM hreonine(T) to alanine (A) reported to occur in the 12th amino acid. Since these two proteoforms have different total masses, the ending node pairs of these two proteoforms are different. In addition, the abundance of the proteoform A is 51% and the abundance of the proteoform B is 49%. Under the condition of this combinatorial abundance, these two proteoforms have the minimum sum of peak intensity errors. Besides, this spectrum can align 15 more peaks when identified as these two combinatorial proteoforms than when identified as a single proteoform.
3.3 Results on real datasets
In this section, we apply our method to some real datasets. The mass spectra dataset is generated from the human Histone H4 protein downloaded from Liu et al. (2013). The reference protein database is download from UniProt (Liu et al. 2013) and it includes 291 reference proteins. For this Histone H4 dataset (database), two mutations were used as variable PTMs similar to Liu et al. (2013) and these two modifications are Phosphorylation (UNIMOD Accession number: 21) and di-Methylation (UNIMOD Accession number: 36). A txt format file was generated based on these two pre-defined mutations as part of the input.
3.3.1 Comparison with Zhu and Liu’s algorithm
Here, we compared our method with Zhu and Liu’s method proposed in Zhu and Liu (2018) using the Histone H4 real datasets introduced above. As suggested in Zhu and Liu (2018), an alignment with at least 10 matched peaks is considered reliable. With a cutoff of 10 matched peaks, our algorithm identified 5 spectra matched to single (reliable) proteoform and 1154 spectra matched to (reliable) proteoform pairs while Zhu and Liu’s algorithm identified 441 spectra matched to single proteoforms and 184 spectra matched to proteoform pairs.
To show some details, we select 10 mass spectra and used our method to find proteoforms. For each of the 10 cases, we have found two proteoforms and their corresponding relative abundances. The details are given in the Table 2.
Proteoform identification and quantification results for real top-down HomMTM spectrum.
Protein . | Size . | # of peak1a . | # of peak2a . | # of mod1b . | # of mod2b . | Abundance . | Running time (ms) . |
---|---|---|---|---|---|---|---|
sp|P0DPK2|H3Y1_HUMAN | 94 | 16 | 16 | 23 | 23 | 50% and 50% | 85 692.89 |
sp|O00213|APBB1_HUMAN | 101 | 13 | 13 | 10 | 10 | 41% and 59% | 44 631.51 |
sp|O75461|E2F6_HUMAN | 93 | 18 | 18 | 18 | 18 | 54% and 46% | 52 947.28 |
sp|O00268|TAF4_HUMAN | 37 | 9 | 9 | 7 | 9 | 38% and 62% | 23 077.81 |
sp|P0DPK5|H3Y2_HUMAN | 95 | 17 | 17 | 19 | 18 | 47% and 53% | 37 206.15 |
sp|Q03164|KMT2A_HUMAN | 93 | 27 | 27 | 16 | 15 | 50% and 50% | 2 61 174.37 |
sp|Q9P0M6|H2AW_HUMAN | 102 | 21 | 21 | 25 | 27 | 54% and 46% | 2 41 834.38 |
sp|Q14839|CHD4_HUMAN | 91 | 19 | 19 | 27 | 28 | 42% and 58% | 80 985.51 |
sp|Q93077|H2A1C_HUMAN | 101 | 17 | 17 | 8 | 9 | 36% and 64% | 79 641.00 |
sp|Q4FZB7|KMT5B_HUMAN | 96 | 21 | 21 | 22 | 22 | 52% and 48% | 48 161.31 |
Protein . | Size . | # of peak1a . | # of peak2a . | # of mod1b . | # of mod2b . | Abundance . | Running time (ms) . |
---|---|---|---|---|---|---|---|
sp|P0DPK2|H3Y1_HUMAN | 94 | 16 | 16 | 23 | 23 | 50% and 50% | 85 692.89 |
sp|O00213|APBB1_HUMAN | 101 | 13 | 13 | 10 | 10 | 41% and 59% | 44 631.51 |
sp|O75461|E2F6_HUMAN | 93 | 18 | 18 | 18 | 18 | 54% and 46% | 52 947.28 |
sp|O00268|TAF4_HUMAN | 37 | 9 | 9 | 7 | 9 | 38% and 62% | 23 077.81 |
sp|P0DPK5|H3Y2_HUMAN | 95 | 17 | 17 | 19 | 18 | 47% and 53% | 37 206.15 |
sp|Q03164|KMT2A_HUMAN | 93 | 27 | 27 | 16 | 15 | 50% and 50% | 2 61 174.37 |
sp|Q9P0M6|H2AW_HUMAN | 102 | 21 | 21 | 25 | 27 | 54% and 46% | 2 41 834.38 |
sp|Q14839|CHD4_HUMAN | 91 | 19 | 19 | 27 | 28 | 42% and 58% | 80 985.51 |
sp|Q93077|H2A1C_HUMAN | 101 | 17 | 17 | 8 | 9 | 36% and 64% | 79 641.00 |
sp|Q4FZB7|KMT5B_HUMAN | 96 | 21 | 21 | 22 | 22 | 52% and 48% | 48 161.31 |
These two columns indicate the total numbers of matched peaks when aligning the mixture top-down HomMTM spectrum with the first and second proteoforms, respectively.
These two columns indicate the total numbers of modifications in the first and second proteoforms, respectively.
Proteoform identification and quantification results for real top-down HomMTM spectrum.
Protein . | Size . | # of peak1a . | # of peak2a . | # of mod1b . | # of mod2b . | Abundance . | Running time (ms) . |
---|---|---|---|---|---|---|---|
sp|P0DPK2|H3Y1_HUMAN | 94 | 16 | 16 | 23 | 23 | 50% and 50% | 85 692.89 |
sp|O00213|APBB1_HUMAN | 101 | 13 | 13 | 10 | 10 | 41% and 59% | 44 631.51 |
sp|O75461|E2F6_HUMAN | 93 | 18 | 18 | 18 | 18 | 54% and 46% | 52 947.28 |
sp|O00268|TAF4_HUMAN | 37 | 9 | 9 | 7 | 9 | 38% and 62% | 23 077.81 |
sp|P0DPK5|H3Y2_HUMAN | 95 | 17 | 17 | 19 | 18 | 47% and 53% | 37 206.15 |
sp|Q03164|KMT2A_HUMAN | 93 | 27 | 27 | 16 | 15 | 50% and 50% | 2 61 174.37 |
sp|Q9P0M6|H2AW_HUMAN | 102 | 21 | 21 | 25 | 27 | 54% and 46% | 2 41 834.38 |
sp|Q14839|CHD4_HUMAN | 91 | 19 | 19 | 27 | 28 | 42% and 58% | 80 985.51 |
sp|Q93077|H2A1C_HUMAN | 101 | 17 | 17 | 8 | 9 | 36% and 64% | 79 641.00 |
sp|Q4FZB7|KMT5B_HUMAN | 96 | 21 | 21 | 22 | 22 | 52% and 48% | 48 161.31 |
Protein . | Size . | # of peak1a . | # of peak2a . | # of mod1b . | # of mod2b . | Abundance . | Running time (ms) . |
---|---|---|---|---|---|---|---|
sp|P0DPK2|H3Y1_HUMAN | 94 | 16 | 16 | 23 | 23 | 50% and 50% | 85 692.89 |
sp|O00213|APBB1_HUMAN | 101 | 13 | 13 | 10 | 10 | 41% and 59% | 44 631.51 |
sp|O75461|E2F6_HUMAN | 93 | 18 | 18 | 18 | 18 | 54% and 46% | 52 947.28 |
sp|O00268|TAF4_HUMAN | 37 | 9 | 9 | 7 | 9 | 38% and 62% | 23 077.81 |
sp|P0DPK5|H3Y2_HUMAN | 95 | 17 | 17 | 19 | 18 | 47% and 53% | 37 206.15 |
sp|Q03164|KMT2A_HUMAN | 93 | 27 | 27 | 16 | 15 | 50% and 50% | 2 61 174.37 |
sp|Q9P0M6|H2AW_HUMAN | 102 | 21 | 21 | 25 | 27 | 54% and 46% | 2 41 834.38 |
sp|Q14839|CHD4_HUMAN | 91 | 19 | 19 | 27 | 28 | 42% and 58% | 80 985.51 |
sp|Q93077|H2A1C_HUMAN | 101 | 17 | 17 | 8 | 9 | 36% and 64% | 79 641.00 |
sp|Q4FZB7|KMT5B_HUMAN | 96 | 21 | 21 | 22 | 22 | 52% and 48% | 48 161.31 |
These two columns indicate the total numbers of matched peaks when aligning the mixture top-down HomMTM spectrum with the first and second proteoforms, respectively.
These two columns indicate the total numbers of modifications in the first and second proteoforms, respectively.
4 Conclusion
We proposed an algorithm for proteoform quantification with peak error corrections based on top-down tandem mass spectrum and produced a program package in C++. Our experimental results on real-world data demonstrate the capability to quantify proteoform combinations with at most two different proteoforms. This is a small step for proteoform quantification, which is very much needed in the field of top-down MS-based proteomics. Still, it is an interesting open problem to design algorithms for identifying and quantifying mixture spectra with larger number of proteoforms.
Author contributions
Both Zhaohui Zhan and Lusheng Wang are involved in all aspects of the project. In particular, Zhaohui Zhan focuses on the implementation of the software package and reporting the experiment results. Lusheng Wang is responsible for algorithm design, experiment design, manuscript writing, etc.
Supplementary data
Supplementary data are available at Bioinformatics online.
Conflict of interest: None declared.
Funding
This work was fully supported by funds from the National Science Foundation (NSF: 61972329), GRF Grants for Hong Kong Special Administrative Region, P. R. China (CityU 11210119, and CityU 11206120) and a Strategic Research Grant from City University of Hong Kong (Project Number 7005851).
Data availability
All source code and experiment data are available at https://github.com/Zeirdo/TopMGQuant.