Abstract

Accurately identifying driver genes among the large number of mutated genes supports the diagnosis and treatment of cancer. In recent years, methods that uncover driver genes by integrating mutation data with gene interaction networks have gained increasing attention. However, it remains unclear whether integrating several types of mutation information (frequency and functional impact) with gene networks is more effective for prioritizing driver genes. Hence, we build a two-stage-vote ensemble framework based on somatic mutations and mutual interactions. Specifically, we first represent and combine various kinds of mutation information, which are then propagated through networks by an improved iterative framework. The first vote is conducted on the iteration results by voting methods, and the second vote integrates the results of the first vote into the final driver gene list. Compared with four previous approaches, our method performs better in identifying driver genes on |$33$| types of cancer from The Cancer Genome Atlas. We also conduct a comparative analysis of two kinds of mutation information, five gene interaction networks and four voting strategies. Our framework offers a new view of data integration and helps more latent cancer genes to be recognized.

1 Introduction

Cancer is the second most frequent cause of death worldwide, killing more than |$8$| million people every year; the incidence of cancer is expected to increase by more than |$50$|% over the coming decades [1, 2]. On average, cancer genomes contain only |$4$| to |$5$| driver mutations, which play an important role in the development of cancer, when coding and non-coding genomic elements are combined [3]. One of the fundamental tasks in the analysis and interpretation of cancer genomic data is to distinguish genes with driver mutations, i.e. driver genes. From a practical point of view, the discovery of driver genes is crucial for the prevention, diagnosis and clinical treatment of cancer. From a theoretical point of view, finding these key driver genes helps us understand the mechanism of tumorigenesis and distinguish cancer subtypes better.

Over the past decade, researchers have proposed many models to identify driver genes based on several typical kinds of data. Among these, somatic mutation data is so informative that it is almost essential for prioritizing driver genes [4]. Thanks to many publicly available databases, such as The Cancer Genome Atlas (TCGA) [5], somatic mutation data is easily accessed. Researchers have taken advantage of recurrence, functional impact (FI) and other information mined from mutation data to identify driver genes. MutSigCV [6], based on recurrence information, accounts for three aspects of heterogeneity (samples, genes and mutation types) and assumes that the background mutation rate of each tumor type is not consistent. OncodriveFML [7] applies Combined Annotation Dependent Depletion (CADD) and RNAsnp [8] to determine the magnitude of the impact of all mutations that occur in genomic elements. It regards genes with functional mutation (FM) bias as drivers.

Fig. 1

Overview of our method. First, we collect mutation data of |$33$| kinds of cancer from TCGA and 5 gene interaction networks from previous papers. Next, we represent mutation data in feature scores (Frequency, PolyPhen, SIFT) and reconstruct gene networks in undirected graphs. Then, represented mutation data combinations and reconstructed gene networks are regarded as inputs of an iterative framework to get multiple rankings. Finally, a two-stage vote is applied to rankings above for the final ranked gene list.

In recent years, combining multiple kinds of information has become the preferred way to find driver genes, especially mutation data together with gene interaction networks. Mutual interactions between genes are widely used because they can elucidate molecular mechanisms of cancer development at the network level [9, 10, 11, 12]. Several computational methods benefit from the simultaneous analysis of mutation and gene network data [13, 14, 15, 16]. DriverNet [17], OncoIMPACT [18] and SCS [19] use greedy optimization approaches on manually curated subnetworks to prioritize the minimal set of driver genes. For example, SCS integrates somatic mutations and expression data with a reference molecular network to obtain driver mutation profiles for individual samples. DawnRank [20] and MECoRank [21] adopt random walk algorithms to diffuse the impact of perturbations to downstream genes. Specifically, DawnRank ranks mutated genes in a single patient following PageRank [22] and aggregates the rankings over a population by Condorcet voting [23]. DriverML [24] and MoProEmbeddings [25], two machine-learning algorithms, leverage the knowledge of well-established cancer driver genes to enable supervised prediction.

As mentioned above, mining mutation data provides various kinds of information. However, it remains open whether integrating several types of mutation information on the basis of gene interaction networks is more effective for prioritizing driver genes. Hence, we design a two-stage-vote ensemble framework that integrates the results obtained after propagating mutation information through networks by an improved iterative framework based on PageRank [22], Hyperlink-Induced Topic Search (HITS) [26] and their variants. In this work, useful information is first extracted from both mutation data and gene interaction networks. The mutation frequency and FI of the top |$N$| significantly mutated genes are represented as feature vectors based on somatic mutation data of |$33$| types of cancer from TCGA. Undirected graphs are obtained by reconstructing five gene interaction networks. Next, unlike PageRank and HITS, our algorithm takes the symmetric undirected graphs and the initialized mutation feature vectors mentioned above as inputs to the iterative framework. Finally, to achieve a better ensemble effect, four voting strategies (Borda voting, Geometric mean, HPA [27] and SetExpan [28]) are used to integrate the results. A comprehensive evaluation of our method is carried out using multiple benchmark measures against well-known driver datasets, such as Cancer Gene Census (CGC) [29], Network of Cancer Genes (NCG) [30] and Integrative Onco Genomics (IntOGen) [31].

2 Method

In this work, we build a two-stage-vote ensemble framework that propagates several types of mutation information through networks to identify more known cancer genes. An overview of our method is shown in Figure 1.

2.1 Mutation data representation

We obtain mutation data of |$33$| kinds of cancer from TCGA. Somatic mutations are of two types: (i) single-nucleotide variants (SNVs) and (ii) insertions and deletions (InDels). In the cancer context, diffusing a signal from genes that are somatically mutated across tumors is highly effective for identifying cancer-relevant genes and pathways [32, 33]. In contrast to frequency-based approaches, network propagation methods can even pinpoint rarely mutated driver genes if they lie within subnetworks whose component genes, when considered together, are frequently mutated [34]. In addition, whereas the FI of somatic mutations in the coding regions of genes is fairly straightforward to predict, much less is known about the effect of mutations in non-coding regions of the genome; only |$1$|% of somatic mutations detected in PCAWG WGS data are exonic [35].

2.1.1 Mutation recurrence

For these reasons, we choose the top |$N$| significantly mutated coding genes as the dataset for each kind of cancer. Specifically, the mutation frequency of a gene across samples equals the number of mutations that occur in this gene divided by the total number of mutations in the corresponding MAF file, and the experimental dataset includes only the top |$N$| most frequently mutated coding genes. The frequency score vector consists of the mutation recurrences of these |$N$| genes.
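The frequency score described above can be illustrated with a short sketch. It assumes that each mutation record of a MAF file has already been reduced to the symbol of the affected gene; the function name `frequency_scores` and the toy records are hypothetical and not taken from the released code.

```python
from collections import Counter

def frequency_scores(mutated_genes, top_n):
    """Return (gene, frequency) pairs for the top_n most frequently mutated genes.

    mutated_genes: one gene symbol per mutation record (e.g. one per MAF row).
    Frequency = mutations in the gene / all mutations in the file.
    """
    counts = Counter(mutated_genes)
    total = sum(counts.values())
    # Sort by count (descending), then by symbol so that ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(gene, n / total) for gene, n in ranked[:top_n]]

# Toy input (made up): six mutation records over three genes.
records = ["TP53", "KRAS", "TP53", "EGFR", "TP53", "KRAS"]
top = frequency_scores(records, top_n=2)
```

Here `top` holds the two most recurrent genes of the toy input together with their share of all mutation records.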

2.1.2 Functional impact

As a different signal of positive selection, FM bias is useful for uncovering driver genes, including lowly recurrent driver genes, and gene modules [36]. We therefore compute two kinds of FI scores for the selected top |$N$| genes from the SIFT [37, 38] and PolyPhen [39] entries in each MAF file; both are well-known tools that assess the FI of mutations. Specifically, for each gene we take the average SIFT score and the average PolyPhen score of all alterations that the gene bears across samples. Note that we follow the principle of Oncodrive-FM [36] that FM bias can only be computed for genes that have mutations in at least two samples.

The lowest SIFT score (|$0$|) and the highest PolyPhen (PPH2) score (|$1$|) represent the most deleterious mutations. Alterations without scores are assigned the highest values. In addition, we transform SIFT scores into |$1-\mathrm{SIFT}$| to simplify later calculation, so that higher values indicate more deleterious mutations for both scores.
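A minimal sketch of the FI scoring step follows. The record layout `(gene, sample, sift, polyphen)` and the function name are hypothetical. The text does not state on which scale "the highest values" for missing scores are meant, so the sketch assumes the raw scale and exposes the fill value as a parameter.

```python
from collections import defaultdict

def functional_impact_scores(records, missing_value=1.0, min_samples=2):
    """Average per-gene FI scores from (gene, sample, sift, polyphen) records.

    sift / polyphen may be None when the MAF entry carries no score.
    Returns {gene: (mean of 1 - SIFT, mean PolyPhen)}.
    """
    sift_vals = defaultdict(list)
    pph_vals = defaultdict(list)
    samples = defaultdict(set)
    for gene, sample, sift, pph in records:
        # Assumption: a missing score is replaced by missing_value on the raw scale.
        sift = missing_value if sift is None else sift
        pph = missing_value if pph is None else pph
        sift_vals[gene].append(1.0 - sift)  # flip SIFT so that higher = more deleterious
        pph_vals[gene].append(pph)
        samples[gene].add(sample)
    scores = {}
    for gene in sift_vals:
        if len(samples[gene]) < min_samples:
            continue  # FM bias needs mutations in at least two samples
        scores[gene] = (
            sum(sift_vals[gene]) / len(sift_vals[gene]),
            sum(pph_vals[gene]) / len(pph_vals[gene]),
        )
    return scores

# Toy records (made-up values).
records = [
    ("TP53", "s1", 0.0, 1.0),
    ("TP53", "s2", 0.5, 0.5),
    ("KRAS", "s1", 0.25, None),
    ("KRAS", "s3", 0.25, 0.5),
    ("EGFR", "s2", 0.1, 0.9),  # mutated in one sample only, so it is skipped
]
fi = functional_impact_scores(records)
```

Changing `missing_value` is the simplest way to test how sensitive the averages are to the treatment of unscored alterations.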

2.2 Gene interaction network reconstruction

In this paper, we adopt five gene interaction networks: the network from DawnRank (DawnRank), HINT+HI2012, iRefIndex [40], MultiNet [41] and the network constructed by Wu et al. [42]. DawnRank uses a variety of sources, including the network used in MEMo [43] as well as up-to-date curated information from Reactome [44], the NCI-Nature Curated PID [45] and KEGG [46], to build its interaction network [20]. HINT+HI2012 is a combination of the HINT network [47] and the HI-2012 [48] set of protein–protein interactions [33]. iRefIndex is obtained by parsing protein interaction records from interaction databases including BIND [49, 50], BioGrid [51] and IntAct [52, 53]. MultiNet integrates diverse modes of gene interactions (regulatory, genetic, phosphorylation, signaling, metabolic and physical protein–protein interactions) to create a unified biological network. Wu’s network, covering nearly |$50$|% of the human proteome, integrates information from known pathways as well as interactions derived from computational predictions [18].

One of the inputs is a symmetric |$N$|-by-|$N$| binary mutation matrix. The five reference gene interaction networks used in this paper are all processed in the same way. Consider a reference network |$G=(V,E)$|, where |$V$| and |$E$| represent genes and their interactions, respectively. The genes in |$V$| are restricted to the mutated genes in the experimental dataset of each kind of cancer (which limits the size of |$G$|). An undirected graph |$W=(V^{\prime},E^{\prime})$| is then obtained in which each vertex represents a gene, and there is an edge between two vertices if an interaction has been found between the corresponding genes in the reference network |$G$|. Specifically, if there is an interaction connecting gene |$i$| and gene |$j$| in |$G$|, |$W_{i,j}=1$|; otherwise, |$W_{i,j}=0$|. We also assume that |$W_{i,i}=1$|.
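The construction of |$W$| can be sketched as follows. The sketch assumes that the reference network is available as a list of gene pairs; the function name and the toy edges are illustrative only.

```python
def reconstruct_graph(genes, interactions):
    """Build the symmetric 0/1 matrix W over `genes` from reference-network edges.

    genes: ordered list of the selected mutated genes (defines row/column order).
    interactions: iterable of (gene_a, gene_b) pairs from the reference network G.
    """
    index = {g: i for i, g in enumerate(genes)}
    n = len(genes)
    w = [[0] * n for _ in range(n)]
    for i in range(n):
        w[i][i] = 1                    # W[i][i] = 1, as assumed in the text
    for a, b in interactions:
        if a in index and b in index:  # drop edges that leave the selected gene set
            i, j = index[a], index[b]
            w[i][j] = 1
            w[j][i] = 1                # undirected: mirror every edge
    return w

# Toy input (made up): the edge to SOS1 is dropped because SOS1 is not selected.
genes = ["TP53", "KRAS", "EGFR"]
edges = [("TP53", "KRAS"), ("EGFR", "SOS1"), ("KRAS", "EGFR")]
W = reconstruct_graph(genes, edges)
```

Because edges to genes outside the selected set are discarded, the resulting matrix can be much sparser than the reference network it comes from.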

2.3 Iterative framework

PageRank and HITS were proposed to rank vertices based on the graph structure. Specifically, these two algorithms measure the importance of web pages and rank them in search engine results. PageRank estimates the importance score of vertices as the stationary distribution of a random walk process: starting from a vertex, the surfer randomly jumps to a neighbor according to the edge weight [54]. The HITS algorithm is similar to PageRank in some aspects. It assumes that each vertex has two roles: hub and authority [26]. If a vertex is linked by many vertices with high hub scores, this vertex has high authority, and vice versa [54].

While initial network approaches to identifying disease genes focused on propagating knowledge from a set of known ‘gold standard’ disease genes, with the widespread availability of cancer sequencing data and genome-wide association studies, the source from which information is propagated has shifted to genes that are newly identified as possibly playing a role in disease [34]. In our model, we assume that coding genes with higher mutation frequency are the source of information and that their information can be represented through diverse feature scores.

Specifically, we propose a model based on PageRank, HITS and their variants. An iterative process is used to propagate information about somatic alterations in measurable form, i.e. mutation frequency and FI scores, through interaction networks. Our model propagates the information contained in two feature vectors synchronously through iterations over one undirected graph. Let the undirected graph be |$M$| and the two feature vectors be |$A$| and |$H$|. The generalized equations can be written as
and
where |$A_{0}$| and |$H_{0}$| are the initial values of the two vectors and |$k$| denotes the iteration number. |$M$| is an |$N$|-by-|$N$| matrix, and |$A_{0}$| and |$H_{0}$| both have |$N$| dimensions. Besides depending on the iteration of |$M$|, |$A_{k}$| and |$H_{k}$| are influenced by each other when |$k$| is odd; otherwise, they are determined by their own initial values. The iteration stops when the errors of |$A$| and |$H$| between the |$(k-1)$|-th and |$k$|-th iterations are less than |$\theta $|. In this paper, we set |$\theta $| to |$10^{-3}$|.
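The exact update equations are not reproduced in this text, so the following is only a sketch of one reading that is consistent with the description above. It assumes a HITS-like alternating update, |$A_{k}=MH_{k-1}$| and |$H_{k}=MA_{k-1}$|, under which |$A_{k}$| depends on |$H_{0}$| for odd |$k$| and on |$A_{0}$| for even |$k$|. Rescaling each vector to unit sum and measuring the error as the largest absolute change are further assumptions, as is the `max_iter` guard.

```python
def normalize(v):
    """Scale a non-negative vector so that its entries sum to 1."""
    total = sum(v)
    return [x / total for x in v] if total > 0 else list(v)

def matvec(m, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]

def propagate(m, a0, h0, theta=1e-3, max_iter=1000):
    """Alternately propagate two feature vectors through the graph matrix m.

    Assumed update (HITS-like): A_k = M H_{k-1}, H_k = M A_{k-1}, each rescaled
    to unit sum. Stops once both vectors change by less than theta.
    """
    a, h = normalize(a0), normalize(h0)
    for k in range(1, max_iter + 1):
        a_new = normalize(matvec(m, h))  # A is updated from the previous H
        h_new = normalize(matvec(m, a))  # H is updated from the previous A
        err_a = max(abs(x - y) for x, y in zip(a_new, a))
        err_h = max(abs(x - y) for x, y in zip(h_new, h))
        a, h = a_new, h_new
        if err_a < theta and err_h < theta:
            break
    return a, h, k

# Toy graph: a three-gene path with self-loops; feature values are made up.
M = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]
A, H, iterations = propagate(M, a0=[0.5, 0.3, 0.2], h0=[0.9, 0.6, 0.4])
```

On this toy graph the middle gene, which has the most neighbors, ends up with the largest score in both output vectors. Sorting the genes by `A` and by `H` then gives the two rankings produced for one input.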

Here, we combine the Frequency, PolyPhen and SIFT scores in pairs, which generates three feature combinations, i.e. Frequency_PolyPhen, Frequency_SIFT and PolyPhen_SIFT. The two elements of a combination are regarded as |$A_{0}$| and |$H_{0}$|, respectively. The undirected graph |$M$| is provided by the reconstructed DawnRank, HINT+HI2012, iRefIndex, MultiNet and Wu’s networks. In total, three feature combinations and five undirected graphs produce |$15$| different inputs, and the updated |$A_{k}$| and |$H_{k}$| are the two outputs for each input.
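The |$15$| inputs can be enumerated mechanically, as the short sketch below shows. The label strings follow the names used in the text; the variable names are illustrative.

```python
from itertools import combinations, product

features = ["Frequency", "PolyPhen", "SIFT"]
networks = ["DawnRank", "HINT+HI2012", "iRefIndex", "MultiNet", "Wu"]

# Unordered pairs of feature scores: the two members play the roles of A_0 and H_0.
feature_pairs = ["_".join(pair) for pair in combinations(features, 2)]

# Every feature pair is run on every reconstructed network.
inputs = list(product(feature_pairs, networks))
```

Each element of `inputs` is a `(feature_pair, network)` tuple that identifies one run of the iterative framework.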

Fig. 2

Analysis on the first vote. (A) Fraction of CGC genes in top |$50$| of 3 feature combinations when applying different voting strategies. Each column corresponds to one kind of cancer and each row represents one feature combination applying one voting strategy, i.e. voting method: feature combination. ‘Fraction’ in the legend refers to the percentage of CGC genes in top 50. Note that Geometric represents the voting method Geometric mean. (B) Overlap of CGC gene sets predicted by 3 feature combinations in the voting method HPA. Each bar on the bottom left represents the number of all CGC genes in top |$50$| of |$33$| kinds of cancer predicted by each feature combination. Each panel on the right displays the intersection size of predicted CGC genes sets. (C) Fraction of CGC genes in top |$50$| of 5 gene interaction networks in the voting method Geometric mean. Each panel includes results of all |$33$| types of cancer and represents one gene interaction network.

2.4 Voting strategy

In this paper, four voting strategies (Borda voting, Geometric mean, HPA and SetExpan) are applied in order to determine whether the voting strategy affects the search for drivers and, if so, which one performs well.

Borda voting gives each candidate a number of points according to its position in every ranking and regards the candidate with the most points as the winner. Geometric mean multiplies |$n$| numbers together and takes the |$n$|-th root of the product; here, these numbers are the rankings of a candidate. HPA selects the top |$k$| rankings that are nearest to the average of all rankings and weights them by their similarity to the average ranking. The final selection equals the weighted sum of the top |$k$| rankings. SetExpan collects |$T$| pre-ranked lists and scores each entity based on its mean reciprocal rank over all |$T$| rankings.
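Three of the four strategies are simple enough to sketch directly. The code below assumes that every input list ranks the same set of genes, uses one common Borda point scheme (|$n-r$| points for rank |$r$| among |$n$| genes) and breaks ties alphabetically. These are illustrative choices, and HPA is left out because its similarity weighting is not specified in enough detail here.

```python
from math import prod

def rank_positions(rankings):
    """Map each gene to its 1-based ranks across all lists (same genes in each list)."""
    positions = {}
    for ranking in rankings:
        for rank, gene in enumerate(ranking, start=1):
            positions.setdefault(gene, []).append(rank)
    return positions

def borda(rankings):
    """Borda count: a gene at rank r in a list of n genes earns n - r points."""
    n = len(rankings[0])
    pos = rank_positions(rankings)
    points = {g: sum(n - r for r in ranks) for g, ranks in pos.items()}
    return sorted(points, key=lambda g: (-points[g], g))

def geometric_mean(rankings):
    """Order genes by the geometric mean of their ranks (smaller is better)."""
    pos = rank_positions(rankings)
    gmean = {g: prod(ranks) ** (1.0 / len(ranks)) for g, ranks in pos.items()}
    return sorted(gmean, key=lambda g: (gmean[g], g))

def mean_reciprocal_rank(rankings):
    """Order genes by mean reciprocal rank, the scoring attributed to SetExpan."""
    pos = rank_positions(rankings)
    mrr = {g: sum(1.0 / r for r in ranks) / len(rankings) for g, ranks in pos.items()}
    return sorted(mrr, key=lambda g: (-mrr[g], g))

# Toy rankings (made up) of four genes.
lists = [
    ["TP53", "KRAS", "EGFR", "SOS1"],
    ["KRAS", "TP53", "SOS1", "EGFR"],
    ["TP53", "EGFR", "KRAS", "SOS1"],
]
```

On these toy lists all three functions return the same order, but they can disagree when a gene is ranked very high in one list and very low in the others.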

Here, we build a two-stage voting pipeline. First, we integrate corresponding iteration outcomes for Frequency_PolyPhen, Frequency_SIFT and PolyPhen_SIFT by the first vote, respectively, and produce three integrated ranked gene lists to study which feature combination is effective. Then, we get the ensemble of these three rankings by the second vote in order to get the final selection and detect novel driver genes.
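The two-stage pipeline can be sketched as a pair of nested aggregations. The sketch uses mean reciprocal rank as the voting function for both stages. The data layout, function names and toy rankings are hypothetical, and any other voting function with the same signature could be passed in.

```python
def mrr_vote(rankings):
    """Aggregate ranked lists by mean reciprocal rank (genes missing from a list add 0)."""
    score = {}
    for ranking in rankings:
        for rank, gene in enumerate(ranking, start=1):
            score[gene] = score.get(gene, 0.0) + 1.0 / rank
    # Higher mean reciprocal rank first; ties are broken alphabetically.
    return sorted(score, key=lambda g: (-score[g] / len(rankings), g))

def two_stage_vote(rankings_by_combination, vote=mrr_vote, top_k=50):
    """First vote within each feature combination, second vote across combinations."""
    first = {name: vote(lists) for name, lists in rankings_by_combination.items()}
    final = vote(list(first.values()))
    return first, final[:top_k]

# Toy rankings (made up): in practice each combination holds the lists
# produced by the iterative framework on the five networks.
toy = {
    "Frequency_PolyPhen": [["TP53", "KRAS", "SOS1"], ["TP53", "SOS1", "KRAS"]],
    "Frequency_SIFT":     [["KRAS", "TP53", "SOS1"], ["TP53", "KRAS", "SOS1"]],
    "PolyPhen_SIFT":      [["TP53", "KRAS", "SOS1"], ["KRAS", "SOS1", "TP53"]],
}
first_vote, final_list = two_stage_vote(toy, top_k=2)
```

`first_vote` keeps one integrated list per feature combination, which is what the per-combination analysis relies on, while `final_list` is the candidate driver list after the second vote.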

3 Results

In this paper, a comprehensive evaluation of our method is performed on mutation data of all |$33$| types of cancer from TCGA using multiple benchmark measures. First, we compare the results of the first vote for the different feature combinations, i.e. Frequency and PolyPhen, Frequency and SIFT, and PolyPhen and SIFT. Then, we analyze the final ranking produced by the second vote from four perspectives: overlap of the first vote, gold benchmark driver sets, voting strategies and other algorithms. In addition, we summarize the list of novel predicted driver genes and examine several of these genes in detail.

3.1 Analysis on mutation information

As described in the Method section, we quantify mutation information by three kinds of feature scores, Frequency, PolyPhen and SIFT, for each gene in every cancer. The scores are then combined in pairs: Frequency_PolyPhen, Frequency_SIFT and PolyPhen_SIFT. Each combination, together with each of the five reconstructed networks, serves as input to the model. Subsequently, four voting strategies, i.e. Borda voting, Geometric mean, HPA and SetExpan, are applied to integrate the outputs of each feature combination over the five networks. The fraction of CGC genes in the top |$50$| helps reveal which combination is more effective for identifying driver genes.
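The evaluation measure used throughout this section can be written in a few lines. The sketch divides by the number of genes actually present in the top-|$k$| slice, which is an assumption for lists shorter than |$k$|. The benchmark set shown is a placeholder, not the real CGC list.

```python
def benchmark_fraction(ranked_genes, benchmark, top_k=50):
    """Fraction of the top_k ranked genes that appear in a benchmark driver set."""
    top = ranked_genes[:top_k]
    if not top:
        return 0.0
    return sum(gene in benchmark for gene in top) / len(top)

cgc_like = {"TP53", "KRAS", "EGFR"}   # placeholder set, not the real CGC list
ranking = ["TP53", "SOS1", "KRAS", "EGR1"]
fraction = benchmark_fraction(ranking, cgc_like, top_k=4)
```

The same function applies unchanged to the NCG and IntOGen benchmark sets by swapping the `benchmark` argument.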

Fig. 3

Evaluation for the second vote. (A) Comparison on overlap of three combinations after the first vote and predicted driver genes after the second vote by the voting method SetExpan. Each panel represents fractions of CGC genes in top |$50$| predicted for each kind of cancer by two sets. Blue panels match overlap of 3 feature combinations, while orange ones stand for results of the second vote. (B) Comparison of results after the second vote by four voting strategies. Each line represents one voting strategy and varies in |$33$| types of cancer. Note that Geometric represents the voting method Geometric mean. (C) Comparison of results after the second vote in three benchmark datasets. Each panel includes results of all |$33$| types of cancer and represents one benchmark driver set.

The performance of the first vote with the four voting strategies is shown in Figure 2A. Each row includes the results of all |$33$| types of cancer and represents one feature combination under one voting strategy. The three combinations identify similar numbers of CGC genes, although the combination of Frequency and SIFT seems to perform better in terms of the maximum and median over the |$33$| kinds of cancer, |$70$|% and |$54$|%, respectively. In addition, for all three combinations, the fraction of identified CGC genes reaches |$40$|% in most kinds of cancer. However, the results for some cancers, such as MESO and UVM, are below |$35$|%. In general, the numbers of coding genes and samples in most types of cancer are large enough when we choose the top |$N$| significantly mutated coding genes as the datasets, but there are exceptions. For example, there are only |$2967$| and |$1476$| coding genes in MESO and UVM, respectively. This may be why fewer CGC genes are found in these two kinds of cancer.

We further explore the three feature combinations from a pan-cancer perspective. Specifically, for each combination we merge all CGC genes in the top |$50$| of all |$33$| kinds of cancer, and then take the intersection of the three predicted CGC gene sets. Figure 2B shows the overlap when using the voting method HPA; the others are included in Supplementary Figure S1. The overlap is large enough for us to conclude that the information provided by the three feature combinations after repeated iterations is similar.
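The merge-then-intersect step can be expressed with plain set operations. The nested dictionary layout and the toy gene sets below are hypothetical.

```python
def pan_cancer_overlap(hits_by_combination):
    """Merge per-cancer benchmark hits for each combination, then intersect the merged sets."""
    merged = {
        name: set().union(*per_cancer.values())
        for name, per_cancer in hits_by_combination.items()
    }
    shared = set.intersection(*merged.values())
    return merged, shared

# Toy input (made up): benchmark genes found in the top 50 of two cancer types.
hits = {
    "Frequency_PolyPhen": {"BRCA": {"TP53", "PIK3CA"}, "LUAD": {"KRAS", "TP53"}},
    "Frequency_SIFT":     {"BRCA": {"TP53"}, "LUAD": {"KRAS", "EGFR"}},
    "PolyPhen_SIFT":      {"BRCA": {"TP53", "PIK3CA"}, "LUAD": {"KRAS"}},
}
merged_sets, shared_genes = pan_cancer_overlap(hits)
```

The size of `shared_genes` relative to each merged set is what an overlap plot of this kind summarizes.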

Fig. 4

Comparison with four algorithms in three benchmark driver sets. Each group of panels corresponds to one benchmark driver set, in which every box includes results of all |$33$| types of cancer and represents one algorithm.

3.2 Analysis on gene interaction network

Because gene interaction networks are another vital kind of information for prioritizing driver genes, we also investigate the impact of the five individual networks on effectiveness. Concretely, for each network we integrate the results of all three combinations by each of the four voting strategies. Figure 2C displays the voting results obtained with Geometric mean; the others are in Supplementary Figure S2. Each panel represents one reference network. Undirected graphs based on HINT+HI2012, iRefIndex and MultiNet find more CGC genes than those based on DawnRank and Wu’s network; the medians of these three networks are |$50$|%, |$54$|% and |$36$|%, respectively. We find that the graphs obtained from the first three networks include fewer edges than those from the last two, which suggests that our model may be more suitable for sparser networks. Note that the gene interaction networks here have been restricted to the mutated genes in the experimental dataset of each kind of cancer. Hence, this does not mean that the original reference networks are sparse.

3.3 Overlap of first vote vs. second vote

Next, we perform the second vote to integrate the ensemble ranked gene lists of the three combinations using the corresponding voting strategy. The top |$50$| genes are selected as our candidate driver genes. As shown in Figure 2B, |$106$| CGC genes are shared by the three feature combinations when all CGC genes in the top |$50$| of all |$33$| kinds of cancer are merged for each combination. For each of the four voting strategies, we assign the overlap to the individual kinds of cancer and compare it with the CGC genes identified after the second vote. Figure 3A displays the contrast between the two sets when both use the voting method SetExpan; the others are in Supplementary Figure S3. Performing the second vote on the results of the first one clearly prioritizes more CGC genes than only taking the overlap of the three feature combinations, which supports the necessity and rationality of the second vote. The same is true of the other voting strategies. Therefore, we focus on the analysis of the second vote in the following sections.

3.4 Comparison on four voting strategies

As shown in Figure 3B, the results vary dramatically across the |$33$| kinds of cancer. Some of them almost reach a proportion of |$\sim 70$|%, such as |$68$|% for READ with the voting method HPA, while the fractions of others are only about |$30$|%, such as |$34$|% for UVM with Geometric mean. This indicates the heterogeneity of cancer. Compared with the other voting strategies, SetExpan stands out in both quantity and consistency. Hence, we carry out further analysis of the SetExpan results against other benchmark driver sets and other methods for identifying driver genes.

3.5 Gold benchmark driver set

Due to the lack of a generally accepted gold standard (i.e. bona fide cancer driver genes), it has been difficult in previous studies to determine which predictors performed best and which, if any, of the prediction tools performed adequately [17, 20, 55]. However, three systematic benchmark sets are useful indicators of the quality of tools for driver gene prediction. The CGC database manually curates a list of |$723$| genes whose mutations have been causally implicated in cancer [29]. It is widely acknowledged that a higher proportion of predictions in the CGC database indicates better performance [17, 20, 55]. Besides the CGC database, we also consider the NCG 6.0 database, which contains |$2,372$| cancer genes from manually curated publications. In addition to these two databases, a new set of |$568$| driver genes has recently been reported by the IntOGen database. Overlap with the CGC, NCG and IntOGen gene lists serves as a benchmark for cancer driver genes.

As Figure 3C shows, the fraction of known cancer genes predicted for most types of cancer exceeds |$40$|% in CGC and IntOGen. There is one outlier, UVM, whose fractions in these two datasets are |$36$|% and |$30$|%, respectively. In fact, less mutation data is available for UVM than for other kinds of cancer, which may be the reason for its poor performance. The proportion of NCG genes is considerably higher, exceeding |$60$|% for the overwhelming majority of cancer types. Notably, the fraction for HNSC even reaches |$86$|%. In conclusion, the number of known cancer genes identified by our method is large enough for us to believe that our method performs well in many common types of cancer.

3.6 Comparison with four other methods

In this paper, we evaluate our method by comparing it with four established algorithms: DawnRank, MutSigCV, OncodriveFML and SCS. Their predictions of driver genes are obtained from DriverML. Figure 4 displays the proportion of predicted driver genes that are also present in CGC, IntOGen and NCG across |$14$| kinds of cancer from the TCGA database (the specific cancer names are listed in Supplementary Table 1). Each panel represents one tool, and the panels are arranged in order of the median fraction of predicted driver genes in the three benchmark driver sets mentioned above. For a specific tool, the fractions of its predicted drivers in the three benchmark datasets vary among cancer types. Our method ranks first: |$56$|%, |$57$|% and |$77$|% of its predicted candidate drivers belong to CGC, IntOGen and NCG, respectively. On the whole, our method and DawnRank are closely matched in terms of their median values: |$\sim 50$|% in CGC and IntOGen and |$\sim 70$|% in NCG. However, the fractions of predicted drivers in the three databases are generally |$\lt 30$|% using SCS.

3.7 Novel driver gene analysis

It is vital to identify a core set of driver genes that are also predicted by several other methods [24]. The likelihood that predicted driver genes are actually associated with cancer increases with the number of tools that identify them, because false positives of one tool are likely to be discarded by other tools [55, 56]. Therefore, we compile a list of novel predicted driver genes (Supplementary Table 2A) from the non-CGC genes predicted by our method and the four methods above. Several prominent genes are investigated based on current literature reports using CancerMine, a literature-mined database of drivers [57]. CancerMine extracts literature evidence of cancer genes, classifying them as drivers, oncogenes and tumor suppressor genes [13].

We count how many times each non-CGC gene is predicted as a driver across all types of cancer. SOS1, which is regarded as a driver gene in half of the |$14$| kinds of cancer [58, 59], is the most promising candidate from a pan-cancer perspective. There are three papers related to this gene in CancerMine from 2018 to 2020. Researchers have found that SOS1, which is significantly mutated in lung adenocarcinoma, is an oncogene and that mutations in SOS1 are capable of driving tumor formation [60]. In addition, SOS1 plays an essential role in mediating the oncogenic effects of USP22 on gastric tumor growth, possibly via activating the RAS/ERK and PI3K/AKT pathways [61]. Given these functional characteristics, SOS1 may also participate in the development of other types of cancer, i.e. COAD, HNSC, KIRC, KIRP and UCEC, according to our results. Hence, we believe it is worth validating whether SOS1 influences the progression of these tumors and, if so, how it induces cancer in patients.
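The tally described at the start of this paragraph amounts to counting, per gene, the cancer types in which it is predicted while being absent from the benchmark set. A minimal sketch follows; the function name, toy predictions and placeholder benchmark set are made up for illustration.

```python
from collections import Counter

def count_novel_predictions(predictions_by_cancer, known_drivers):
    """Count, per gene, the cancer types in which it is predicted but not a known driver."""
    counts = Counter()
    for genes in predictions_by_cancer.values():
        # Use a set so that a gene counts at most once per cancer type.
        counts.update(set(genes) - known_drivers)
    return counts

toy_predictions = {
    "COAD": ["TP53", "SOS1", "EGR1"],
    "HNSC": ["SOS1", "TP53"],
    "PAAD": ["SOS1", "EGR1", "SMARCC2"],
}
toy_known = {"TP53"}   # placeholder, not the real CGC list
novel_counts = count_novel_predictions(toy_predictions, toy_known)
```

Calling `novel_counts.most_common()` then lists the candidate genes from the most to the least widely predicted.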

We also carry out the same analysis on the top |$100$| and top |$150$| candidate driver genes to uncover more potential driver genes (Supplementary Table 2B, 2C). Although few non-CGC genes are shared by all five methods in every kind of cancer, there are three prominent non-CGC genes in PAAD, all of which are predicted as cancer drivers by more than three methods. Specifically, in addition to the aforementioned SOS1, EGR1 and SMARCC2 may be involved in the development of PAAD.

It was reported in 2005 that early growth response 1 (EGR1) is oncogenic in prostate cancer [62], and other functions of EGR1 in cancer have come to light since then. For example, researchers revealed the stimulation of the EGR1 transcriptional program through the RAS/MAPK signaling pathway in a study of endocrine gland cancer [63]. However, no paper in the CancerMine database addresses the role that EGR1 plays in PAAD.

SMARCC2 is not included in CancerMine, but we found some studies through GeneCards [64], a searchable, integrative database that provides comprehensive information on all annotated and predicted human genes. It has been reported that frameshift mutations in gastric and colorectal cancers lead to premature stops of amino acid synthesis in the SMARCC2 protein and resemble typical loss-of-function mutations. Hence, it was hypothesized that frameshift mutations in SMARCC2 might contribute to tumorigenesis [65]. Later, researchers proposed that the SMARCB1-SMARCC2 subcomplex is required for the assembly and tumor suppression function of the BAF chromatin-remodeling complex [66]. Notably, the tumor-suppression function of SMARCB1 has been demonstrated, and this gene is included in the CGC database. In summary, SMARCC2 deserves further study as a potential driver gene in cancer.

4 Conclusion

In this work, a two-stage-vote ensemble model is designed to discover driver genes of |$33$| types of cancer based on mutation information, gene networks, an iterative framework and voting strategies. Compared with previous studies, the results of the second vote demonstrate the effectiveness of our model. In addition, we conduct an analysis of feature combinations, gene interaction networks and voting strategies.

Nevertheless, our work has limitations. First, we do not assign weights to feature vectors when combining them in pairs as part of the input, although the importance of mutation recurrence and FI may vary across cancer types. Second, our model aims to find driver genes at the population level, whereas individual tumors of the same type are heterogeneous. Therefore, integrating weighted feature scores in a personalized manner is a promising direction.
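One possible form of the weighted combination mentioned above is a convex combination of the two normalized feature scores. The sketch below is only an assumption about how such weighting might look: the function `combine_features`, the min-max normalization, the single `weight` parameter and the toy scores are hypothetical and not part of the present model.

```python
def combine_features(recurrence, functional_impact, weight=0.5):
    """Per-gene convex combination of two min-max normalized feature scores.

    `weight` is the share given to mutation recurrence; the remainder goes
    to functional impact (FI). A gene missing from one feature scores 0
    there. Both input dictionaries are assumed to be non-empty.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")

    def min_max(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = hi - lo
        # A constant feature carries no ranking signal; map it to 0.
        return {g: (v - lo) / span if span else 0.0 for g, v in scores.items()}

    rec, fi = min_max(recurrence), min_max(functional_impact)
    return {
        g: weight * rec.get(g, 0.0) + (1.0 - weight) * fi.get(g, 0.0)
        for g in set(rec) | set(fi)
    }

# Toy scores (illustrative only): mutation counts and FI scores per gene.
recurrence = {"TP53": 30, "KRAS": 20, "EGR1": 10}
functional_impact = {"TP53": 3.0, "KRAS": 5.0, "EGR1": 1.0}
scores = combine_features(recurrence, functional_impact, weight=0.8)
```

In this toy case a recurrence-heavy weight (0.8) ranks TP53 above KRAS, while an FI-heavy weight (0.2) reverses the order, which is the kind of cancer-type-specific effect a tuned weight would capture.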

Key Points
  • We propose a two-stage-vote ensemble framework to identify more known cancer genes. This model is more precise than four existing methods on three benchmark driver sets.

  • We analyze feature combinations and gene interaction networks after the first vote and evaluate four voting strategies after the second vote. The analysis identifies the combinations and networks best suited to our model, and the evaluation determines the better voting method on the basis of the final ranked gene lists.

  • We uncover a number of potential driver genes at the pan-cancer level and in individual cancer types, some of whose roles in tumors remain unclear.

Funding

This work is supported by grants from the National Natural Science Foundation of China (NSFC 61972280) and the National Key R&D Program of China (2020YFA0908400).

Data Availability

The data and results are available from https://github.com/guofei-tju/Two-stage-vote-ensemble-framework.

Yingxin Kan is currently a master's degree candidate at Tianjin University. Her research interests include bioinformatics and machine learning.

Limin Jiang is a PhD candidate at Tianjin University. Her research interests include bioinformatics and machine learning.

Jijun Tang is a professor at the University of South Carolina. His main research interests include computational biology and algorithms.

Yan Guo is an associate professor at the University of New Mexico, Comprehensive Cancer Center. His main research interests include bioinformatics.

Fei Guo is a professor at Central South University. Her research interests include bioinformatics and computational biology.

References

1. Bray F, Ren J, Masuyer E, et al. Global estimates of cancer prevalence for 27 sites in the adult population in 2008. Int J Cancer 2013;132(5):1133–45.

2. Tarver T. Journal of Consumer Health On the Internet. Atlanta, GA: American Cancer Society, 2012.

3. The ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium. Pan-cancer analysis of whole genomes. Nature 2020;578(7793):82.

4. Kan Y, Jiang L, Tang J, et al. A systematic view of computational methods for identifying driver genes based on somatic mutation data. Brief Funct Genomics 2021;20(5):333–43.

5. Cancer Genome Atlas Research Network. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 2008;455(7216):1061.

6. Lawrence MS, Stojanov P, Polak P, et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature 2013;499(7457):214–8.

7. Mularoni L, Sabarinathan R, Deu-Pons J, et al. Oncodrivefml: a general framework to identify coding and noncoding regions with cancer driver mutations. Genome Biol 2016;17(1):128.

8. Sabarinathan R, Tafer H, Seemann SE, et al. RNAsnp: efficient detection of local RNA secondary structure changes induced by SNPs. Hum Mutat 2013;34(4):546–56.

9. Ozturk K, Dow M, Carlin DE, et al. The emerging potential for network analysis to inform precision cancer medicine. J Mol Biol 2018;430(18):2875–99.

10. Liu X, Wang Y, Ji H, et al. Personalized characterization of diseases using sample-specific networks. Nucleic Acids Res 2016;44(22):e164–4.

11. Yu X, Zhang J, Sun S, et al. Individual-specific edge-network analysis for disease prediction. Nucleic Acids Res 2017;45(20):e170–0.

12. Paull EO, Carlin DE, Niepel M, et al. Discovering causal pathways linking genomic events to transcriptional states using tied diffusion through interacting events (TieDIE). Bioinformatics 2013;29(21):2757–64.

13. Cutigi JF, Evangelista RF, Ramos RH, et al. Combining mutation and gene network data in a machine learning approach for false-positive cancer driver gene discovery. In: Advances in Bioinformatics and Computational Biology, 13th Brazilian Symposium on Bioinformatics, BSB. São Paulo, Brazil: Springer International Publishing, Cham, 2020.

14. Liu H, Ren G, Chen H, et al. Predicting lncRNA–miRNA interactions based on logistic matrix factorization with neighborhood regularized. Knowl Based Syst 2020;191:105261.

15. Zhang L, Yang P, Feng H, et al. Using network distance analysis to predict lncRNA–miRNA interactions. Interdiscip Sci 2021;191:1–11.

16. Chen L, Li J, Chang M. Cancer diagnosis and disease gene identification via statistical machine learning. Curr Bioinform 2020;15(9):956–62.

17. Bashashati A, Haffari G, Ding J, et al. Drivernet: uncovering the impact of somatic driver mutations on transcriptional networks in cancer. Genome Biol 2012;13(12):R124.

18. Bertrand D, Chng KR, Sherbaf FG, et al. Patient-specific driver gene prediction and risk assessment through integrated network analysis of cancer omics profiles. Nucleic Acids Res 2015;43(7):44.

19. Guo W-F, Zhang S-W, Liu L-L, et al. Discovering personalized driver mutation profiles of single samples in cancer by network control strategy. Bioinformatics 2018;34(11):1893–903.

20. Hou JP, Jian M. DawnRank: discovering personalized driver genes in cancer. Genome Med 2014;6(7):56.

21. Hui Y, Wei PJ, Xia J, et al. Mecorank: cancer driver genes discovery simultaneously evaluating the impact of SNVs and differential expression on transcriptional networks. BMC Med Genomics 2019;12(S7):1–10.

22. Brin S, Page L. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems 1998;30(1–7):107–17.

23. Pihur V, Datta S, Datta S. Finding common genes in multiple cancer types through meta-analysis of microarray experiments: a rank aggregation approach. Genomics 2008;92(6):400–3.

24. Han Y, Yang J, Qian X, et al. Driverml: a machine learning algorithm for identifying driver genes in cancer sequencing studies. Nucleic Acids Res 2019;8:e45–5.

25. Gumpinger AC, Lage K, Horn H, et al. Prediction of cancer driver genes through network-based moment propagation of mutation scores. Bioinformatics 2020;36(Supplement_1):i508–15.

26. Kleinberg JM. Authoritative sources in a hyperlinked environment. In: Proceedings of the 9th Annual ACM-SIAM Symposium on Discrete Algorithms. San Francisco, California: Association for Computing Machinery, 1998, 668–77.

27. Fujita S, Kobayashi H, Okumura M. Unsupervised Ensemble of Ranking Models for News Comments Using Pseudo Answers. In: Advances in Information Retrieval, 42nd European Conference on IR Research, ECIR 2020. Lisbon, Portugal: Springer International Publishing, Cham, April 14–17, 2020, Proceedings, Part II, 2020.

28. Shen J, Wu Z, Lei D, et al. Setexpan: corpus-based set expansion via context feature selection and rank ensemble. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Skopje, Macedonia: Springer, 2017, 288–304.

29. Futreal PA, Coin L, Marshall M, et al. A census of human cancer genes. Nat Rev Cancer 2004;4(3):177–83.

30. Repana D, Nulsen J, Dressler L, et al. The network of cancer genes (ncg): a comprehensive catalogue of known and candidate cancer genes from cancer sequencing screens. Genome Biol 2019;20(1):1–12.

31. Martínez-Jiménez F, Muiños F, Sentís I, et al. A compendium of mutational cancer driver genes. Nat Rev Cancer 2020;20(10):555–72.

32. Vandin F, Upfal E, Raphael BJ. Algorithms for detecting significantly mutated pathways in cancer. J Comput Biol 2011;18(3):507–22.

33. Leiserson MDM, Vandin F, Wu H-T, et al. Pan-cancer network analysis identifies combinations of rare somatic mutations across pathways and protein complexes. Nat Genet 2015;47(2):106–14.

34. Hristov BH, Chazelle B, Singh M. A Guided Network Propagation Approach to Identify Disease Genes that Combines Prior and New Information. In: Schwartz R (ed). Res Comput Mol Biol. Springer, Cham, 2020, 251–2.

35. Shuai S, Abascal F, Amin SB, et al. Combined burden and functional impact tests for cancer driver discovery using driverpower. Nat Commun 2020;11(1):1–12.

36. Abel GP, Nuria LB. Functional impact bias reveals cancer drivers. Nucleic Acids Res 2012;40(21):e169.

37. Ng PC, Henikoff S. Sift: predicting amino acid changes that affect protein function. Nucleic Acids Res 2003;31(13):3812–4.

38. Kumar P, Henikoff S, Ng PC. Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nat Protoc 2009;4(8):1073–81.

39. Adzhubei IA, Schmidt S, Peshkin L, et al. A method and server for predicting damaging missense mutations. Nat Methods 2010;7(4):248–9.

40. Razick S, Magklaras G, Donaldson IM. iRefIndex: a consolidated protein interaction database with provenance. BMC Bioinform 2008;9(1):1–19.

41. Ekta K, Yao F, Jieming C, et al. Interpretation of genomic variants using a unified biological network approach. PLoS Comput Biol 2013;9(3):e1002886.

42. Wu G, Xin F, Stein L. A human functional protein interaction network and its application to cancer data analysis. Genome Biol 2010;11(5):1–23.

43. Ciriello G, Cerami E, Sander C, et al. Mutual exclusivity analysis identifies oncogenic network modules. Genome Res 2011;22(2):398–406.

44. David C, Gavin O, Wu G, et al. Reactome: a database of reactions, pathways and biological processes. Nucleic Acids Res 2011;39(Database issue):D691.

45. Schaefer CF, Kira A, Shiva K, et al. Pid: the pathway interaction database. Nucleic Acids Res 2009;(suppl_1):674–9.

46. Minoru K, Susumu G, Yoko S, et al. Kegg for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res 2012;40(D1):D109–14.

47. Das J, Hint HY. High-quality protein interactomes and their applications in understanding human disease. BMC Syst Biol 2012;6(1):1–12.

48. Yu H, Tardivo L, Tam S, et al. Next-generation sequencing to generate interactome datasets. Nat Methods 2011;8(6):478–80.

49. Bader GD, IanD CW, et al. BIND: the biomolecular interaction network database. Nucleic Acids Res 2003;29(1):242.

50. Alfarano C, Andrade C, Anthony K, et al. The biomolecular interaction network database and related tools 2005 update. Nucleic Acids Res 2005;33(suppl_1):D418–24.

51. Stark C, Breitkreutz BJ, Reguly T, et al. BioGRID: a general repository for interaction datasets. Nucleic Acids Res 2006;34(suppl_1):D535–9.

52. Kerrien S, Alam-Faruque Y, Aranda B, et al. Intact-open source resource for molecular interaction data. Nucleic Acids Res 2007;35(suppl_1):D561–5.

53. Hermjakob H, Montecchi-Palazzi L, Lewington C, et al. IntAct: an open source molecular interaction database. Nucleic Acids Res 2004;32(suppl_1):D452–5.

54. He X, Ming G, Kan MY, et al. Birank: towards ranking on bipartite graphs. IEEE Trans Knowl Data Eng 2016;29(1):57–71.

55. Tokheim C, Papadopoulos N, Kinzler KW, et al. Evaluating the evaluation of cancer driver genes. Proc Natl Acad Sci U S A 2016;113(50):14330–5.

56. Tamborero D, Gonzalez-Perez A, Perez-Llamas C, et al. Comprehensive identification of mutational cancer driver genes across 12 tumor types. Sci Rep 2013;3:2650.

57. Lever J, Zhao EY, Grewal J, et al. Cancermine: a literature-mined resource for drivers, oncogenes and tumor suppressors in cancer. Nat Methods 2019;16(6):505–7.

58. Li J, Chang M, Gao Q, et al. Lung cancer classification and gene selection by combining affinity propagation clustering and sparse group lasso. Curr Bioinform 2020;15(7):703–12.

59. Zhuang J, Dai S, Zhang L, et al. Identifying breast cancer-induced gene perturbations and its application in guiding drug repurposing. Curr Bioinform 2020;15(9):1075–89.

60. Cai D, Choi PS, Gelbard M, et al. Identification and characterization of oncogenic sos1 mutations in lung adenocarcinoma. Mol Cancer Res 2019;17(4):1002–12.

61. Lim CC, Xu JC, Chen TY, et al. Ubiquitin-specific peptide 22 acts as an oncogene in gastric cancer in a son of sevenless 1-dependent manner. Cancer Cell Int 2020;20(1):1–14.

62. Baron V, Adamson ED, Calogero A, et al. The transcription factor Egr1 is a direct regulator of multiple tumor suppressors including TGFβ1, PTEN, p53, and fibronectin. Cancer Gene Ther 2006;13(2):115–24.

63. Zhang M, Vandana JJ, Lacko L, et al. Modeling cancer progression using human pluripotent stem cell-derived cells and organoids. Stem Cell Res 2020;49:102063.

64. Stelzer G, Rosen N, Plaschkes I, et al. The genecards suite: from gene data mining to disease genome sequence analyses. Curr Protoc Bioinformatics 2016;54(1):1–30.

65. Kim SS, Kim MS, Yoo NJ, et al. Frameshift mutations of a chromatin-remodeling gene smarcc2 in gastric and colorectal cancers with microsatellite instability. Acta Pathol Microbiol Immunol Scand 2013;2013–1212(2):168–9.

66. Chen G, Zhou H, Liu B, et al. A heterotrimeric smarcb1–smarcc2 subcomplex is required for the assembly and tumor suppression function of the baf chromatin-remodeling complex. Cell Discov 2020;6(1):1–5.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)