Abstract

Motivation

Medical genomics faces significant challenges in interpreting disease phenotypes and genetic heterogeneity. Despite the establishment of standardized disease phenotype databases, computational methods for predicting gene–phenotype associations still suffer from imbalanced category distributions and a lack of labeled data in small categories.

Results

To address the problem of labeled-data scarcity, we propose a self-supervised learning strategy for gene–phenotype association prediction, called SSLpheno. Our approach utilizes an attributed network that integrates protein–protein interactions and gene ontology data. We apply a Laplacian-based filter to ensure feature smoothness and use self-supervised training to optimize node feature representations. Specifically, we compute the cosine similarity of feature vectors and select positive and negative sample pairs to construct training labels. We then employ a deep neural network for multi-label phenotype classification in the downstream task. Our experimental results demonstrate that SSLpheno outperforms state-of-the-art methods, especially in categories with few annotations. Moreover, our case studies illustrate the potential of SSLpheno as an effective prescreening tool for identifying gene–phenotype associations.

Availability and implementation

https://github.com/bixuehua/SSLpheno.

1 Introduction

In clinical practice, phenotypes are used to describe deviations from normal morphology, physiology, or behavior (Robinson 2012). Clinicians use phenotypes to identify abnormal conditions through historical or examination data. From a genetics perspective, genes are closely related to the formation of disease phenotypes. In recent years, a significant number of candidate genes linked with disease phenotypes have been discovered (Bastarache et al. 2022, Horowitz et al. 2022). These discoveries have greatly improved our understanding of the genetic basis of diseases, as well as the development of new drugs and novel methods for disease treatment. However, the connection between genes and phenotypes is not straightforward. Typically, mutations in the same gene can cause multiple syndromes, and mutations in different genes can cause the same disorder (Oti and Brunner 2007). Therefore, significant challenges remain in uncovering the links between genes and phenotypes.

Traditional methods for locating associations between genes and phenotypes, such as linkage mapping and genome-wide association studies (GWAS) (Beck et al. 2020), are based on the hypothesis of “common disease, common variation,” which does not account for all disease risk. With the proposal of the Human Phenome Project, standardized human phenotype databases, such as the Human Phenotype Ontology (HPO) (Köhler et al. 2021), have facilitated the development of computational methods for predicting gene–phenotype associations (Xue et al. 2019a, Chen et al. 2021, Yuan et al. 2022b). These methods use large amounts of phenotype and gene-related data and employ various algorithms to predict disease-causing genes, showing promising performance.

Numerous studies (Köhler et al. 2009, Hoehndorf et al. 2011, Smedley et al. 2013) have attempted to establish gene–phenotype associations by calculating phenotype similarity, based on the hypothesis that similar or interacting genes are associated with similar phenotypes. However, these approaches are limited by the incompleteness of known gene–phenotype associations. Gene–phenotype data from model organisms can compensate for the lack of human data and increase genome coverage, and researchers have compared phenotype traits across species to predict human gene–phenotype associations (Washington et al. 2009, Wang et al. 2013, Bone et al. 2016). The use of model organisms can expand the range of candidate genes, but only genetic and phenotypic information from direct homologs and similar anatomical structures provides a valuable reference (Alghamdi et al. 2022). Overall, these similarity-based methods have produced multiple similarity measures (Kulmanov et al. 2021), which provide diverse ideas for constructing similarity networks in network-based methods.

Information propagation methods, such as label propagation and random walk methods, rely on a constructed network to predict gene–phenotype associations (Xie et al. 2015, Petegrosso et al. 2017). Gene–gene associations (GGAs) are typically represented with genes as nodes and functional associations as edges, and can be described by gene co-expression networks (Luo et al. 2007) and genetic interactions (Hu et al. 2021). Protein–protein interaction (PPI) networks are an important type of GGA that captures gene relationships (Xiang et al. 2022). The quality of the constructed network and its construction method affect the prediction performance of the model, as false-positive data can introduce additional bias into both the protein network and the phenotype similarity matrix (Ranea et al. 2022). Moreover, Dual Label Propagation (DLP) was introduced to predict the associations of genes with their most specific annotated phenotypes in HPO by simultaneously reconstructing GO term–gene associations and HPO phenotype–gene associations (Petegrosso et al. 2017). However, these methods may be unable to predict phenotypic annotations for isolated nodes in the network.

Deep learning has also shown powerful capability in bioinformatics due to its ability to automatically extract features from complex biological data (Peng et al. 2021, Yuan et al. 2022a, Zhang et al. 2022). This technology has been widely applied across bioinformatics, including the analysis of genes and diseases from different omics data, such as the genome (Yuan and Bar-Joseph 2019), transcriptome (Wang et al. 2023b), proteome (Senior et al. 2020), and metabolome (Wang et al. 2022a). Researchers have applied deep learning to predict gene–phenotype associations (Kulmanov and Hoehndorf 2020, Pourreza Shahri and Kahanda 2021), regarding the task as multi-label classification and using GGA networks or gene functional information to predict phenotypes. For example, Kulmanov and Hoehndorf (2020) developed DeepPheno, a neural network-based hierarchical multi-label classification method for predicting abnormal phenotypes of a group of genes with missing functional annotations. Pourreza Shahri and Kahanda (2021) proposed PPPredSS, a deep semi-supervised fusion framework that combines deep neural networks (DNNs), semi-supervised learning, and ensemble learning to predict protein–phenotype relationships on extended unlabeled datasets starting from a small labeled dataset. Since phenotypes arise from molecular network aberrations and physiological interactions within and between cells, tissues, and organs, knowledge about molecular interactions is crucial. GNN-based prediction models, such as HPOFiller (Liu et al. 2021), HPODNets (Liu et al. 2022b), and GraphPheno (Liu et al. 2022c), have a strong advantage in predicting phenotypes from molecular networks. HPOFiller constructs a PPI network, an HPO similarity network, and a protein–phenotype bipartite graph, and then iteratively refines the representations of the three networks with GCNs to obtain protein–phenotype relationships. HPODNets learns high-order information about proteins in multi-source PPI networks using eight GCNs, and then fuses the low-dimensional vectors to obtain HPO predictions in an end-to-end manner. GraphPheno integrates human protein sequences and PPIs and employs a graph autoencoder to identify novel protein–phenotype associations.

Although deep learning methods have made significant progress in predicting gene–phenotype associations, they require large-scale labeled data for training. The main challenge is that the roles of a large number of genes in abnormal phenotypes remain unknown. As of March 2023, the HPO database contains phenotype annotations for only 4895 genes. This number is small compared with the roughly 25 000 human genes, which hinders the performance of supervised deep learning in gene–phenotype association prediction.

To address the scarcity of labeled data, self-supervised learning (SSL) methods have been introduced for many bioinformatics tasks, such as molecular property prediction (Zheng et al. 2023) and compound–protein affinity and contact prediction (You and Shen 2022). Recently, SSL on graphs has also made great progress (Liu et al. 2022d). Graph contrastive learning (GCL) utilizes data augmentation to learn feature embeddings of graphs (Qiu et al. 2020). KAUR (Ma et al. 2023) is a knowledge-based GCL method that regards nodes and their semantic neighbors as positive contrastive pairs to enhance node representations. CasANGCL (Zheng et al. 2023) adopts three augmentation methods, attribute masking, edge perturbation, and subgraph extraction, to construct positive and negative samples for molecular property prediction. ATMOL (Liu et al. 2022a) constructs an attention-wise masking module that masks the nodes with high attention weights to produce the augmented view. The performance of GCL heavily relies on well-designed data augmentation strategies; it is therefore crucial to design an appropriate augmentation method for SSL.

Inspired by these SSL methods, we introduce an SSL-based pre-training model, named SSLpheno, to improve gene–phenotype association prediction. The overall process of SSLpheno is shown in Fig. 1. First, by integrating PPIs and gene ontology, we construct a gene attributed network for pre-training and design a filter to smooth the node features. Next, we generate the labels of the SSL training set by computing the cosine similarity of the preprocessed node features. During training, we iteratively select sets of positive and negative samples to enhance the feature representations of nodes. Finally, we use a supervised DNN to predict gene–phenotype associations via multi-label classification. Experimental results demonstrate that SSLpheno outperforms state-of-the-art methods for gene–phenotype association prediction. An ablation study and experiments on different GGA networks show that the design of SSLpheno is sound. We also conduct case studies on the phenotypes and gene sets predicted by our model; the results indicate the practical value of SSLpheno in gene–phenotype association prediction.

Figure 1. The workflow of SSLpheno. ① Attributed network construction and feature preprocessing. ② Self-supervised learning for pre-training. ③ Multi-label classification for phenotype prediction.

2 Materials and methods

2.1 Datasets

We utilize two datasets to facilitate model comparison, constructed from the HPO (Schriml et al. 2022) and DisGeNET (Piñero et al. 2020). The HPO database provides ontologies of human disease phenotypes and reliable associations between genes and phenotypes. We download the gene–phenotype associations released on 21 August 2020, containing 4455 genes, 8093 phenotypes, and 176 559 annotations, from https://hpo.jax.org. Since the directed acyclic graph structure of the phenotype ontology follows the true-path rule, we propagate annotations over the obtained gene–phenotype relationships. To map genes to proteins, we use the UniProt/Swiss-Prot mapping tool (https://www.uniprot.org/id-mapping) and select the genes that map to manually reviewed and annotated proteins. Then, following previous works (Kulmanov and Hoehndorf 2020, Liu et al. 2021), we retain the terms in the Phenotypic Abnormality branch of the phenotype ontology to ensure a fair comparison with the baseline models. Furthermore, we remove terms with ≤10 annotations, as in GraphPheno, HPODNets, and HPOFiller; this is the partial evaluation mode proposed by CAFA2 (Jiang et al. 2016) to ensure model stability. In total, we obtain a dataset, referred to as “HPO_2020,” containing 3709 genes, 3895 phenotypes, and 525 056 annotations, with an average of 141.6 annotations per gene. We divide the phenotype terms in the dataset into four groups according to the number of annotated genes: “11–30,” “31–100,” “101–300,” and “>300” (Xue et al. 2019b, Liu et al. 2022b). DisGeNET is a platform providing abundant information about human disease-associated genes and variants. Using the disgenet2r package (Piñero et al. 2020), we filter for records whose disease type is “phenotype” to generate a gene–phenotype association dataset. After the same preprocessing as for HPO_2020, we obtain 1 657 434 gene–phenotype associations involving 15 271 genes and 4755 phenotypes, with an average of 108.5 annotations per gene. Table 1 shows the statistics of the two datasets, grouped by the number of genes annotated with each phenotype.
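
The true-path-rule propagation step can be sketched as follows; `ancestors` and `propagate` are hypothetical helpers, and the term IDs and parent links are toy stand-ins for the real ontology:

```python
# Sketch of true-path-rule propagation: a gene annotated with a phenotype
# term is implicitly annotated with all ancestors of that term in the
# ontology DAG. The term IDs and parent links below are illustrative only.
def ancestors(term, parents):
    """Return the set containing `term` and all of its ancestors."""
    seen = set()
    stack = [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, []))
    return seen

def propagate(gene_annotations, parents):
    """Expand each gene's direct annotations up the DAG."""
    return {
        gene: set().union(*(ancestors(t, parents) for t in terms))
        for gene, terms in gene_annotations.items()
    }

# Toy ontology: HP:3 -> HP:2 -> HP:1 (root)
parents = {"HP:3": ["HP:2"], "HP:2": ["HP:1"]}
annotations = {"GENE_A": {"HP:3"}}
# GENE_A now carries HP:3 plus its ancestors HP:2 and HP:1
expanded = propagate(annotations, parents)
```

Annotation counts per gene (141.6 on average for HPO_2020) are computed after this expansion, which is why they are much larger than the raw annotation counts.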

Table 1. Statistics of genes, phenotype terms, and annotations in the datasets, with phenotype terms grouped by the number of annotated genes.

Dataset     Genes    Phenotype terms                        Avg. annotations per gene
                     11–30    31–100   101–300   ≥301
HPO_2020    3709     1624     1235     729       441       141.6
DisGeNET    15 271   1552     1370     879       954       108.5

2.2 Proposed methods

SSLpheno is constructed in three steps. First, we build a gene attributed network from PPIs and gene ontology, and design a Laplacian smoothing filter to denoise the high-frequency components of the feature matrix. Second, SSL is used for pre-training, during which rich feature representations are learned through a well-designed pretext task to provide more generalizable gene features for the downstream task. Finally, we feed the pre-trained features into a multi-layer neural network to predict phenotypes.

2.2.1 Attributed network construction and feature preprocessing

Genes and their products participate in various life activities of organisms, mainly through direct interactions or indirect synergistic interactions, collectively known as GGAs (Guala et al. 2020). In this study, we construct GGAs using PPIs from the STRING database (Szklarczyk et al. 2021) (https://string-db.org/, v11.0). The “combined score” provided by STRING is scaled to [0, 1]. Following previous works (Fan et al. 2020, Liu et al. 2022c), we set the edge between two mapped genes to 1 if the score of the corresponding proteins is at least 0.3, and 0 otherwise, so that only confident interactions are used. Considering that functionally related genes produce similar phenotypes (Claussnitzer et al. 2020), we use GO annotations as the attributes of nodes in the GGAs. The GO annotations are obtained from UniProt/Swiss-Prot and comprise 13 778 classes. We encode the GO terms of each gene as a 13 778-dimensional binary vector, where the i-th bit is 1 if the gene has the i-th GO annotation. Considering the difficulty that high-dimensional sparse features pose for graph representation learning, we use principal component analysis (PCA) to reduce the dimensionality and alleviate sparsity. Finally, we obtain the attributed network used in this study, formally defined as G = {V, E, X}, where V = {v_1, v_2, …, v_n} is the node set of the graph, v_i denotes gene i, and n is the number of nodes. E is the edge set, where each edge represents the relationship between two genes. X = [x_1, x_2, …, x_n]^T is the attribute matrix, where x_i ∈ R^m denotes the attribute representation of gene node v_i. The adjacency matrix is defined as A = {a_ij | 1 ≤ i, j ≤ n} ∈ R^{n×n}, where a_ij = 1 if (v_i, v_j) ∈ E. After PCA processing, the original attribute dimension m is reduced from 13 778 to 1000, which is experimentally verified to be optimal, as shown in Supplementary Fig. S5.
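
A minimal NumPy sketch of this construction, with a toy score matrix and GO vectors standing in for the real 13 778-dimensional data, and PCA implemented via SVD:

```python
import numpy as np

# Toy stand-ins: the paper scales STRING's combined score to [0, 1],
# keeps an edge when the score is >= 0.3, and reduces 13 778-dim binary
# GO vectors to 1000 dimensions with PCA. Here: 3 genes, 50-dim GO, 2 PCs.
scores = np.array([[0.0, 0.8, 0.1],
                   [0.8, 0.0, 0.4],
                   [0.1, 0.4, 0.0]])            # scaled combined scores
A = (scores >= 0.3).astype(float)               # binarized adjacency
np.fill_diagonal(A, 0.0)                        # no self-loops yet

rng = np.random.default_rng(0)
go = rng.integers(0, 2, size=(3, 50)).astype(float)  # binary GO attributes

def pca(X, n_components):
    """Project X onto its top principal components (via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

X = pca(go, 2)                                  # reduced attribute matrix
```

The real pipeline would use all mapped genes and `n_components=1000`; this sketch only shows the thresholding and dimensionality-reduction steps.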

Given the constructed attributed network, a feature filter based on the Laplacian is designed for feature smoothing. The embedding of nodes in an attributed graph should consider both the similarity of node attributes and the consistency of the graph topology (Huang et al. 2017). To preserve self-information, we define the adjacency matrix with added self-connections as Ã = I + A, where I is the identity matrix. D̃ = diag(d̃_1, d̃_2, …, d̃_n) ∈ R^{n×n} denotes the degree matrix of Ã, where d̃_i = Σ_{v_j∈V} ã_ij is the degree of node v_i. L̃ = D̃ − Ã is the graph Laplacian corresponding to Ã. We employ the symmetric normalized graph Laplacian, L̃_sym = D̃^{−1/2} L̃ D̃^{−1/2}. The filter based on the generalized Laplacian smoothing filter is defined as

H = I − α L̃_sym        (1)

Different works choose different values of α (Wang et al. 2019, Cui et al. 2020); we set α = 2/3 according to the experimental results shown in Supplementary Fig. S6. After applying the filter t times to the gene feature matrix, we obtain the filtered feature matrix as

X̃ = H^t X        (2)
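
Assuming the generalized Laplacian smoothing form H = I − α L̃_sym applied t times, the filtering step can be sketched in NumPy; the two-node graph below is purely illustrative:

```python
import numpy as np

def laplacian_filter(A, alpha=2/3, t=2):
    """Smoothing filter H^t with H = I - alpha * L_sym, where L_sym is the
    symmetric normalized Laplacian of A with self-loops added."""
    n = A.shape[0]
    A_tilde = A + np.eye(n)                      # add self-connections
    d = A_tilde.sum(axis=1)                      # degrees of A_tilde
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = D_inv_sqrt @ (np.diag(d) - A_tilde) @ D_inv_sqrt
    H = np.eye(n) - alpha * L_sym
    return np.linalg.matrix_power(H, t)

A = np.array([[0.0, 1.0],
              [1.0, 0.0]])                       # toy 2-gene graph
X = np.eye(2)                                    # toy feature matrix
X_filtered = laplacian_filter(A) @ X             # smoothed features H^t X
```

Note that the smoothed rows of connected nodes move toward each other, which is the intended low-pass behavior of the filter.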

2.2.2 SSL for pre-training

The purpose of pre-training is to generate enhanced low-dimensional embedding representations for the nodes of a graph (Hu et al. 2019). Given the filtered node features X̃, we encode the node embeddings using a linear encoder

Z = X̃ W        (3)

where W is the weight matrix. Min–max scaling is used to scale the embeddings to [0, 1].

The adjacency matrix is usually chosen as the ground-truth label of node pairs in the Graph Auto-Encoder (GAE) (Kipf and Welling 2016). However, it only records one-hop structural information, which is insufficient for attributed graph embedding. Inspired by the Adaptive Graph Encoder (Cui et al. 2020), we instead define the labels of gene pairs using their similarity. Specifically, we first compute the cosine similarity of gene pairs, S = {s_ij | 1 ≤ i, j ≤ n} ∈ R^{n×n}, which is defined as

s_ij = (z_i · z_j) / (‖z_i‖ ‖z_j‖)        (4)

where Z is the matrix of gene features encoded by the linear encoder in Equation (3). We then rank the similarity values and select positive and negative samples for training: the top k gene pairs (v_i, v_j) with the highest similarity are selected as positive samples, and the k pairs with the lowest similarity as negative samples. The resulting training set consists of 2k gene pairs.
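
The pair-selection step can be sketched as follows; `select_pairs` is a hypothetical helper operating on the encoded features Z, and the toy matrix below is ours:

```python
import numpy as np

def select_pairs(Z, k):
    """Rank off-diagonal cosine similarities of the rows of Z and return
    the k most-similar pairs (positives) and k least-similar (negatives)."""
    U = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    S = U @ U.T                                   # cosine similarity matrix
    iu = np.triu_indices_from(S, k=1)             # each unordered pair once
    pairs = [(int(i), int(j)) for i, j in zip(iu[0], iu[1])]
    order = np.argsort(S[iu])                     # ascending similarity
    negatives = [pairs[i] for i in order[:k]]
    positives = [pairs[i] for i in order[-k:]]
    return positives, negatives

# Toy embeddings: genes 0 and 1 point the same way, gene 2 is orthogonal
Z = np.array([[1.0, 0.0],
              [0.9, 0.1],
              [0.0, 1.0]])
positives, negatives = select_pairs(Z, k=1)
```

In the actual model this selection is repeated each iteration on the current embeddings, so the training labels adapt as the encoder improves.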

To train the gene-pair auto-encoder, we use cross-entropy as the loss function:

L_self = − Σ_{(v_i, v_j)} [ y_self log s_ij + (1 − y_self) log(1 − s_ij) ]        (5)

where y_self is the training label of the gene pair (v_i, v_j), set to 1 if the pair is positive and 0 if it is negative, and s_ij is the similarity of the gene pair (v_i, v_j). After the SSL, we obtain the final gene representations Z = {z_1, z_2, …, z_n}, where z_i ∈ R^d is the trained feature vector of gene v_i; its dimension d is experimentally set to 2048. The details are shown in Supplementary Fig. S7.
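
As a minimal sketch of this loss, assuming the pair similarities already lie in [0, 1] after the min–max scaling of the embeddings, the cross-entropy over selected pairs can be computed as below (`pair_bce` and the toy values are ours):

```python
import numpy as np

def pair_bce(sim, labels, eps=1e-9):
    """Binary cross-entropy between pair similarities (in [0, 1]) and the
    self-supervised labels (1 = positive pair, 0 = negative pair)."""
    sim = np.clip(sim, eps, 1 - eps)   # guard against log(0)
    return -np.mean(labels * np.log(sim) + (1 - labels) * np.log(1 - sim))

# Toy batch: one positive pair scored 0.9, one negative pair scored 0.1
loss = pair_bce(np.array([0.9, 0.1]), np.array([1.0, 0.0]))
```

Driving this loss down pushes positive pairs toward similarity 1 and negative pairs toward 0, which is what sharpens the embeddings during pre-training.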

2.2.3 DNN classifier for prediction

DNNs have powerful performance on supervised learning tasks (LeCun et al. 2015). We use a DNN as the classifier for predicting phenotypes. The input is the gene feature vectors produced by the pre-trained model, with three hidden layers stacked between the input and output layers. The output layer consists of 3895 phenotype classes. In each layer, we first apply a dense function f(⋅) and Batch Normalization to the input features, and then apply LeakyReLU activation. Each layer is defined as

Z^(l+1) = LeakyReLU(BN(f(Z^(l))))        (6)

where Z^(l) is the input of the l-th layer and Z^(l+1) is its output.

For the output layer, we use a sigmoid activation function σ(⋅). During training, gradient descent is used together with the cross-entropy loss. The DNN classifier loss function to be optimized is as follows

L_cla = − Σ [ y_cla log p + (1 − y_cla) log(1 − p) ]        (7)

where y_cla is the label of a gene in the training set and p is the predicted probability. We use a batch size of 128 and an initial learning rate of 0.001.
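
A minimal PyTorch sketch of such a classifier; the hidden-layer widths are illustrative assumptions, while the 2048-dimensional input, the three Dense/BatchNorm/LeakyReLU hidden layers, the sigmoid output over 3895 phenotype classes, and the cross-entropy loss follow the text:

```python
import torch
import torch.nn as nn

class PhenoClassifier(nn.Module):
    """Three hidden layers (Dense -> BatchNorm -> LeakyReLU), then a
    sigmoid multi-label output over the phenotype classes."""
    def __init__(self, in_dim=2048, hidden=(1024, 512, 256), n_pheno=3895):
        super().__init__()
        layers, prev = [], in_dim
        for h in hidden:
            layers += [nn.Linear(prev, h), nn.BatchNorm1d(h), nn.LeakyReLU()]
            prev = h
        layers.append(nn.Linear(prev, n_pheno))  # logits; sigmoid applied below
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return torch.sigmoid(self.net(x))

model = PhenoClassifier()
probs = model(torch.randn(4, 2048))              # toy batch of 4 genes
loss = nn.BCELoss()(probs, torch.zeros(4, 3895)) # multi-label cross-entropy
```

In practice the input would be the pre-trained 2048-dimensional gene embeddings, trained with Adam at the stated batch size and learning rate.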

2.3 Parameters

SSLpheno is implemented in PyTorch and trained with the Adam optimizer. We use an empirical, heuristic approach to hyper-parameter tuning via grid search. The default hyper-parameter settings, search scopes, and optimized values are summarized in Supplementary Table S2. All subsequent experiments use the optimal hyper-parameter settings.

3 Results and discussion

3.1 Evaluation criteria

In this work, we employ the area under the precision–recall curve (AUPR) and the maximum F1 score (Fmax) as evaluation metrics. They are commonly used metrics for gene–phenotype association prediction (Kulmanov and Hoehndorf 2020, Liu et al. 2021, 2022b).

Due to the class imbalance in the experimental datasets, we evaluate prediction performance under two scenarios, similar to previous works (Liu et al. 2022b): (i) macro-averaged, denoted as M, where metrics are calculated for each term and then averaged, and (ii) micro-averaged, denoted as m, where the predicted score matrix over all gene–phenotype term pairs is vectorized and metrics are calculated on the resulting vector. The best result is displayed in bold, while the second-best is underlined.
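
The two averaging scenarios can be illustrated with scikit-learn's `average_precision_score` (an AUPR estimate) on a toy gene-by-term matrix; the labels and scores below are made up:

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Toy 3-gene x 3-term prediction problem.
# macro: compute AUPR per phenotype term (column), then average.
# micro: flatten the whole score matrix and compute one global AUPR.
y_true = np.array([[1, 0, 0],
                   [1, 1, 0],
                   [0, 1, 1]])
y_score = np.array([[0.9, 0.2, 0.5],
                    [0.8, 0.7, 0.3],
                    [0.1, 0.6, 0.4]])

macro = average_precision_score(y_true, y_score, average="macro")
micro = average_precision_score(y_true, y_score, average="micro")
```

Macro averaging weights every term equally, so rarely annotated terms (the "11–30" group) influence it as much as common ones; micro averaging is dominated by the heavily annotated terms.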

3.2 Prediction performance with baseline methods

We compare our method with six state-of-the-art baselines: the deep learning-based methods GraphPheno (Liu et al. 2022c), HPODNets (Liu et al. 2022b), and HPOFiller (Liu et al. 2021); two classical methods for gene–phenotype association prediction, tlDLP (Petegrosso et al. 2017) and BiRW (Xie et al. 2015); and a variant of our own method, SSLpheno_GAE, which employs a GAE for pre-training. To demonstrate the models' generalization ability, we compare prediction performance on the two datasets described in Section 2.1. For a fair comparison between SSLpheno and the baseline methods, we use 5-fold cross-validation (CV). We obtain and preprocess the data for the baseline methods according to the descriptions in their papers.
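
The CV protocol can be sketched with scikit-learn's `KFold`; the gene indices below are placeholders for the real gene list:

```python
import numpy as np
from sklearn.model_selection import KFold

# 5-fold CV: the annotated genes are partitioned into five folds, and
# each fold is held out once as the test set while the rest train.
genes = np.arange(3709)  # e.g. the 3709 HPO_2020 genes
kf = KFold(n_splits=5, shuffle=True, random_state=0)
folds = list(kf.split(genes))  # list of (train_idx, test_idx) pairs
```

Each gene appears in exactly one test fold, so every prediction in the reported tables is made for genes unseen during that fold's training.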

Prediction performance comparisons between SSLpheno and the baseline methods under macro-averaged metrics are presented in Table 2, and those under micro-averaged metrics in Table 3. Overall, SSLpheno achieves superior results on both the HPO_2020 and DisGeNET datasets. SSLpheno_GAE and SSLpheno, which adopt a pre-training strategy, achieve better results than HPODNets and HPOFiller, which are trained end-to-end. This indicates that pre-training can alleviate the impact of data sparsity on model performance in the gene–phenotype prediction problem. Additionally, compared with GraphPheno, which uses sequence features, our model uses GO annotations as features and achieves better results. Furthermore, the graph-based methods HPODNets and HPOFiller generally outperform the network propagation-based methods tlDLP and BiRW. It is worth mentioning that BiRW, which utilizes GO as a cross-domain knowledge source, achieves better performance than GraphPheno. This also illustrates that GO contributes to the prediction of gene–phenotype associations. These experimental results indicate that the SSL strategy our model employs is well suited to predicting gene–phenotype associations. On DisGeNET, whose phenotype annotation distribution is extremely imbalanced (Supplementary Fig. S1), our model still achieves the best prediction performance.

Table 2. Performance of CV under macro-averaged metrics.

Dataset     Method         Term (11–30)    Term (31–100)   Term (101–300)  Term (>300)     All
                           AUPR    Fmax    AUPR    Fmax    AUPR    Fmax    AUPR    Fmax    AUPR    Fmax
HPO_2020    HPODNets       0.290   0.383   0.318   0.410   0.360   0.418   0.538   0.531   0.337   0.413
            GraphPheno     0.167   0.158   0.191   0.217   0.230   0.260   0.423   0.518   0.206   0.326
            HPOFiller      0.257   0.265   0.271   0.300   0.312   0.332   0.494   0.503   0.288   0.356
            tlDLP          0.141   0.147   0.156   0.192   0.197   0.242   0.388   0.494   0.174   0.383
            BiRW           0.173   0.187   0.193   0.225   0.245   0.281   0.440   0.520   0.211   0.411
            SSLpheno_GAE   0.299   0.385   0.322   0.414   0.350   0.416   0.504   0.521   0.337   0.414
            SSLpheno       0.332   0.402   0.339   0.415   0.370   0.432   0.529   0.531   0.350   0.427
DisGeNET    HPODNets       0.200   0.288   0.216   0.318   0.247   0.333   0.404   0.431   0.254   0.334
            GraphPheno     0.156   0.231   0.177   0.279   0.132   0.256   0.313   0.339   0.191   0.299
            HPOFiller      0.063   0.125   0.090   0.174   0.149   0.236   0.331   0.377   0.113   0.185
            tlDLP          0.165   0.221   0.187   0.276   0.200   0.299   0.365   0.400   0.201   0.288
            BiRW           0.088   0.200   0.208   0.281   0.187   0.293   0.356   0.377   0.199   0.301
            SSLpheno_GAE   0.211   0.291   0.184   0.296   0.226   0.302   0.377   0.414   0.225   0.303
            SSLpheno       0.225   0.292   0.243   0.342   0.255   0.343   0.407   0.436   0.275   0.348
Table 3. Performance of CV under micro-averaged metrics.

Dataset     Method         Term (11–30)    Term (31–100)   Term (101–300)  Term (>300)     All
                           AUPR    Fmax    AUPR    Fmax    AUPR    Fmax    AUPR    Fmax    AUPR    Fmax
HPO_2020    HPODNets       0.203   0.283   0.289   0.342   0.351   0.379   0.585   0.548   0.412   0.441
            GraphPheno     0.135   0.213   0.153   0.224   0.219   0.268   0.533   0.545   0.396   0.422
            HPOFiller      0.189   0.255   0.244   0.301   0.306   0.342   0.573   0.556   0.435   0.453
            tlDLP          0.071   0.136   0.119   0.245   0.179   0.287   0.470   0.510   0.310   0.389
            BiRW           0.112   0.201   0.165   0.263   0.234   0.284   0.521   0.536   0.363   0.407
            SSLpheno_GAE   0.225   0.305   0.291   0.352   0.335   0.379   0.589   0.574   0.474   0.496
            SSLpheno       0.250   0.330   0.297   0.350   0.336   0.386   0.596   0.576   0.484   0.505
DisGeNET    HPODNets       0.121   0.206   0.175   0.257   0.229   0.295   0.481   0.467   0.339   0.396
            GraphPheno     0.069   0.157   0.112   0.199   0.145   0.236   0.398   0.389   0.478   0.481
            HPOFiller      0.017   0.056   0.050   0.113   0.122   0.198   0.432   0.442   0.313   0.358
            tlDLP          0.073   0.167   0.115   0.204   0.157   0.245   0.403   0.396   0.468   0.470
            BiRW           0.089   0.176   0.144   0.210   0.189   0.255   0.436   0.487   0.453   0.449
            SSLpheno_GAE   0.091   0.190   0.165   0.242   0.177   0.254   0.559   0.527   0.498   0.491
            SSLpheno       0.136   0.213   0.180   0.251   0.233   0.291   0.574   0.539   0.515   0.504

The HPO database provides data versions at different points in time, which we use for temporal validation to assess the performance of our model. The benchmark dataset is created using the method proposed in CAFA2 (Jiang et al. 2016), which includes genes that have no annotations in the HPO ontology at time t0 but accumulate at least one phenotype term with experimental validation between t0 and t1. According to this criterion, the training set consists of gene–phenotype associations released by the HPO database on 15 November 2019, while the test set consists of new associations added from 15 November 2019 to 25 August 2020. The distributions of the dataset are shown in Supplementary Fig. S2.

For temporal validation, we report macro- and micro-averaged results separately, as shown in Tables 4 and 5, respectively. SSLpheno achieves the best results in almost all cases, implying that our model is particularly effective for gene–phenotype association prediction on sparsely annotated data. Temporal validation is effectively a de novo test, and all methods perform worse than under cross-validation, which may be caused by the incompleteness of annotations. We further discuss the prediction of new genes/phenotypes in the case studies of Section 3.4.
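As a concrete reference for the micro-averaged Fmax used throughout these tables, here is a minimal pure-Python sketch; the threshold grid and input layout are our own assumptions (macro averaging would instead compute an F-score per term and average over terms):

```python
def micro_fmax(y_true, y_score, thresholds=None):
    """Micro-averaged Fmax: maximize F1 over a grid of score thresholds,
    pooling TP/FP/FN over all gene-term pairs.

    y_true: list of per-gene 0/1 label lists; y_score: matching scores.
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 100)]
    best = 0.0
    for t in thresholds:
        tp = fp = fn = 0
        for labels, scores in zip(y_true, y_score):
            for y, s in zip(labels, scores):
                pred = s >= t
                if pred and y:
                    tp += 1
                elif pred:
                    fp += 1
                elif y:
                    fn += 1
        if tp == 0:
            continue  # F1 undefined or zero at this threshold
        p = tp / (tp + fp)
        r = tp / (tp + fn)
        best = max(best, 2 * p * r / (p + r))
    return best
```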

Table 4.

Performance of temporal validation under macro-averaged metrics (each cell shows AUPR / Fmax).

Methods | Term (11–30) | Term (31–100) | Term (101–300) | Term (>300) | All
HPODNets | 0.097 / 0.138 | 0.084 / 0.137 | 0.101 / 0.170 | 0.220 / 0.306 | 0.105 / 0.176
Graphpheno | 0.075 / 0.101 | 0.076 / 0.125 | 0.082 / 0.164 | 0.199 / 0.290 | 0.102 / 0.165
HPOFiller | 0.088 / 0.103 | 0.083 / 0.130 | 0.097 / 0.164 | 0.205 / 0.299 | 0.101 / 0.169
tlDLP | 0.065 / 0.079 | 0.066 / 0.106 | 0.067 / 0.123 | 0.155 / 0.199 | 0.100 / 0.144
BiRW | 0.073 / 0.099 | 0.065 / 0.104 | 0.079 / 0.135 | 0.164 / 0.188 | 0.095 / 0.144
SSLpheno_GAE | 0.101 / 0.134 | 0.092 / 0.147 | 0.104 / 0.185 | 0.211 / 0.315 | 0.115 / 0.175
SSLpheno | 0.108 / 0.138 | 0.093 / 0.150 | 0.111 / 0.189 | 0.244 / 0.324 | 0.121 / 0.182
Table 5.

Performance of temporal validation under micro-averaged metrics (each cell shows AUPR / Fmax).

Methods | Term (11–30) | Term (31–100) | Term (101–300) | Term (>300) | All
HPODNets | 0.038 / 0.076 | 0.042 / 0.090 | 0.073 / 0.124 | 0.281 / 0.350 | 0.152 / 0.240
Graphpheno | 0.021 / 0.063 | 0.038 / 0.089 | 0.062 / 0.109 | 0.267 / 0.336 | 0.147 / 0.216
HPOFiller | 0.030 / 0.072 | 0.040 / 0.089 | 0.069 / 0.116 | 0.275 / 0.347 | 0.150 / 0.229
tlDLP | 0.022 / 0.063 | 0.035 / 0.077 | 0.054 / 0.106 | 0.205 / 0.268 | 0.122 / 0.172
BiRW | 0.030 / 0.068 | 0.033 / 0.080 | 0.055 / 0.098 | 0.201 / 0.256 | 0.129 / 0.193
SSLpheno_GAE | 0.031 / 0.072 | 0.046 / 0.092 | 0.072 / 0.128 | 0.303 / 0.367 | 0.199 / 0.291
SSLpheno | 0.032 / 0.079 | 0.047 / 0.093 | 0.075 / 0.129 | 0.312 / 0.373 | 0.213 / 0.300

3.3 Ablation study

3.3.1 Comparison of different components

To evaluate the contribution of each component, we implement ablated variants that each omit one component and compare them with the full SSLpheno for gene–phenotype association prediction; the experimental results are presented in Table 6. The results show that every component of SSLpheno contributes to the phenotype prediction task. Specifically, applying the self-supervised pre-training strategy after either data preprocessing or feature smoothing leads to a substantial improvement in prediction performance. Moreover, we observe that the filter's effect is limited on high-dimensional sparse features, and that PCA is an effective dimensionality reduction step for SSLpheno.
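The Laplacian-based feature smoothing ablated here can be illustrated with a small sketch. The low-pass filter form (I − k·L_sym) on a self-looped, symmetrically normalized adjacency and the value of k are assumptions for illustration, not necessarily the paper's exact formulation:

```python
import math

def laplacian_smooth(adj, feats, k=1.0, layers=2):
    """Apply X <- (I - k * L_sym) X 'layers' times, where L_sym is the
    symmetric normalized Laplacian of the graph with self-loops added.
    Smoothing pulls the features of neighboring nodes closer together."""
    n = len(adj)
    # A~ = A + I (self-loops); D~ = row degrees of A~
    a = [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    d = [sum(row) for row in a]
    # G = I - k*L_sym = (1 - k)*I + k * D~^{-1/2} A~ D~^{-1/2}
    g = [[(1 - k) * (1 if i == j else 0)
          + k * a[i][j] / math.sqrt(d[i] * d[j])
          for j in range(n)] for i in range(n)]
    x = [row[:] for row in feats]
    for _ in range(layers):
        x = [[sum(g[i][l] * x[l][j] for l in range(n))
              for j in range(len(x[0]))] for i in range(n)]
    return x
```

In the full pipeline this smoothing is applied to PCA-reduced features before the self-supervised pre-training step.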

Table 6.

Phenotype prediction performance with different components.

Component | M-aupr | M-Fmax | m-aupr | m-Fmax
PCA | 0.223 | 0.295 | 0.320 | 0.400
PCA+SSL | 0.264 | 0.333 | 0.365 | 0.414
Filter+SSL | 0.303 | 0.382 | 0.454 | 0.468
PCA+Filter+SSL | 0.350 | 0.427 | 0.484 | 0.505

3.3.2 Comparison of different GGAs

Different GGAs affect the prediction performance of the model (Wang et al. 2017). In addition to STRING, we employ two other widely used GGA networks, HumanNet (Kim et al. 2022) and GeneMANIA (Franz et al. 2018), to evaluate the impact of different GGAs on our model. The attributed networks are constructed following the approach described in Section 2.2.1. We use SSLpheno to make predictions based on the GGAs built from the three different PPI networks and compare the results. Furthermore, we concatenate the features obtained by pre-training on the three GGAs and feed them into the downstream classifier. The prediction performance with different GGAs is shown in Table 7.

Table 7.

Phenotype prediction performance with different GGAs.

GGAs | M-aupr | M-Fmax | m-aupr | m-Fmax
HumanNet | 0.314 | 0.412 | 0.411 | 0.462
GeneMANIA | 0.207 | 0.281 | 0.348 | 0.409
STRING | 0.350 | 0.427 | 0.484 | 0.505
Combined | 0.320 | 0.395 | 0.460 | 0.488

The results indicate that SSLpheno with STRING achieves the best performance, HumanNet performs somewhat worse, and GeneMANIA performs worst. This finding highlights the importance of GGA network quality for gene–phenotype association prediction. Although some studies (Liu et al. 2020, 2022b) have shown that combining multiple networks can improve the predictive power of their models, our feature-merging strategy does not achieve optimal results. The differences among the three networks are likely due to how the databases are generated: the associations in the STRING database are mainly derived from high-throughput experimental data (Szklarczyk et al. 2021), whereas the other two databases are generated algorithmically (Franz et al. 2018, Kim et al. 2022). Additionally, our findings suggest that simple concatenation of features may introduce noise and negatively impact downstream tasks.

To explore the embeddings produced by SSLpheno with different GGAs, we project these vector representations into a 2D space using t-SNE and visualize them in Supplementary Fig. S4, highlighting the genes associated with HP: 0002206 as red dots. The features from STRING appear somewhat more concentrated than those from HumanNet, whereas the features produced from GeneMANIA are scattered across the whole space. This is consistent with the results in Table 7 and suggests that the STRING-derived features are sufficiently discriminative to distinguish the associated genes.

3.3.3 Comparison of different classifiers

To select a high-performance classifier for the downstream task, we conduct experiments comparing different classifiers and evaluating their impact on SSLpheno. To ensure a fair comparison, we keep the network construction and SSL components of the model unchanged. Apart from the DNN used in SSLpheno, we introduce three commonly used machine-learning methods: Support Vector Machine (SVM), Random Forest (RF), and Logistic Regression (LR). Table 8 lists the 5-fold CV results. DNN achieves the best results under all four metrics, with SVM ranking second. Notably, our method outperforms the baselines in Table 2 with every classifier, indicating that the pre-training strategy of our model is effective for gene–phenotype association prediction.

Table 8.

Phenotype prediction performance with different classifiers.

Classifiers | M-aupr | M-Fmax | m-aupr | m-Fmax
SVM | 0.341 | 0.379 | 0.479 | 0.482
RF | 0.326 | 0.354 | 0.420 | 0.447
LR | 0.290 | 0.305 | 0.389 | 0.468
DNN | 0.350 | 0.427 | 0.484 | 0.505

3.4 Case studies

To demonstrate the capability of SSLpheno for predicting gene–phenotype associations in practical applications, we validate its performance by conducting case studies from two perspectives.

3.4.1 Case study on predicted phenotypes related to genes

By comparing the similarity between the predicted phenotypes of a gene and disease phenotypes, we can prioritize genetic variations that lead to diseases or further predict relationships between genes and diseases, thus improving the identification of pathogenic variations in a clinical setting. In this case study, our primary objective is to assess whether the model-generated phenotype annotations are suitable for gene–disease association prediction. We select three genes, CACNA2D1, FBXW4, and RAB11A, for this case study. CACNA2D1, associated with 18 phenotypes in the 2020 database version, has been shown to encode one of the α2δ subunits, and its high expression is closely associated with poor prognosis in ovarian cancer, lung cancer, and hepatocellular carcinoma (Inoue et al. 2022). FBXW4, a gene newly added in the 2023 HPO, is associated with acute myeloid leukemia and serves as a reliable predictor of chemotherapy sensitivity (Zhang et al. 2020). RAB11A encodes a protein belonging to the Rab family of small GTPases and plays an important role in human carcinogenesis (Wang et al. 2022b); it is a de novo gene with no phenotype annotations registered in HPO. Each of these three genes has the highest predicted score in its category.

We use Phenomizer (Köhler et al. 2009), a popular disease-ranking tool, to evaluate our predictions. Table 9 lists the candidate diseases recommended by Phenomizer along with supporting evidence. SSLpheno predicts that CACNA2D1 is associated with 36 of the 9226 phenotypes. The predicted phenotypes are then used to predict related diseases, and the disease with the highest prediction score is epilepsy, as shown in Table 9. This result is confirmed by Dahimene et al. (2022), who demonstrated that biallelic CACNA2D1 variants cause a phenotypic spectrum ranging from congenital ataxia with cerebellar vermian atrophy on brain imaging to cerebellar atrophy and developmental and epileptic encephalopathies. SSLpheno predicts that FBXW4 is associated with 235 phenotypes, which are most similar to the disease "split hand/foot malformation" (SHFM). Qiu et al. (2022) provided evidence, using trio clinical exome sequencing, that abnormal FBXW4 expression can cause SHFM. SSLpheno's predictions suggest that RAB11A is associated with mental developmental disorders; however, we found no direct evidence for this in the literature, which has only shown that RAB11A is a key gene associated with Parkinson's disease (Yin et al. 2021). This illustrates the difficulty of making accurate predictions for genes without prior knowledge.

Table 9.

Gene–disease annotations calculated with the phenotypes predicted by SSLpheno.

Gene | Disease ID | Disease name | Score (P < 0.05) | Evidence
CACNA2D1 | OMIM: 300491 | Epilepsy | 2.29 | (Dahimene et al. 2022)
FBXW4 | ORPHANET: 2440 | Split hand/split foot malformation | 2.47 | (Qiu et al. 2022)
RAB11A | OMIM: 30080 | Mental retardation | 0.49 | (Hill et al. 2019)

3.4.2 False positive analysis of predicted genes related to phenotype terms

Since known gene–phenotype associations are seriously incomplete, some of the false positives we predict may actually be correct associations. We therefore analyze the false-positive results predicted by SSLpheno. "Pneumonia" (HP: 0002090) is a clinical symptom of lower respiratory tract infection. Coronavirus disease 2019, which has attracted close scientific attention over the past 2 years, has "pneumonia" as one of its main clinical phenotypes (Wang et al. 2023a). "Pneumonia" is annotated with 88 genes in the HPO database as of August 2020. SSLpheno predicts 136 genes, of which 65 are true positives and 71 are false positives. Table 10 lists the top 15 predicted false-positive genes for "pneumonia" (HP: 0002090). We investigate whether these predictions are reported in the latest HPO version, GWAS studies, or PubMed. Eleven of the predicted genes have supporting evidence: three are newly added to HPO in 2023, four associations can be found in the GWAS Catalog, and ten are supported by PubMed literature.

Table 10.

The top 15 predicted false positive genes of pneumonia with supporting literature.

Rank | Gene name | Evidence
1 | IL12RB1 | PMID: 34306143 (Xu et al. 2021)
2 | HMBS |
3 | ACE2 | PMID: 35241825 (Horowitz et al. 2022)
4 | IL-7 | PMID: 32728202 (Monneret et al. 2020)
5 | IREB2 | PMID: 33304392 (Zeng et al. 2020)
6 | HPGD | PMID: 35180297 (Feitosa et al. 2022)
7 | DNAL4 |
8 | ANO3 | PMID: 35513865 (Tang et al. 2022)
9 | DAB1 | PMID: 33888907 (Shelton et al. 2021)
10 | KREMEN1 |
11 | SEC24D |
12 | MTM1 | HPO 2023
13 | FOCAD | PMID: 25637605 (Pouwels et al. 2015)
14 | TYRP1 | PMID: 35910207 (Zecevic et al. 2022)
15 | IREB2 | PMID: 34340725 (Campos et al. 2021)

Furthermore, we conduct the same analysis on "Abnormality of cardiovascular system morphology" (ACSM, HP: 0030680). ACSM is a common phenotype of many complex diseases and has 1478 annotated genes in the HPO as of August 2020. We obtain 1896 genes related to ACSM, of which 1088 are true positives and 812 are false positives. Applying the same case-study procedure as for "pneumonia" yields the supporting evidence shown in Supplementary Table S2. Among the top 15 genes identified as false positives, we find supporting evidence for seven. "Angiogenic factor with G-patch and FHA domain 1" (AGGF1) encodes a novel angiogenic factor discovered through the study of human congenital venous malformation limb hypertrophy syndrome; it is the only one of the seven listed in HPO. The other six genes, although not present in HPO, have been shown to be closely related to ACSM. This further confirms that gene–phenotype associations in HPO are incomplete.
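The true/false-positive bookkeeping used in these case studies can be sketched as follows; the function and variable names are hypothetical:

```python
def false_positive_report(scores, known, top_k=15):
    """Split predicted genes for one phenotype term into true and false
    positives against current HPO annotations, and rank the false
    positives by score for manual literature checks.

    scores: dict gene -> prediction score; known: set of annotated genes.
    """
    predicted = set(scores)
    tp = predicted & set(known)
    fp = predicted - set(known)
    fp_ranked = sorted(fp, key=lambda g: scores[g], reverse=True)[:top_k]
    return sorted(tp), fp_ranked
```

For "pneumonia", this kind of split yields the 65 true positives and 71 false positives analyzed above, with the top-ranked false positives listed in Table 10.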

4 Conclusion

In this article, we introduce a pre-training model, SSLpheno, which is trained in a self-supervised manner on an attributed graph constructed from PPIs and gene ontology annotations and can predict disease phenotypes for genes lacking prior knowledge. We design a pretext task that selects positive and negative gene pairs based on cosine similarity, reconstructs their labels, and trains an optimal feature representation for downstream phenotype prediction. The experimental results demonstrate that our proposed model outperforms state-of-the-art models. Further ablation studies suggest that each component of SSLpheno is necessary for the task, and multiple case studies from different perspectives demonstrate its practicality.
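The cosine-similarity pair labeling summarized above can be sketched as follows; the similarity thresholds are illustrative placeholders, not the paper's tuned values:

```python
import math

def cosine(u, v):
    """Cosine similarity of two feature vectors (0.0 if either is zero)."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def select_pairs(feats, pos_th=0.9, neg_th=0.1):
    """Pretext-task labeling: node pairs with high cosine similarity become
    positive training pairs, low-similarity pairs become negatives, and
    pairs in between are left unlabeled."""
    pos, neg = [], []
    n = len(feats)
    for i in range(n):
        for j in range(i + 1, n):
            s = cosine(feats[i], feats[j])
            if s >= pos_th:
                pos.append((i, j))
            elif s <= neg_th:
                neg.append((i, j))
    return pos, neg
```

The quadratic pair enumeration here is exactly the all-pairs similarity cost flagged as a scalability limitation below.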

Our work shows the advantage of function-enriched gene feature representations for phenotype prediction. However, during auto-encoding, computing the similarity between all pairs of nodes can be computationally expensive for large networks. Overall, our proposed SSL method for phenotype prediction has the potential to meet the increasing demand for annotation data. In the future, we plan to explore methods that utilize existing annotation data to improve gene–phenotype prediction performance, providing new insights into the interpretation of genetic heterogeneity in diseases. We also aim to develop principled multi-source network feature fusion methods for gene–phenotype prediction.

Conflict of interest

None declared.

Funding

This work was supported by the National Key Research and Development Program of China [grant number 2021YFF1201200]; the National Natural Science Foundation of China [grant numbers 62350004, U1909208, 61972423]; the Science and Technology Major Project of Changsha (No. kh2202004); and the Key Research and Development Program of Xinjiang Uygur Autonomous Region [grant number 2022B03023]. This work was carried out in part using computing resources at the High Performance Computing Center of Central South University.

Data availability

The code and datasets are available at https://github.com/bixuehua/SSLpheno.

References

Alghamdi SM, Schofield PN, Hoehndorf R et al. Contribution of model organism phenotypes to the computational identification of human disease genes. Dis Model Mech 2022;15:dmm049441.
Bastarache L, Denny JC, Roden DM et al. Phenome-wide association studies. JAMA 2022;327:75–6.
Beck T, Shorter T, Brookes AJ et al. GWAS Central: a comprehensive resource for the discovery and comparison of genotype and phenotype data from genome-wide association studies. Nucleic Acids Res 2020;48:D933–40.
Bone WP, Washington NL, Buske OJ et al. Computational evaluation of exome sequence data using human and model organism phenotypes improves diagnostic efficiency. Genet Med 2016;18:608–17.
Campos AI, Kho P, Vazquez-Prada KX et al. Genetic susceptibility to pneumonia: a GWAS meta-analysis between the UK Biobank and FinnGen. Twin Res Hum Genet 2021;24:145–54.
Chen J, Althagafi A, Hoehndorf R et al. Predicting candidate genes from phenotypes, functions and anatomical site of expression. Bioinformatics 2021;37:853–60.
Claussnitzer M, Cho JH, Collins R et al. A brief history of human disease genetics. Nature 2020;577:179–89.
Cui G, Zhou J, Yang C et al. Adaptive graph encoder for attributed graph embedding. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Association for Computing Machinery, New York, USA, Jul 6–10, 2020.
Dahimene S, von Elsner L, Holling T et al. Biallelic CACNA2D1 loss-of-function variants cause early-onset developmental epileptic encephalopathy. Brain 2022;145:2721–9.
Fan K, Guan Y, Zhang Y et al. Graph2GO: a multi-modal attributed network embedding method for inferring protein functions. Gigascience 2020;9:giaa081.
Feitosa MF, Wojczynski MK, Anema JA et al. Genetic pleiotropy between pulmonary function and age-related traits: the long life family study. J Gerontol A 2022:glac046.
Franz M, Rodriguez H, Lopes C et al. GeneMANIA update 2018. Nucleic Acids Res 2018;46:W60–4.
Guala D, Ogris C, Müller N et al. Genome-wide functional association networks: background, data & state-of-the-art resources. Brief Bioinform 2020;21:1224–37.
Hill WD, Marioni RE, Maghzian O et al. A combined analysis of genetically correlated traits identifies 187 loci and a role for neurogenesis and myelination in intelligence. Mol Psychiatry 2019;24:169–81.
Hoehndorf R, Schofield PN, Gkoutos GV et al. PhenomeNET: a whole-phenome approach to disease gene discovery. Nucleic Acids Res 2011;39:e119.
Horowitz JE, Kosmicki JA, Damask A et al.; Regeneron Genetics Center. Genome-wide analysis provides genetic evidence that ACE2 influences COVID-19 risk and yields risk scores associated with severe disease. Nat Genet 2022;54:382–92.
Hu L, Wang X, Huang Y-A et al. A survey on computational models for predicting protein–protein interactions. Brief Bioinform 2021;22:bbab036.
Hu W, Liu B, Gomes J et al. Strategies for pre-training graph neural networks. In: 8th International Conference on Learning Representations (ICLR), OpenReview, Apr 26–30, 2020.
Huang X, Li J, Hu X. Accelerated attributed network embedding. In: Proceedings of the 2017 SIAM International Conference on Data Mining (SDM), Texas, USA, Apr 27–29, 2017.
Inoue H, Shiozaki A, Kosuga T et al. Functions and clinical significance of CACNA2D1 in gastric cancer. Ann Surg Oncol 2022;29:4522–35.
Jiang Y, Oron TR, Clark WT et al. An expanded evaluation of protein function prediction methods shows an improvement in accuracy. Genome Biol 2016;17:184.
Kim CY, Baek S, Cha J et al. HumanNet v3: an improved database of human gene networks for disease research. Nucleic Acids Res 2022;50:D632–9.
Kipf TN, Welling M. Variational graph auto-encoders. In: NIPS Workshop on Bayesian Deep Learning, Barcelona, Spain, Dec 10, 2016.
Köhler S, Gargano M, Matentzoglu N et al. The human phenotype ontology in 2021. Nucleic Acids Res 2021;49:D1207–17.
Köhler S, Schulz MH, Krawitz P et al. Clinical diagnostics in human genetics with semantic similarity searches in ontologies. Am J Hum Genet 2009;85:457–64.
Kulmanov M, Hoehndorf R. DeepPheno: predicting single gene loss-of-function phenotypes using an ontology-aware hierarchical classifier. PLoS Comput Biol 2020;16:e1008453.
Kulmanov M, Smaili FZ, Gao X et al. Semantic similarity and machine learning with ontologies. Brief Bioinform 2021;22:bbaa199.
LeCun Y, Bengio Y, Hinton G et al. Deep learning. Nature 2015;521:436–44.
Liu H, Huang Y, Liu X et al. Attention-wise masked graph contrastive learning for predicting molecular property. Brief Bioinform 2022a;23:bbac303.
Liu L, Huang X, Mamitsuka H et al. HPOLabeler: improving prediction of human protein–phenotype associations by learning to rank. Bioinformatics 2020;36:4180–8.
Liu L, Mamitsuka H, Zhu S et al. HPOFiller: identifying missing protein–phenotype associations by graph convolutional network. Bioinformatics 2021;37:3328–36.
Liu L, Mamitsuka H, Zhu S et al. HPODNets: deep graph convolutional networks for predicting human protein–phenotype associations. Bioinformatics 2022b;38:799–808.
Liu Y, He R, Qu Y et al. Integration of human protein sequence and protein-protein interaction data by graph autoencoder to identify novel protein-abnormal phenotype associations. Cells 2022c;11:2485.
Liu Y, Jin M, Pan S et al. Graph self-supervised learning: a survey. IEEE Trans Knowl Data Eng 2022d;35:5879–900.
Luo F, Yang Y, Zhong J et al. Constructing gene co-expression networks and predicting functions of unknown genes by random matrix theory. BMC Bioinformatics 2007;8:299.
Ma Y, Zhang X, Gao C et al. Enhancing recommendations with contrastive learning from collaborative knowledge graph. Neurocomputing 2023;523:103–15.
Monneret G, de Marignan D, Coudereau R et al. Immune monitoring of interleukin-7 compassionate use in a critically ill COVID-19 patient. Cell Mol Immunol 2020;17:1001–3.
Oti M, Brunner HG. The modular nature of genetic diseases. Clin Genet 2007;71:1–11.
Peng J, Wang Y, Guan J et al. An end-to-end heterogeneous graph representation learning-based framework for drug–target interaction prediction. Brief Bioinform 2021;22:bbaa430.
Petegrosso R, Park S, Hwang TH et al. Transfer learning across ontologies for phenome–genome association prediction. Bioinformatics 2017;33:529–36.
Piñero J, Ramírez-Anguita JM, Saüch-Pitarch J et al. The DisGeNET knowledge platform for disease genomics: 2019 update. Nucleic Acids Res 2020;48:D845–55.
Pourreza Shahri M, Kahanda I. Deep semi-supervised learning ensemble framework for classifying co-mentions of human proteins and phenotypes. BMC Bioinformatics 2021;22:500.
Pouwels SD, Heijink IH, Brouwer U et al. Genetic variation associates with susceptibility for cigarette smoke-induced neutrophilia in mice. Am J Physiol Lung Cell Mol Physiol 2015;308:L693–709.
Qiu J, Chen Q, Dong Y et al. GCC: graph contrastive coding for graph neural network pre-training. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Association for Computing Machinery, New York, USA, Jul 6–10, 2020.
Qiu L, Li C, Zheng G et al. Microduplication of BTRC detected in a Chinese family with split hand/foot malformation type 3. Clin Genet 2022;102:451–6.
Ranea JAG, Perkins J, Chagoyen M et al. Network-based methods for approaching human pathologies from a phenotypic point of view. Genes (Basel) 2022;13:1081.
Robinson PN. Deep phenotyping for precision medicine. Hum Mutat 2012;33:777–80.
Schriml LM, Munro JB, Schor M et al. The human disease ontology 2022 update. Nucleic Acids Res 2022;50:D1255–61.
Senior AW, Evans R, Jumper J et al. Improved protein structure prediction using potentials from deep learning. Nature 2020;577:706–10.
Shelton JF, Shastri AJ, Ye C et al.; 23andMe COVID-19 Team. Trans-ancestry analysis reveals genetic and nongenetic associations with COVID-19 susceptibility and severity. Nat Genet 2021;53:801–8.
Smedley D, Oellrich A, Köhler S et al.; Sanger Mouse Genetics Project. PhenoDigm: analyzing curated annotations to associate animal models with human diseases. Database 2013;2013:bat025.
Szklarczyk D, Gable AL, Nastou KC et al. The STRING database in 2021: customizable protein–protein networks, and functional characterization of user-uploaded gene/measurement sets. Nucleic Acids Res 2021;49:10800.
Tang L, Zhong X, Gong H et al. Analysis of the association of ANO3/MUC15, COL4A4, RRBP1, and KLK1 polymorphisms with COPD susceptibility in the Kashi population. BMC Pulm Med 2022;22:178.
Wang C, Pan S, Hu R et al. Attributed graph clustering: a deep attentional embedding approach. In: Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI), AAAI Press, Macao, China, Aug 10–16, 2019.
Wang K, Guo Z, Zeng T et al. Transmission characteristics and inactivated vaccine effectiveness against transmission of SARS-CoV-2 Omicron BA.5 variants in Urumqi, China. JAMA Netw Open 2023a;6:e235755.
Wang P, Lai W-F, Li MJ et al. Inference of gene-phenotype associations via protein-protein interaction and orthology. PLoS One 2013;8:e77478.
Wang Y, Juan L, Chu Y et al. FNSemSim: an improved disease similarity method based on network fusion. In: 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), IEEE Computer Society, Kansas City, MO, USA, Nov 13–16, 2017.
Wang Y, Juan L, Peng J et al. Explore potential disease related metabolites based on latent factor model. BMC Genomics 2022a;23:269.
Wang Y, Liu X, Shen Y et al. Collaborative deep learning improves disease-related circRNA prediction based on multi-source functional information. Brief Bioinform 2023b;24:bbad069.
Wang Y, Ren Y, Li N et al. Rab11a promotes the malignant progression of ovarian cancer by inducing autophagy. Genes Genomics 2022b;44:1375–84.
Washington NL, Haendel MA, Mungall CJ et al. Linking human diseases to animal models using ontology-based phenotype annotation. PLoS Biol 2009;7:e1000247.
Xiang J, Zhang J, Zhao Y et al. Biomedical data, computational methods and tools for evaluating disease–disease associations. Brief Bioinform 2022;23:bbac006.
Xie M, Xu Y, Zhang Y et al. Network-based phenome-genome association prediction by bi-random walk. PLoS One 2015;10:e0125138.
Xu Y, Cai J, Li W et al. Examining the effector mechanisms of the Feishu acupoint (BL13) in the treatment of pneumonia based on systematic acupuncture and moxibustion research. Evid Based Complement Alternat Med 2021;2021:1.
Xue H, Peng J, Shang X et al. Predicting disease-related phenotypes using an integrated phenotype similarity measurement based on HPO. BMC Syst Biol 2019a;13:34.
Xue H, Peng J, Shang X et al. Towards gene function prediction via multi-networks representation learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, Hawaii, USA, Jan 27–Feb 1, 2019.
Yin X, Wang M, Wang W et al. Identification of potential miRNA-mRNA regulatory network contributing to Parkinson's disease. Parkinsons Dis 2021;2022:2877728.
You Y, Shen Y. Cross-modality and self-supervised protein embedding for compound–protein affinity and contact prediction. Bioinformatics 2022;38:ii68–74.
Yuan Q, Chen J, Zhao H et al. Structure-aware protein–protein interaction site prediction using deep graph convolutional network. Bioinformatics 2022a;38:125–32.
Yuan X, Wang J, Dai B et al. Evaluation of phenotype-driven gene prioritization methods for Mendelian diseases. Brief Bioinform 2022b;23:bbac019.
Yuan Y, Bar-Joseph Z. Deep learning for inferring gene relationships from single-cell expression data. Proc Natl Acad Sci USA 2019;116:27151–8.
Zecevic M, Kotur N, Ristivojevic B et al. Genome-wide association study of COVID-19 outcomes reveals novel host genetic risk loci in the Serbian population
.
Front Genet
2022
;
13
:
911010
.

Zeng
Q
,
Chen
Q
,
Zou
D
et al.
Different associations between the IREB2 variants and chronic obstructive pulmonary disease susceptibility
.
Front Genet
2020
;
11
:
598053
.

Zhang
X
,
Wang
X
,
Shivashankar
GV
et al.
Graph-based autoencoder integrates spatial transcriptomics with chromatin images and identifies joint biomarkers for Alzheimer’s disease
.
Nat Commun
2022
;
13
:
7480
.

Zhang
Y
,
Sun
L
,
Wang
X
et al.
FBXW4 acts as a protector of FOLFOX-based chemotherapy in metastatic colorectal cancer identified by co-expression network analysis
.
Front Genet
2020
;
11
:
113
.

Zheng
Z
,
Tan
Y
,
Wang
H
et al.
CasANGCL: pre-training and fine-tuning model based on cascaded attention network and graph contrastive learning for molecular property prediction
.
Brief Bioinform
2023
;
24
:
bbac566
.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Associate Editor: Pier Luigi Martelli