-
PDF
- Split View
-
Views
-
Cite
Cite
Yanhong Huang, Xiao Chang, Yu Zhang, Luonan Chen, Xiaoping Liu, Disease characterization using a partial correlation-based sample-specific network, Briefings in Bioinformatics, Volume 22, Issue 3, May 2021, bbaa062, https://doi.org/10.1093/bib/bbaa062
- Share Icon Share
Abstract
A single-sample network (SSN) is a biological molecular network constructed from single-sample data given a reference dataset and can provide insights into the mechanisms of individual diseases and aid in the development of personalized medicine. In this study, we proposed a computational method, a partial correlation-based single-sample network (P-SSN), which not only infers a network from each single-sample data given a reference dataset but also retains the direct interactions by excluding indirect interactions (https://github.com/hyhRise/P-SSN). By applying P-SSN to analyze tumor data from the Cancer Genome Atlas and single cell data, we validated the effectiveness of P-SSN in predicting driver mutation genes (DMGs), producing network distance, identifying subtypes and further classifying single cells. In particular, P-SSN is highly effective in predicting DMGs based on single-sample data. P-SSN is also efficient for subtyping complex diseases and for clustering single cells by introducing network distance between any two samples.
Introduction
Personalized medicine allows a patient to receive specific treatments designed for his or her individual disease, thus improving the efficiency of the health system [1]. The key point of personalized medicine is to clarify the patient/individual-specific molecular drivers of disease, which are caused by the dysfunction of individual-specific networks rather than the malfunction of single genes [2, 3]. Network medicine, which applies the theories and methods of complex networks to the study of disease and the development of medical drugs [3–5], can identify the dysfunctional interactions or disordered regulation involved in diseases from a system’s viewpoint. However, most molecular networks are constructed from multisample data (or existing databases) rather than from individual samples, thus ignoring patient/individual-specific information, and are not suitable for personalized medicine. In this study, we propose a novel method using partial-correlation based on a single-sample network or sample-specific network (P-SSN), to construct sample-specific network with direct interactions based on partial correlation coefficients (PTCCs) for individual sample data. In actuality, previously described methods can be used to construct an SSN by estimating the perturbations of Pearson’s correlation coefficient (PCC) for each pair of genes in a single sample based on a reference sample set [6], but these methods cannot distinguish direct and indirect interactions.
Compared with the previously described SSN, the P-SSN method can not only construct an SSN from single-sample data based on a reference dataset, but also exclude indirect/cascaded interactions from the SSN. The DERA method [7] integrates expression data with biological network information at a single-sample level. The sample-specific networks based on DERA have been subsequently used to discover samples with similar molecular functions by the identification of regulations that are shared between samples or are specific for a subgroup [7]. The Edge-markers method identifies a set of differentially correlated gene pairs by calculating the difference between a normal network from a group of normal samples and a disease network from a group of tumor samples [8]. Then Edge-markers ‘expressions’ are extracted from a new single sample to predict the phenotype of that sample. This method does not construct a sample-specific network of a single sample. The LIONESS method is used to construct a sample-specific network by comparing a network deduced from a population to a related network for the population missing from a single sample, but for each new test sample beyond the original m samples, it requires the whole computation to be redone and, thus, is not suitable for clinic applications [9]. Moreover, the PEEP method shares some similarities to our approach and it determines a single-sample profile by comparison with a reference set, but this method only uses perturbation expression profiles of genes and ignores the regulations between genes [10].
P-SSN can be used to predict potential individual-specific driver mutation genes (DMGs) based on single-sample data given a reference dataset. P-SSN can also distinguish tumor samples from normal samples and can cluster the differential types of tumor samples. In addition, the network distance between P-SSNs can be used as a new similarity measure between samples to improve the accuracy of a clustering algorithm for identifying the potential subtypes in complex tumor samples. Moreover, the single-cell gene expression landscape can provide important biological insights into the processes driving development or disease [11]. A partial correlation-based cell-specific network (P-CSN) can be constructed based on the procedures used to create a P-SSN. Using the analyses of eight scRNA-seq datasets [12–19], we found that the single-cell clustering by network distance between P-CSNs obtained better results compared to other scRNA-seq clustering methods.
Results
A PTCC is a statistical coefficient that measures the direct association between two genes by removing the effect of other genes (Figure 1A). A single-sample PTCC—which is derived from PTCC (Figure 1A)—can be used to test the direct correlation between two genes in a single sample based on a reference sample set (see Figure 1 and Materials and Methods section). For a single sample, the P-SSN can be constructed by linking all gene pairs that have high sample-specific PTCC (sPTCC) between genes (see Figures 1 and 2 and Materials and Methods section).

Framework for the sample-specific partial correlation coefficient (PTCC). (A) The PTCC between genes X and Y given a control gene Z is the correlation between residuals ex and ey that result from the linear regression of gene X with Z and of gene Y with Z, respectively. For a single sample d, PTCC is the correlation between residuals (ex, exd) and (ey, eyd), exd and eyd are the residuals of points (zd, xd) and (zd, yd) to the regression lines of X with Z and of Y with Z, respectively. (B) The sample-specific PTCC (sPTCC) can be defined as |$sPTCC( XY|Z)={PTCC}_{m+1}( XY|Z)-{PTCC}_m( XY|Z)$|, and the statistical significance of sPTCC of an edge can be calculated by the sample-specific network method.

Flowchart for building a partial correlation-based sample-specific network. (A) For a set of reference samples (m samples), the background network was composed of edges formed by interactions with high correlation coefficients (the threshold of high correlation is set to 0.7). The reference network can be built by calculating the PTCCm between any two genes gi and gj controlled by gk that connected with both gi and gj on the background network. Usually, the reference network contains the common attributes of reference samples. (B) A new single sample d is added to this reference samples, the perturbed network was built by the PTCCm + 1 between any two genes on the background. The difference between the perturbed network and reference network is due to the new sample d added to the reference samples. (C) The sample-specific network is built by the significant difference of the corresponding edge between the perturbed network and reference network in terms of PTCC, i.e. |$sPTCC\Big( XY|Z\Big)={PTCC}_{m+1}\Big( XY|Z\Big)-{PTCC}_m\Big( XY|Z\Big)$| for each edge. If the sPTCC of an edge is statistically significant, the edge would be retained in the P-SSN for the single sample d. Each edge in the P-SSN represents unique or specific characteristics of the single sample.
Data availability
The expression profiles from five cancer datasets—breast invasive carcinoma (BRCA), lung adenocarcinoma (LUAD), uterine corpus endometrial carcinoma (UCEC), liver hepatocellular carcinoma (LIHC) and liver hepatocellular carcinoma (KIRC)—were downloaded from the Cancer Genome Atlas (TCGA) (http://cancergenome.nih.gov/) database (Supplementary Table S1). Tumor-adjacent tissue samples were considered as normal samples to be used as reference samples for each kind of cancer. The existing cancer-related genes were downloaded from the Cancer Gene Census (CGC, http://cancer.sanger.ac.uk/cosmic/census) database. Information on the biological pathway was obtained from the Kyoto Encyclopedia of Genes and Genomes (KEGG, https://www.kegg.jp/kegg/pathway.html) database. Eight public single-cell RNA-Seq (scRNA-Seq) datasets were downloaded from the Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/), ArrayExpress (https://www.ebi.ac.uk/arrayexpress/) and Sequence Read Archive (https://www.ncbi.nlm.nih.gov/sra) databases (Supplementary Information).
Validation of P-SSN on individual features by topological distance, functional distance and cancer gene enrichment
A P-SSN contains individual-specific information from a single sample, and it can be validated using a sample’s somatic mutation genes (SMGs), which is another form of individual-specific information. Therefore, the association between a P-SSN and an SMG in the same sample can be used to verify the sample-specificity of a P-SSN. The topological distance (see Materials and Methods section) is the average distance between genes in a P-SSN and an SMG on a background network, and a topological distance with a significant P-value (<0.05) indicates that the SMG is nearer to the P-SSN than to randomly chosen genes from a single sample. The functional distance (see Materials and Methods section) evaluates the close association between genes in a P-SSN and an SMG in terms of KEGG pathways, and a functional distance with a significant P-value (<0.05) implies that the SMG is more similar in function to the P-SSN than genes randomly chosen from a single sample.
The topological distance and the functional distance between each P-SSN and the corresponding SMG were calculated for every sample from various cancer datasets. In each cancer dataset, all tumor samples had significant P-values of the topological distance between the P-SSN and the corresponding SMG (Figure 3A). This meant that the distance between the P-SSN and the corresponding SMG was significantly short in the network for every single sample, and the P-SSN was, indeed, sample-specific at a network level. From a functional perspective, >89% of samples on average for all cancers were significantly related between each P-SSN and the corresponding SMG based on the functional distance of KEGG pathways (Figure 3A).

Validation of sample-specificity of P-SSNs and prediction of individual-specific driver genes based on P-SSNs. (A) The blue bar was the ratio of significant samples considering the topological distance between the genes in a P-SSN and the genes in the corresponding sample’s SMG. The orange bar was the ratio of significant samples considering the functional distance between the genes in a P-SSN and the genes in the corresponding sample’s SMG. The green bar was the ratio of significant samples in terms of enrichment of the known cancer-related genes for each P-SSN in the CGC database. The x-axis is the four kinds of cancer and the y-axis is the ratio of the significant samples. (B) The percentage of significant samples in terms of enrichment to the known cancer genes in CGC for top 20, 15, 10 potential driver genes and random 20, 15, 10 genes of SMG in the corresponding P-SSN. (C) The percentage of significant samples in enrichment analysis of known driver mutation genes (DMGs) for top 20, 15, 10 potential driver genes and random 20, 15, 10 genes of SMGs in the corresponding P-SSN. (D) The percentage of significant samples in terms of enrichment to the known cancer-related in KEGG cancer pathway for top 20, 15, 10 potential driver genes and random 20, 15, 10 genes of SMG in the corresponding P-SSN. (E) The percentage of significant samples in enrichment analysis of the specific KEGG cancer pathway genes (nonsmall cell lung cancer pathway, breast cancer pathway, endometrial cancer pathway, hepatocellular carcinoma pathway) for top 20, 15, 10 potential driver genes and random 20, 15, 10 genes of SMGs in the corresponding P-SSN in LUAD, BRCA, UCEC and LIHC cancer, respectively. (F) Comparison between P-SSN and other nine traditional methods for driver gene prediction. The figure shows the P-value of enrichment analysis of the predicted driver genes of four tumors by P-SSN and the other nine methods. The enrichment was based on the cancer-related genes in the CGC database and the known DMGs for all cancer. The bold values were the best results in every column.
These results were similar to the conclusions based on topological distance, thus validating the P-SSN as being sample-specific and confirming that it can be used to characterize individual features at a network level.
Each P-SSN was then tested for its enrichment to the known cancer-related genes from the CGC database using a hypergeometric test (see Materials and Methods section), and the results showed that all P-SSNs from tumor samples were significantly enriched for known human tumor genes (Figure 3A). This meant that the P-SSN was associated with the cancer process, and the tumor characteristics of a sample can also be reflected by the P-SSN of a sample.
Predicting individual-specific driver genes based on P-SSNs
A driver mutation is a mutation that can give a cell a growth advantage and lead to positive selection when cancer occurs [20]. DMGs are usually identified from the statistical results of mutations from multiple tumor samples. As mutations of these genes only appear in some tumor samples, traditional DMG cannot be used to explain the cause of tumors for every individual. Here, we propose a method to predict driver genes from a specific tumor sample based on the P-SSN of a focal individual.
The expression levels of many genes are changed in the process of cells transforming from normal to tumor. However, most differentially expressed genes are not drivers, but rather result of tumorigenesis. Real tumor driver genes can be the regulatory genes that control these differentially expressed genes (see Materials and Methods section). Thus, the more differentially expressed genes that can be regulated by a SMG, the greater the possibility that the SMG is a potential driver gene. In this work, differentially expressed genes were identified from normal and tumor for each tumor dataset by Student’s t-test. For each individual sample, the SMG of a sample was mapped onto the sample-specific P-SSN, and the differentially expressed genes were also mapped onto the P-SSN. The first-order neighbors of the SMG on the P-SSN were identified for every SMG, and the differentially expressed genes in the first-order neighborhood were used to search for the second-order neighbors of the SMG in the P-SSN. The second-order neighbors that were connected with differentially expressed first-order neighbors in the P-SSN of the SMG with differential expression were considered as the coverage genes regulated by the SMG. Then, the SMGs were ranked based on the number of their coverage genes in the P-SSN in every sample, and the SMGs with top rank coverage were identified as potential driver genes for each sample. The SMGs with the top 10, 15 and 20 genes in terms of coverage in the P-SSN were considered as potential individual-specific driver genes and were used to validate the potential driver features for each sample.
Functional enrichment analysis was performed for the potential driver genes from each tumor sample to known cancer genes from the CGC database, known DMGs, and the ‘pathways in cancer’ in KEGG and corresponding KEGG-specific cancer pathways. We found that the potential driver genes had higher percentages of significant samples than randomly chosen SMGs in our enrichment analysis to known cancer genes, known DMGs, oncogenes in ‘pathways in cancer’ from KEGG, and oncogenes from the corresponding KEGG-specific cancer pathway for every tumor dataset (Figure 3). For the top 10 potential individual-specific driver genes in samples of BRCA as an example, the potential driver genes of 91.82% of the samples were significantly enriched in the CGC database, while only 32.8% of the samples were significantly enriched in the CGC database using 10 randomly chosen SMGs (analysis repeated 100 times; Figure 3B). The potential driver genes in 90.52% of the samples were significantly enriched to known DMGs, and only an average of 14.96% of the samples were significantly enriched in known DMGs (100 repeats; Figure 3C). The potential driver genes of 87.99% of the samples were significantly enriched in ‘pathways in cancer’ in the KEGG (Figure 3D), and all samples were significantly enriched in the ‘breast cancer’ pathway of the KEGG (Figure 3E). For random SMGs with 100 repeats, an average of 9.92% of the samples was significantly enriched in ‘pathways in cancer’ (Figure 3D) and 2.95% of the samples in the ‘breast cancer’ pathway in the KEGG (Figure 3E). There were similar results for other tumor datasets (Figure 3). The potential driver genes were more significantly enriched in the known tumorigenic factors, e.g. CGC, DMG or tumor pathways, than random SMGs. This meant that more samples supported the potential driver genes as belonging to these known tumorigenic factors or pathways than other SMGs. Thus, the P-SSN method provides an effective strategy for predicting cancer driver genes for a specific patient.
Individual-specific driver genes can be detected for cancer based on the integration of a P-SSN and the corresponding SMG. We collected the top 10 potential driver genes of every tumor sample for different types of cancer, and these genes needed to appear in at least 5% of all samples for this type of tumor to be selected as a driver gene (Supplementary Table S2). The driver genes for a form of cancer were further validated by enrichment analysis of these genes using tumorigenic information, e.g. known cancer-related genes and known DMGs. In other words, this result suggests that the driver genes for cancer tend to be disordered or mutated in tumor samples from various types of cancer.
For validating our method in predicting DMGs, nine other driver gene detection methods—oncodriveCLUST [21], e-Drivers [22], Dendrix [23], COMDP [24], iPAC [25], OncodriverFM [26], ActiveDriver [27], MSEA [28] and NetBox [29]—were used to predict potential driver genes (Supplementary Table S2), and we compared the results with our method. We chose the same number of predicted driver genes from every method for each tumor and determined the enrichment of these predicted driver genes for known cancer-related genes from CGC and DMG. The P-value of the enrichment is shown in Figure 3F for every method of each type of cancer. For BRCA driver gene predictions, the result of Dendrix obtained a P-value of 4.9 × 10−11 in CGC enrichment, and P-SSN’s result was 7.4 × 10−11 (Figure 3F). Although the P-value of Dendrix was superior to that of the P-SSN for the enrichment in CGC, the values were of the same order of magnitude. For LUAD driver gene prediction, NetBox obtained a P-value of 1 × 10−10 in CGC enrichment, and this was superior to the P-SSN’s result of 1.2 × 10−5 (Figure 3F). For other prediction methods, the driver genes from the P-SSN method were significantly more enriched in known cancer-related or driver genes, and P-SSN had superior results under most conditions (Figure 3F). In addition, P-SSN was the only method that could predict the driver genes from a single sample based only on the expression data from a single sample. We also calculated the driver genes identified by P-SSN that were not identified by the other nine tested methods. As shown in Supplementary Figure S1, the proportions of driver genes identified by PSSN but not by the other nine methods for LIHC, LUAD, UCEC and BRCA were 64.71, 74.29, 73.91 and 68.57%, respectively.
Similarity measures between samples by network distance
Similarity measures—which can quantify the similarity or distance between any two samples—are the basis of clustering algorithms. Using different methods for calculating these distances may influence the results of clustering. Many existing distance methods can be used to calculate and measure the similarity between two samples based on gene expression data. Most of these methods only measure the similarity of gene expression between samples, and they often ignore the similarity of regulation among genes.
P-SSN, however, can construct a network based on a single sample, meaning it can thus reflect the regulations among genes in a sample. By employing a P-SSN, we can construct a new similarity evaluation index, i.e. a network distance (see Materials and Methods section) that can evaluate the similarity between two samples at a network level.
For validating the effects of network distance, nine similarity distances (Euclidean, Cosine, Correlation, Braycurtis, Kulsinski, Canberra, Chebyshev, Jaccard and Sqeuclidean) were used to identify the potential subtypes of tumor samples by employing a hierarchical clustering algorithm. The log-rank P-values from survival analysis of these subtypes were used to evaluate the effects of different distance measures in subtype identification (Figure 4J). For survival analysis of BRCA, KIRC and LUAD, the log-rank P-values for the other nine distance measures were nearly nonsignificant, but the log-rank P-value of the network distance was more significant. This means that the network distance had better predictive power than other distance measures (Figure 4J) in identifying subtypes of tumor samples, and the network distance can improve the effectiveness of clustering algorithms in subtyping tumor samples.

The survival curves for subtyping three cancers. (A) P-SSN method for BRCA. (B) P-SSN method for KIRC. (C) P-SSN method for LUAD. (D) ConsensusClusterPlus for BRCA. (E) ConsensusClusterPlus for KIRC. (F) ConsensusClusterPlus for LUAD. (G) tSNE + Kmeans for BRCA. (H) tSNE + Kmeans for KIRC. (I) tSNE + Kmeans for LUAD. (J) Comparison between network distance and other nine traditional distances in the subtype identification for BRCA, KIRC and LUAD. The figure showed the log-rank P-value of survival analysis for the subtypes of three tumors, and the subtypes were obtained by hierarchical clustering algorithm based on different distances. The bold values were the best results in every row.
Distinguishing tumor sample from normal samples, classifying different tumor samples and identifying subtypes of tumor samples using P-SSNs
P-SSN was constructed by computing the sPTCC from every sample, and then the network distance can be calculated between any two samples based on the P-SSN.
The differential edges (sPTCCs) were identified between normal and tumor samples by Student’s t-test. We attempted to distinguish each tumor sample from normal samples by hierarchical clustering using sPTCC. Each tumor sample and all normal samples were mixed together, and the hierarchical clustering method was employed to distinguish the tumor sample from normal samples using the sPTCC values of five differential edges from the P-SSNs of samples.
The clustering method using sPTCC of differential edges can distinguish each tumor sample from normal samples with 100% accuracy for various types of cancer (Figure 5A and Supplementary File 1). For example, there are no samples for LIHC that were wrongly classified (Figure 5A). Herein, sample ‘AAV0’ was used as an example to show the results of hierarchical clustering for a tumor sample and all normal samples in LIHC (Figure 5A). The method based on P-SSN can accurately separate any tumor sample from a group of normal samples (Supplementary File 1).

Classification for cancer phenotypes and different cancer types. (A) Heat maps of five differential edges in P-SSN for a tumor sample and all normal samples in LIHC, UCEC, LUAD and BRCA. Colors of sample labels corresponded to various clusters, i.e. red label corresponded to the tumor sample, and black labels corresponded to the normal samples. The accuracy of classification for a tumor sample and total normal samples in LIHC, UCEC, LUAD and BRCA is all 100% (Supplementary File 1). (B) The clustering dendrogram of four types of cancer (the green bar, the red bar, the blue bar and the purple bar correspond to the UCEC, the BRCA, the LUAD and the LIHC, respectively) was generated by hierarchical clustering based on network distance among P-SSNs. For visualization, we randomly selected 30 samples from four types of cancer for classification. The accuracy of the classification of four types of cancer is 100% (Supplementary Figure S2).
The tumor samples from different types of cancer (BRCA, UCEC, LUAD and LIHC) could be classified by hierarchical clustering based on the network distance. The results showed that the samples from the four cancers were separated into four clusters without false-positive results (Figures 5B and Supplementary Figure S2). The 100% accuracy of the classification implied that individual-based association information from P-SSN was useful and effective for identifying various types of complex diseases.
Next, hierarchical clustering was used to separate the tumor samples into different subtypes based on network distance for LUAD, KIRC and BRCA. Two other popular methods (‘ConsensusClusterPlus’ [30] and ‘tSNE + Kmeans’ [31]) for subtyping tumor/cancer samples were used to compare their effectiveness with our method.
The log-rank P-value of survival analysis was then used to evaluate the effectiveness of different clustering methods for the identification of potential subtypes in different cancer samples (Figures 4 and 6B). The log-rank P-value for our method was 0.01, and this was significant for clustering four subtypes in BRCA, but the P-values for both ‘ConsensusClusterPlus’ and ‘tSNE + Kmeans’ were not significant, with respective values of 0.98 and 0.45 for subtypes in BRCA (Figure 4A, D and G). For KIRC, the P-values for three methods were significant in clustering subtypes, but the P-value of our method (|$3\times{10}^{-10}$|) was more significant than with the other two methods (|$2\times{10}^{-4}$| and |$8\times{10}^{-5}$|; Figure 4B, E and H). For LUAD, there were significant and similar P-values for our method (|$2\times{10}^{-4}$|) and ‘tSNE + Kmeans’ (|$1\times{10}^{-4}$|), but the P-value for ‘ConsensusClusterPlus’ was not significant at 0.86 (Figure 4C, F and I). The results showed that our method had better performance compared with the other two tested methods for LUAD, KIRC and BRCA (Figure 4). In particular, the number of different subtypes for KIRC was four based on the literature [32], but our method found the log-rank P-value of survival analysis for five subtypes (|$3\times{10}^{-10}$|) similar to the P-value of the previously identified four subtypes (Figure 4B and Supplementary Figure S3). Thus, we considered that five subtypes may be more suitable than four for KIRC in this dataset. This means that our network-based clustering method can find new subtypes of cancers and thus has potential applications in medicine or personalized therapy.

The P-SSN/P-CSN clustering based on network distance. (A) The framework of P-SSN/P-CSN clustering based on network distance. (B) The comparison between P-SSN clustering, ConsensusClusterPlus, and tSNE + Kmeans in subtypes identification for LUAD, KIRC and BRCA, evaluated by the log-rank P-value of survival analysis. (C) The comparison between P-CSN and SEURATE, SNN-Clip, SINCERA, tSNE+kmeans, pcaReduce in clustering of scRNA-seq data, evaluated by ARI.
Single-cell clustering by network distance using P-CSNs
Eight published scRNA-seq datasets [12–19] (Supplementary Table S3) were used to test the ability of network distance (see Materials and Methods section) in single-cell clustering. The results demonstrated that the cell-specific network can improve the unsupervised classification of single-cell data. Taking the Biased dataset [12] as an example, only two single cells were wrongly classified, and the accuracy of the classification based on the network distance of P-CSN was 95.92% (Supplementary Figure S4).
To benchmark P-CSNs clustering, five methods (SEURAT [33], SNN-Clip [34], SINCERA [35], pcaReduce [36] and t-SNE + k-means [31]) were used to compare the effectiveness of single-cell clustering (Supplementary Information). We used the Adjusted Rand Index (ARI) to quantify the similarity between the predicted labels obtained by these clustering methods and reference labels. This value ranges from 0 to 1 [37]. P-CSNs performed better than the five previous methods, with a few exceptions (Figure 6C). In the dimension-reduction analysis as shown in Supplementary Figure S5, P-CSN clustering could also distinguish various cell types more clearly than the previous ‘tSNE + kmeans’ method.
Materials and methods
Computing PTCC for a single sample
Here, |${w}_X^{\ast }$| and |${w}_Y^{\ast }$| are regression coefficient vectors for gene X and gene Y with Z, respectively, that is, (k1,b1)T and (k2,b2)T in Figure 1A, and |$\langle a,b\rangle$| means the scalar product between the vectors a and b.
A significance test can be done to evaluate the sPTCC by the methods of single-sample network theory [6]. The distribution of sPTCC was clearly close to the normal-product distribution by simulation (Supplementary Information). The sPTCC can be considered as the perturbation of sample d against the reference samples. This reflects the relation between genes X and Y controlling for Z in sample d by partial correlation based on reference samples.
In this study, when a single sample d was added to the reference samples (with length m), the regression functions of genes X and Y to Z in reference samples were directly used to calculate the residuals of X and Y to Z in sample d (Figure 1A). In other words, the regression coefficients |${w}_X^{\ast }$| and |${w}_Y^{\ast }$| in Equations (4) and (5) were directly used to calculate the residuals |${e}_{xd}$| and |${e}_{yd}$| in sample d (Figure 1A).
Constructing a background network using reference samples
For constructing a P-SSN, a background network is required based on reference samples for reducing algorithm complexity and computational cost. A background network is constructed by calculating the PCCs between any two genes based on the gene expression in reference samples (Figure 2A), and the ‘SciPy’ extension module (http://www.scipy.org/) of the Python programming language was employed to calculate the PCC. In other words, we compute the PCC for each pair of genes in reference samples to retain the significant correlated (PCC > 0.7) gene pairs as edges in the background network.
Constructing sample-specific networks based on PTCC (P-SSN)
The PTCCm between two nodes of an edge can be calculated by Equation (1) controlling for each gene that directly connects to both nodes of an edge on a background network with length m (Figure 2A). Statistical testing of the PCC can be used to test the statistical significance of PTCC via Student’s t-distribution with degrees of freedom m−2 (https://en.wikipedia.org/wiki/Pearson_correlation_coefficient). An edge whose PTCCs controlling for every gene was statistically significant (P-value <0.01) was defined as a significant edge. A reference network was built by keeping significant edges and excluding nonsignificant edges from the background network based on the PTCCs of reference samples (Figure 2A). The reference network usually contains the common direct regulation or interactions of all reference samples and eliminates the indirect regulation or interactions of the background network.
Then, a new sample d is added to the reference samples to form a new combined sample set with length m + 1, and the PTCCm + 1 between two nodes of an edge can be calculated by Equation (6) controlling for each gene that directly connects to both nodes of the edge on the background network (Figure 2B). The perturbed network was built by keeping significant edges and eliminating nonsignificant edges of the background network (Figure 2B).
The sPTCCs of an edge for sample d can be calculated by the difference between PTCCm + 1 and PTCCm in Equation (7) for an edge with the same control nodes (Figure 1B). The statistical significance of sPTCC of an edge in sample d can be evaluated by the theory of SSN [6]. If every sPTCC∆PTCC of an edge with different control genes is statistically significant, the edge is considered as a significant edge to be retained in the P-SSN of sample d (Figure 2C). The final P-SSN for a single sample is built by keeping significant edges and eliminating nonsignificant edges of the sample from the background network based on sPTCCs (Figure 2C). The difference between the perturbed network and the reference network is due to the addition of a new sample d. If the gene expression pattern of a sample d is similar to the reference samples, the changes of most edges in the perturbed network might not be remarkable or significant after adding sample d to the reference samples. On the contrary, if the reference samples and sample d differ significantly in their gene expression pattern, the changes of PTCC on many edges in the perturbed network may be quite significant after adding sample d to the reference samples. Each edge in P-SSN represents unique or specific characteristics of a single sample.
Computing topological distance and functional distance between the SMG and genes of a P-SSN in a single sample
Topological distance is based on the shortest distance between two gene sets on the background network for a single sample [6], where one gene set represents genes in a P-SSN, and the other gene set is the SMGs of a single sample. The topological distance between SMG and P-SSN is defined as the average shortest path between genes in SMG and P-SSN on the background network for a single sample. The statistical test for the topological distance between SMG and P-SSN is done by comparing the topological distance between SMG and P-SSN with that between SMG and random gene sets, and the probability that the topological distance between SMG and P-SSN is greater than that between SMG and random gene set is considered as the significant P-value of the topological distance between SMG and P-SSN in a single sample. The details of the calculation steps for the topological distance are given in reference [6].
Enrichment analysis of a network for known cancer-related genes
We downloaded 719 known cancer-related genes from the CGC database. The enrichment of a P-SSN for cancer-related genes in a single sample was calculated by the hypergeometric test. Significant enrichment was obtained by hypergeometric tests with the P-values <0.05. The proportion of samples significantly enriched in known cancer-related genes was counted for every cancer dataset and was used to explain whether the P-SSN of a single sample was indeed related to known cancer genes.
Network distance between samples/cells for clustering
Gene filtering
There are few genes with zero values in terms of gene expression in bulk RNA-Seq data but many genes with zero values in scRNA-Seq data. For single cell data, the genes with zero values for gene expression in most cells provide little information for clustering and can mislead the results of some methods (e.g. PCC). Removing genes with zero expression values from expression profiles can reduce the dimensions of the data and speed up the clustering procedure. In this method, the genes whose expression values were 0 in >50% of samples/cells were removed from expression profiles and ignored in further calculations (Figure 6A, Supplementary Table S4, Supplementary Figure S6 and Supplementary Information).
Constructing P-SSN/P-CSN
A cell from a scRNA-Seq data set was similar to a sample in a dataset from bulk RNA-Seq. Hence, a cell-specific network can be constructed by sPTCC (Figures 1 and 2) from a single cell to form a P-CSN from scRNA-Seq data. Thus, both P-SSN and P-CSN can be constructed for each sample/cell by sPTCC (Figures 2 and 6A).
Network distance matrix
The network distance matrix D is defined as follows:
|$D={\Big({d}_{ij}\Big)}_{n\times n}$|,
where n is the total number of samples (or cells) (Figure 6A).
Screening for differential expression genes
The gene expression data for each cancer dataset from TCGA (Supplementary Table S1) were divided into two groups, i.e. normal and tumor groups. The differentially expressed genes between normal and tumor were identified by Student’s t-test implemented using the ‘SciPy’ extension module of the Python programming language.
Discussion
In biology and medicine, we are interested in finding out whether there is an association/interaction between two genes, but misleading results could be obtained by traditional correlation/association analysis if another confounding gene was causally related to both focal genes. Therefore, PTCCs were applied to construct a sample-specific network (P-SSN) that removed the indirect interactions from the confounding genes but kept the direct interactions among genes.
Our method indeed requires multisamples as a reference dataset. Given the reference dataset, the network of any single (test) sample can be constructed. It can be applied to various fields, e.g. clinic tests as network biomarkers (NBs). Specifically, given several normal samples as reference samples, we can first construct P-SSN or NBs of each individual with the clinic sample. Then, the P-SSN can be used to analyze the clinic test. If the P-SSN is significant, the individual’s state is significantly different from a normal state (thus possibly disease state). On the other hand, if the P-SSN is not significant (small), it means that the individual’s state is near to the normal state.
P-SSN is the differential network between the network of a single tumor sample and the reference network of normal samples. The greater the difference, the more clearly can the specific features of every single sample be characterized. Theoretically, the reference samples can be chosen from any kind of sample, but choosing those reference samples with common gene expression patterns from the normal samples corresponding to this cancer certainly increases the discriminating power of its P-SSN. To indicate that different reference samples size has small or little impact on the P-SSN construction, the breast cancer data from TCGA was tested. We have randomly chosen 15, 30, 45, 60 and 75 normal samples from 98 normal samples of BRCA as reference samples, and constructed the P-SSN for each tumor sample. The new P-SSNs were then compared with the P-SSNs from all normal samples, and this procedure was repeated 1000 times. The average overlap rate of edges in the new P-SSNs corresponding to the old P-SSNs from all reference samples is 88.21%, 87.65, 89.38, 90.01% and 88.92% for sizes 15, 30, 45, 60 and 75, respectively (Supplementary Table S5). The comparison shows that the size of different reference samples has small or little effect on the calculation of P-SSN.
Although our method based on sPTCC is to construct a single-sample network (SSN), sPTCC is not the partial correlation for a single sample. sPTCC is a differential network between the partial correlation with all (m + 1) samples and the partial correlation with all m samples. Thus, sPTCC reflects the impact of the single sample on the reference samples in terms of network/correlation. However, as shown in the mathematical derivation (Supplementary Information), we can show that under the extreme conditions, i.e. when the reference sample size is sufficiently large and the reference samples are random, sPTCC converges proportionally to the partial correlation of the single sample (Supplementary Information). Thus, P-SSN is not the real correlation network of a single sample, and it mainly reflects the correlation/association change among genes of a single sample against the reference network.
The scRNA-Seq data are usually sparse, and there are many genes with zero value of gene expression due to various reasons. The problem with the missing or zero value in most cells can result in the wrong conclusions of some methods (e.g. correlation coefficient or PCC method). Removing those genes with zero expression is widely used in scRNA-Seq analysis. Besides, removing the genes with zero expression values from expression profiles can also reduce the dimension of the data and speed up the clustering procedure from the computational viewpoint. In this work, the Yan dataset [13] was used to test the robustness of the results in different choices of genes filter. We have taken different measures for genes filter, and the genes were removed from all cells if zero value of gene expression was >50% (Measure 1), 60% (Measure 2), 70% (Measure 3), 80% (Measure 4) and 90% (Measure 5) of all cells, respectively. We used the ARI to quantify the similarity of clustering in different measures (Supplementary Table S4). We also did the dimension-reduction analysis of various clustering measures (Supplementary Figure S6A, B, C, D and E correspond Measure 1, 2, 3, 4 and 5, respectively). The comparison results show that Measure 1, i.e. removing the genes whose expression value was 0 in >50% cells, did not affect the clustering result.
To compare with the existing subtypes of BRCA, subtyping P-SSN based on network distance can obtain a similar result with existing tumor subtypes of BRCA (Supplementary Figure S11, Supplementary Table S7 and Supplementary Information).
On the other hand, there are >20 000 genes in humans, and the whole-correlation network for humans has over 200 000 000 potential edges. Hence, it is computationally intense to construct such a network. However, most edges in such whole-correlation networks are actually indirect connections/interactions in a molecular network. For reducing the computational and storage burden, we constructed a background network from reference samples to filter indirect edges/interactions, in this work.
The SMG and P-SSN are sample-specific information, and the differentially expressed genes between normal and tumor samples are tumor-specific information. Thus, the integration of SMG, P-SSN and differentially expressed genes can aid in identifying the potential driver genes in a specific sample. And the performance of P-SSN in network construction and driver gene prediction is usually better than SSN and gene expression (Supplementary Figures S8 and S9 and Supplementary Information).
The sPTCC measures the direct correlations between a single sample and the reference samples and quantifies the change in the degree of correlation by adding an individual sample to a group of reference samples. If a single sample has significant changes in the correlation of two genes of the reference samples, this implies that the regulation of these two genes is inconsistent with the reference samples. This inconsistency in regulation may be caused by differential expression or mutation of either or both two genes. Therefore, sPTCC detection is a sensitive method that can identify potential cancer-related genes, and the method is superior to traditional detection via differential expression. In other words, those genes identified by traditional testing may play an important role in diseases at a network level rather than at the gene expression level. This phenomenon is similar to non-RNAs that are regarded as the ‘dark matter’ in gene sequences [6, 8, 38]. Thus, our method manifests that these nondifferential genes may have rich information concerning diseases and are the ‘dark genes’ in terms of expression [6, 8, 38].
It should be declared that the P-SSN is a theoretical network constructed using a single sample that quantifies the changing degree of correlations of a single sample against the reference samples. The P-SSN reflects the differences between normal and cancer samples from a regulation, interaction or network perspective. A traditional molecule networks are aggregated networks on a multiple samples’ basis, while a P-SSN is a specific network based on a single sample. Therefore, compared to traditional molecular biomarkers, it can be applied to construct NBs [6, 8] or dynamic network biomarkers [39–43] for individual diagnosis and prediction, and be a potential tool in personalized medicine.
A method to construct an SSN or sample-specific network based on PTCC measures the perturbation degree of a single sample against reference samples from a network perspective.
The results of the application of P-SSN to cancer data from TCGA not only validated the effectiveness of our method but also predicted individual-specific disease genes, identified cancer phenotypes and further classified subtypes of cancer.
By extending P-SSN from a sample level to a cell level, P-CSN improves the unsupervised classification of scRNA-seq data.
Funding
National Key R&D Program of China (No. 2017YFA0505500), National Natural Science Foundation of China (NSFC) (Nos. 61403363, 31930022, 31771476), Key Project of Natural Science of Anhui Provincial Education Department (No. KJ2020A0018, KJ2016A002), Key project of teaching and research of Anhui Finance and Economics University (No. acjyzd201606), Key research projects of Humanities and Social Sciences in Colleges and Universities of Anhui Province (No. SK2017A0848) and Shanghai Municipal Science and Technology Major Project (No. 2017SHZDZX01).
Yanhong Huang is a master’s student in the Institute of Statistics and Applied Mathematics, Anhui University of Finance & Economics, Bengbu 233030, China, and School of Mathematics and Statistics, Shandong University at Weihai, Weihai 264209, China. Her research interests include computational systems biology and gene regulatory networks.
Xiao Chang is an associate professor in the Institute of Statistics and Applied Mathematics, Anhui University of Finance & Economics, Bengbu 233030, China. Her research interest includes bioinformatics and applied statistics in complex diseases.
Yu Zhang is a master’s student in the School of Mathematics and Statistics, Shandong University at Weihai, Weihai 264209, China. Her research interests include computational biology and bioinformatics and biological sciences.
Luonan Chen is a professor and executive director in a Key Laboratory of Systems Biology, Center for Excellence in Molecular Cell Science, Shanghai institute of Biochemistry and Cell Biology, Chinese Academy of Sciences, Shanghai 200031, China, Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming 650223, China, Shanghai Research Center for Brain Science and Brain-Inspired Intelligence, Shanghai 201210, China, and Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Hangzhou 310024, China. His interests include systems biology, computational biology, bioinformatics and applied mathematics.
Xiaoping Liu is a professor in the School of Mathematics and Statistics, Shandong University at Weihai, Weihai 264209, China. His research studies are performed in the fields of computational systems biology and bioinformatics with a special focus on complex disease.
References
Author notes
Yanhong Huang and Xiao Chang contributed equally to this paper as the first authors.