Graphlet Laplacians for topology-function and topology-disease relationships

Abstract

Motivation

Laplacian matrices capture the global structure of networks and are widely used to study biological networks. However, the local structure of the network around a node can also capture biological information. Local wiring patterns are typically quantified by counting how often a node touches different graphlets (small, connected, induced sub-graphs). Currently available graphlet-based methods do not consider whether nodes are in the same network neighbourhood. To combine graphlet-based topological information and membership of nodes to the same network neighbourhood, we generalize the Laplacian to the Graphlet Laplacian, by considering a pair of nodes to be ‘adjacent’ if they simultaneously touch a given graphlet.

Results

We utilize Graphlet Laplacians to generalize spectral embedding, spectral clustering and network diffusion. Applying Graphlet Laplacian-based spectral embedding, we visually demonstrate that Graphlet Laplacians capture biological functions. This result is quantified by applying Graphlet Laplacian-based spectral clustering, which uncovers clusters enriched in biological functions dependent on the underlying graphlet. We explain the complementarity of biological functions captured by different Graphlet Laplacians by showing that they capture different local topologies. Finally, diffusing pan-cancer gene mutation scores based on different Graphlet Laplacians, we find complementary sets of cancer-related genes. Hence, we demonstrate that Graphlet Laplacians capture topology-function and topology-disease relationships in biological networks.

Availability and implementation

http://www0.cs.ucl.ac.uk/staff/natasa/graphlet-laplacian/index.html

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Systems biology is flooded with large scale ‘omics’ data. Genomic, proteomic, interactomic, metabolomic and other data, are typically modelled as networks (also called graphs). This abundance of networked data started the fields of network biology, allowing us to uncover molecular mechanisms of a broad range of diseases, such as rare Mendelian disorders (Smedley et al., 2014), cancer (Leiserson et al., 2015) and metabolic diseases (Baumgartner et al., 2018). In personalized medicine, network analysis is applied to the tasks of biomarker discovery (Li et al., 2015), patient stratification (Gligorijević et al., 2016) and drug repurposing (Durán et al., 2018).

Many network analysis methods use the Laplacian matrix as it captures the global wiring patterns of a network (see Section 1.1). These methods include spectral clustering, spectral embedding and network diffusion. Each of these families of methods relies on the fact that the eigendecomposition of the Laplacian matrix naturally uncovers network clusters (see Section 1.2). Applications of spectral embedding include visualizing genetic ancestry (Lee et al., 2010) and pseudo-temporal ordering of single-cell RNA-seq profiles (Campbell et al., 2015). Applications of spectral clustering include detection of functional sub-network modules in single-cell genomic networks (Bartlett et al., 2017) and identification of functional modules in co-regulatory networks (Luo et al., 2018). Network diffusion methods are widely used for protein function prediction (Cao et al., 2013) and discovery of disease genes and disease modules, see Cowen et al. (2017) for a full review.

1.1 Laplacian matrix definition

The Laplacian matrix captures the global structure of a network: for each node it captures the adjacency relationship with other nodes (i.e. who are its neighbours) and its degree centrality (a measure of the node’s importance in the network). In a network, G(V, E), two nodes, u and v, are adjacent if there exists an edge

(u, v) \in E

connecting them. The adjacency of all nodes in graph G is represented in an n × n symmetric adjacency matrix A:

A (u, v) = {\begin{matrix} 1 & if (u, v) \in E \\ 0 & otherwise . \end{matrix}

(1)

The neighbourhood of a node is defined as the set of nodes adjacent to it. A node’s degree is the size of its neighbourhood, or equivalently, the number of nodes that are adjacent to it. The degree matrix of G is defined as the diagonal matrix, D, where D(u, u) is equal to the degree of node u:

D (u, v) = {\begin{matrix} deg (u) & if u = v \\ 0 & otherwise . \end{matrix}

(2)

The Laplacian, $L$ ⁠, is defined as $L = D - A .$ The symmetrically normalized Laplacian, L^sym, is defined as $L^{sym} = D^{- 1 / 2} L D^{- 1 / 2}$ ⁠.

1.2 Laplacian matrix eigendecomposition

Spectral clustering, spectral embedding and network diffusion analyse networks based on the eigendecomposition of the Laplacian matrix naturally uncovering densely connected sub-networks present in the network. The eigendecomposition of

L

⁠, is defined as

L U = U Λ

⁠, where the ith column of the n × n matrix U is known as the ith eigenvector and Λ is the diagonal matrix whose diagonal elements are the corresponding eigenvalues. Uncovering densely connected sub-networks present in the network (i.e. network clustering), can be defined as solving the ratio-cut problem: to cut a network into d similar-sized sub-networks whilst minimizing the number of edges being cut. An approximation of this problem is formulated as follows:

\underset{U \in R^{n \times d}}{minimize} trace (U^{T} L U) subject to: U^{T} U = I,

(3)

where each column of U is a normalized indicator vector assigning each node to one of the d sub-networks. This problem is solved by d normalized eigenvectors of

L

associated with d smallest eigenvalues, illustrating how the Laplacian matrix captures clusters present in the network.

1.3 Matrix alternatives to the Laplacian

Laplacian matrices only capture direct interactions between nodes. To capture the influence of long-range interactions between nodes, Estrada (2012) proposed the k-path Laplacian by generalizing the concepts of adjacency and degree. The k-path Laplacian defines a pair of nodes u and v to be k-adjacent if the shortest path distance between them is equal to k. Analogously, k-path degree,

\deg_{k} (u)

⁠, generalizes the concept of the degree to the number of length k shortest paths that have node u as an endpoint. The k-path Laplacian,

L_{k}^{P}

⁠, is defined as:

L_{k}^{P} (u, v) = {\begin{matrix} - 1 & if d (u, v) = k \\ \deg_{k} (u) & if u = v \\ 0 & otherwise . \end{matrix}

(4)

Vicus is an alternative to the Laplacian that captures the intricacies of a network’s local structure (Wang et al., 2017) based on network label diffusion. Label diffusion is defined as P = BQ, where the n × d matrix Q assigns the n nodes of network G to one of d possible labels (for labelled nodes), B is an n × n diffusion matrix, and the reconstructed matrix P is an n × d matrix used for predicting labels for unlabelled nodes. To give Vicus its ‘local’ interpretation, the label diffusion process determining B is constrained to diffuse information of each node only to its direct neighbourhood. Under given assumptions and defining Vicus as $L^{V} = (I - B^{T}) (I - B)$ ⁠, it was shown that Q can be learned as the eigenvectors of $L^{V}$ ⁠. As Q captures the local connectivity between nodes that is implied by the ‘localized’ diffusion matrix B and can be computed as the eigenvectors of $L^{V}$ ⁠, Vicus is interpreted as a Laplacian matrix. Vicus is applied to protein module discovery and ranking of genes for cancer subtyping (Wang et al., 2017).

1.4 Problem

All Laplacian-based applications are based on the same underlying principle of guilt by association, inferring information on a given node based on the group of nodes it is most tightly connected with. However, alternative approaches have inferred information on a given node based on the shape of its interaction pattern, typically independent of the identity of the nodes it is interacting with. These alternative approaches are based on graphlets, small connected sub-graphs (see Section 2.1 for a formal definition), to capture the local topology around a node in a network. For example, graphlet-based methods have been applied to predict protein function (Davis et al., 2015; Milenković and Pržulj, 2008) and to identify new cancer genes (Milenković et al., 2010) directly from the similarities in terms of their interaction patterns in PPI networks.

Alternatives to the Laplacian matrix that take local topology into account have been suggested. The k-path Laplacian captures the influence of long-range interactions between nodes, but ignores short-range interactions. Vicus captures local topology around each node as the strength of its connection to its neighbours after applying a localized label diffusion algorithm. Although Vicus is focused on capturing local topology, it lacks interpretability from a structural perspective.

1.5 Contribution

We introduce the Graphlet Laplacian, allowing us to analyse nodes based on their network neighbourhoods, whilst restricting the pattern of their interactions to that of a pre-specified graphlet. Hence, each graphlet (Fig. 1A) has its own corresponding Graphlet Laplacian. We generalize spectral embedding, spectral clustering and network diffusion to utilize Graphlet Laplacians. Through graphlet-generalized spectral clustering of model networks and biological networks, we show that different Graphlet Laplacians capture different local topologies. By applying graphlet-generalized spectral embedding, we visually demonstrate that Graphlet Laplacians capture biological functions as well. We quantify this through graphlet-generalized spectral clustering analysis. We show that Graphlet Laplacians are not only as biologically relevant as alternative Laplacian matrices, but also capture complementary biological functions. Finally, by graphlet-generalized diffusing of pan-cancer gene mutation scores on the human PPI network, we show that Graphlet Laplacians capture complementary disease mechanisms. We compare our results against those based on alternative state the art Laplacian matrices. A similar methodology based on network motifs was presented by Benson et al. (2016) for spectral clustering of directed networks.

Illustration graphlets and Graphlet Laplacians. Node a is coloured in green throughout. The graphlet counts of node a for graphlet G0 and G1 are coloured in red throughout. (A) All graphlets with up to 4 nodes, labelled G0 to G8. (B) A dummy network. (C) A vector of graphlet counts describing the local topology of node a in the example network, G. Node a touches graphlet G0 via edge (a, b). Node a touches graphlet G1 twice, via paths a-b-c and a-b-e. (D) The Graphlet Laplacians for graphlets G0 and G1, applied on the network, G, shown in B. The diagonal elements correspond to the graphlet counts of each node; e.g. L0G(1,1) is equal to 1, the number of times node a touches graphlet G0, L1G(1,1) is equal to 2, the number of times node a touches graphlet G1. The off-diagonal elements correspond to the number of times two nodes touch a given graphlet together, scaled by size(Gk)−1. L0G(1,2)=−1, as a and b form G0 once and size(G0)−1=1. L1G(1,2)=−1, as a and b form G1 twice and size(G1)−1=2

Fig. 1.

Illustration graphlets and Graphlet Laplacians. Node a is coloured in green throughout. The graphlet counts of node a for graphlet G₀ and G₁ are coloured in red throughout. (A) All graphlets with up to 4 nodes, labelled G₀ to G₈. (B) A dummy network. (C) A vector of graphlet counts describing the local topology of node a in the example network, G. Node a touches graphlet G₀ via edge (a, b). Node a touches graphlet G₁ twice, via paths a-b-c and a-b-e. (D) The Graphlet Laplacians for graphlets G₀ and G₁, applied on the network, G, shown in B. The diagonal elements correspond to the graphlet counts of each node; e.g. $L_{0}^{G} (1, 1)$ is equal to 1, the number of times node a touches graphlet G₀, $L_{1}^{G} (1, 1)$ is equal to 2, the number of times node a touches graphlet G₁. The off-diagonal elements correspond to the number of times two nodes touch a given graphlet together, scaled by $size (G_{k}) - 1$ ⁠. $L_{0}^{G} (1, 2) = - 1$ ⁠, as a and b form G₀ once and $size (G_{0}) - 1 = 1$ ⁠. $L_{1}^{G} (1, 2) = - 1$ ⁠, as a and b form G₁ twice and $size (G_{1}) - 1 = 2$

Open in new tab Download slide

2 Materials and methods

2.1 Graphlet Laplacian definition

We generalize the Laplacian matrix so that it can capture the local topology of a network around a node in a broader sense than the identity of its direct neighbours. One of the most sensitive methodologies to capture network topology around a node are graphlets: small, connected, non-isomorphic, induced sub-graphs of a large network (Pržulj et al., 2004). All graphlets up to four nodes are depicted in Figure 1A. To illustrate how graphlets can be used to quantify the local topology around a node, consider node a in the dummy network presented in Figure 1B. Graphlet G₁ (i.e. a three node path) that touches node a can be found in this dummy network twice: via paths a-b-c and a-b-e. Node a is said to touch G₁ twice. By making these counts for a given node over all graphlets, the local network topology for a given node can be quantified by means of a vector, as illustrated for node a in Figure 1C. Here, we see that node a can be found as part of an edge once (G₀), as part of a three node path twice (G₁), never as part of a triangle (G₂) and so on.

Having established that by counting how often a node touches graphlets can be used to quantify its local topology, we go on and generalize the concept of the Laplacian to that of a Graphlet Laplacian by generalizing the definitions of adjacency and degree to ones based on graphlets. First, we define two nodes u and v of G to be graphlet-adjacent with respect to a given graphlet, G_k, if they simultaneously touch G_k. Going back to our previous example, we find that nodes a and b are graphlet-adjacent w.r.t. graphlet G₁ twice, as G₁ can be induced on the dummy network twice: via paths a-b-c and a-b-e, each time including both nodes a and b. Similarly, nodes a and c and nodes a and e are graphlet-adjacent only once, w.r.t. graphlet G₁.

Given this extended definition of adjacency, we define the graphlet-based adjacency matrix as:

A_{k} (u, v) = {\begin{matrix} a_{uv}^{k} & if u \neq v \\ 0 & otherwise, \end{matrix}

(5)

where

a_{uv}^{k}

is equal to the number of times nodes u and v are graphlet-adjacent w.r.t graphlet G_k. Analogously, the graphlet degree generalizes the node degree as the number of times node u touches graphlet G_k. We extend the degree matrix to the Graphlet Degree matrix for graphlet G_k,

D_{k} (u, v) = {\begin{matrix} d_{u}^{k} & if u = v \\ 0 & otherwise, \end{matrix}

(6)

where

d_{u}^{k}

is the number of times node u touches graphlet G_k. For an underlying graphlet G_k, we define the Graphlet Laplacian

L_{k}^{G}

⁠, as:

L_{k}^{G} = D_{k} - (A_{k} / θ) .

(7)

where

θ = size (G_{k}) - 1

⁠. As opposed to the Laplacian simply capturing for each node its neighbours, the Graphlet Laplacian

L_{k}^{G}

captures for each node how strongly (i.e. frequently) each node is connected in the shape of G_k with each of the other nodes.

L_{0}^{G}

and

L_{1}^{G}

are illustrated in Figure 1D. Finally, note that the Graphlet Laplacian for graphlet G₀,

L_{0}^{G}

⁠, is equivalent to the standard Laplacian,

L

⁠.

2.2 Graphlet Laplacian properties

To allow for an easy interpretation of the Graphlet Laplacian for each graphlet, G_k, we introduce the two-step transformation function, T, which maps graph G to its Graphlet Laplacian representation: $T (G, G_{k}) = L_{k}^{G}$ ⁠. First, T converts $G = {V, E}$ to a weighted network $G' = {V, E'}$ ⁠, where the weight of each edge (u, v) in $G'$ corresponds to $a_{uv}^{k} / (size (G_{k}) - 1)$ measured in G. Next, T converts $G'$ to its standard Laplacian representation. This shows that the Graphlet Laplacian can be interpreted as the Laplacian of an undirected weighted network. Therefore, the Graphlet Laplacian retains the following key properties of the Laplacian:

The Graphlet Laplacian, $L_{k}^{G}$ ⁠, is symmetric and positive semi-definite.
The smallest eigenvalue is 0 and the corresponding eigenvector is the constant vector $1$ ⁠. eigenvector
The Graphlet Laplacian has n non-negative, real-valued eigenvalues: $0 = λ_{1}^{k} \leq λ_{2}^{k}, \dots, λ_{n}^{k}$ ⁠.
The multiplicity of the eigenvalue 0 equals the number of connected components in $G'$ ⁠, which we refer to as graphlet-based components.

2.3 Spectral embedding

Spectral embedding embeds a network in a lower dimensional space, placing nodes close in space if they share many neighbours. Here, we generalize the Laplacian Eigenmap embedding algorithm (Belkin and Niyogi, 2003) so that two nodes are embedded close in space if they frequently simultaneously touch a given graphlet. Given an unweighted network G with n nodes, we find a low-dimensional embedding,

Y = [y_{1}, \dots, y_{n}] \in R^{d \times n}

such that if nodes u and v are frequently graphlet-adjacent with respect to graphlet G_k, then

y (u)

and

y (v)

are close in the d-dimensional space by solving:

\begin{matrix} \underset{Y}{minimize} & \sum_{u = 1}^{n} \sum_{v = 1}^{n} A_{k} (u, v) {‖ y_{u} - y_{v} ‖}^{2} \\ subject to : & Y D_{k} 1 = 0 and Y D_{k} Y^{T} = I, \end{matrix}

(8)

where A_k is the graphlet-based adjacency matrix of G for graphlet G_k, D_k is the graphlet-based degree matrix of G for graphlet G_k. The columns of Y are found as the generalized eigenvectors associated with the 2nd to

(d + 1) th

smallest generalized eigenvalues solving

Y L_{k}^{G} = Λ Y D_{k}

⁠, where Λ is the diagonal matrix with the generalized eigenvalues along its diagonal.

2.4 Spectral clustering

Spectral clustering uncovers groups of nodes in a network that form densely connected network clusters. By generalizing spectral clustering to Graphlet Laplacian-based spectral clustering, we are able to identify network components that are densely connected with respect to a given graphlet. Many different variations of spectral clustering exist (Von Luxburg, 2007). Aiming for a balanced clustering, we generalize normalized spectral clustering as defined by Ng et al. (2002) to use different Laplacians including Graphlet Laplacians, all denoted by a generic $L$ in algorithm 1. We skip the normalization step (i.e. step 1) for Vicus, as Vicus is already normalized.

Algorithm 1.

Normalized spectral clustering

Input A network G with n nodes, and a number of clusters d.

Output d clusters of the n nodes of G.

1: Compute the Laplacian matrix, $L$ ⁠, and corresponding diagonal matrix, D, for the network G.
2: Compute the normalized Laplacian as: $L^{sym} = D^{- 1 / 2} L D^{- 1 / 2}$ ⁠.
3: Compute the d eigenvectors of $L^{sym}$ associated with its d smallest eigenvalues: $Y = [y_{1}, \dots, y_{d}] \in R^{n \times d}$ ⁠.
4: Normalize Y so that each row has unit norm.
5: Cluster the n points ${y}_{u = 1}^{n}$ into d groups using k-means.

For each network, we determine the numbers of clusters, d, by using the rule of thumb: $d \approx \sqrt{n / 2}$ (Kodinariya and Makwana, 2013). In the Supplementary Section S1, we present the justification for this approach, based on inspection of the spectra of different Laplacian matrices of each network. Because of the heuristic nature of spectral clustering, we perform 20 runs for each clustering and consolidate them into a single clustering applying ensemble clustering (Ghosh and Strehl, 2002).

2.5 Network diffusion

Network diffusion refers to a family of related techniques, which propagate information on nodes through the network. Here, we will focus on generalizing the diffusion kernel to graphlet-based diffusion kernel. The diffusion kernel is often called the ‘heat kernel’, as it can be viewed as describing the flow of heat originating from the nodes across the edges of a graph with time. In network biology, nodes typically represent genes and ‘heat’ on a node represents experimental measurements. For a set of n nodes, these measurements are encoded in vector

P_{0} \in R^{n}

⁠. Information is diffused as follows: P = HP₀, where H is a diffusion kernel. For a given graphlet G_k, we define the graphlet-based diffusion kernel,

H_{α}^{k}

⁠, as:

H_{α}^{k} = e^{- α L_{k}^{G}},

(9)

where the parameter

α \in R

controls the level of diffusion. This way, diffusion of information on nodes propagates between nodes restrained by how often they form a given graphlet G_k together.

2.6 Topological dissimilarity of networks

2.6.1 Graphlet correlation distance

The Graphlet Correlation Distance (GCD-11) is the current state of the art heuristic for measuring the topological distance between networks (Yaveroǧlu et al., 2014). First, the global wiring pattern of a network is captured in its Graphlet Correlation Matrix (GCM), an 11 × 11 symmetric matrix comprising the pairwise Spearman’s correlations between 11 different graphlet-based counts over all nodes in the network. The GCD between two networks is computed as the Euclidean distance of the upper triangle values of their GCMs.

2.6.2 Non-graphlet based network descriptors

The difference between the following non-graphlet based network descriptors can be used to measure the distance between two networks:

The degree distribution is the distribution of node degrees over all nodes. It is summarized as a vector of counts, i.e. the kth value is the number of nodes that have degree k. To measure the distance between two networks, this vector is first rescaled to reduce the contribution of higher degree nodes. The pairwise distance between two networks is the Euclidean distance between their rescaled degree distribution vectors. For more details, see (Yaveroǧlu et al., 2014).
The diameter of a connected network is the maximum shortest path distance that is observed among all node pairs. The distance between two networks is the absolute difference of their diameters.
The average clustering coefficient is the total number of three node cliques in the network over the number of possible three node cliques in the network. The distance between two networks is the absolute difference of their average clustering coefficient.

2.7 Cluster enrichment analysis

To assess if a cluster of genes is biologically relevant, we measure if it is statistically significantly enriched in a specific biological annotation term by applying the hyper-geometric test. That is, we consider each cluster as a ‘sampling without replacement’, in which each time we find a given annotation, we count that as a ‘success’. The probability of observing the same or higher enrichment (i.e. successes) of the given annotation purely by chance is equal to:

p = 1 - \sum_{i = 0}^{X - 1} (\begin{matrix} K \\ i \end{matrix}) (\begin{matrix} M - K \\ N - i \end{matrix}) / (\begin{matrix} M \\ N \end{matrix}),

(10)

where N is the number of annotated genes in the cluster, X is the number of genes annotated with the given annotation in the cluster, M is the number of annotated genes in the network and K is the number of genes annotated with the given annotation in the network. An annotation is considered to be statistically significantly enriched if its enrichment P-value is lower than or equal to 5% after application of the Benjamini and Hochberg correction for multiple hypothesis testing.

2.8 Data

2.8.1 Real biological network data collection

We create three types of molecular interaction networks for human and baker’s yeast (Saccharomyces cerevisiae) by collecting the following data: experimentally validated protein–protein interactions (PPIs) from IID version 2018-05 (Kotlyar et al., 2016) and BioGRID version 3.4.161 (Stark et al., 2006), genetic interactions from the same version of BioGRID and gene co-expressions from COXPRESdb version 6.0 (Okamura et al., 2015).

2.8.2 Random model network generation

For each of the following 8 random network models, we generate 10 networks containing 2000 nodes at edge density of 1.5%: Erdős–Rényi random graphs (ER) (Erdős Paul and Rényi Alfréd, 1959), generalized random graphs with the degree distribution matching to the input graph (ER-DD) (Newman, 2010), Barabási–Albert scale-free networks (SF) (Barabási and Albert, 1999), geometric random graphs (GEO) (Penrose, 2003), geometric graphs that model gene duplications and mutations (GEO-GD) (Pržulj et al., 2010), stickiness-index based networks (Sticky) (Pržulj and Higham, 2006), popularity–similarity optimization graphs (PSO) (Papadopoulos et al., 2012) and non-uniform PSO graphs (nPSO) (Muscoloni and Cannistraci, 2018). A summary on the basic properties of these networks and how to generate them can be found in Supplementary Section S2.1.

2.8.3 Biological annotations

For each gene in our biological networks, we collect the most specific experimentally validated biological process annotations (BP), cellular component annotations (CC) and molecular function annotations (MF) present in the Gene Ontology (GO) (Ashburner et al., 2000).

2.8.4 Cancer gene annotations

We collect the pan-cancer gene mutation frequency scores computed by Leiserson et al. (2015) for the purpose of detecting pan-cancer disease modules. Leiserson et al. (2015) collected raw pan-cancer mutation data, such as SNV’s, indels and CNA’s, from the TCGA database (Kandoth et al., 2013). These data were filtered to exclude statistical outliers and include only the samples (corresponding to a patient) for which SNV and CNA data were available. The resulting dataset contains mutations on 11 565 genes across 3110 patients in cancers across 20 different tissues. Additionally, we collect the sets of known cancer driver genes in all available tissues from IntOGen (Gonzalez-Perez et al., 2013) and Cosmic (Futreal et al., 2004).

3 Results and discussion

We investigate the potential usage of Graphlet Laplacians to analyse network data via embedding, clustering and network diffusion experiments. We consider Graphlet Laplacians for graphlets with up to four nodes. We compare our results to the state of the art Laplacian matrices: the standard Laplacian, the k-path Laplacian and Vicus. We consider path lengths up to three for the k-path Laplacian, corresponding to the maximum size of the considered graphlets underlying the Graphlet Laplacian. We set Vicus’ diffusion parameter to 0.9, as this value is recommended in the original paper (Wang et al., 2017) and leads to the largest number of enriched functions (see Section 3.2).

3.1 Graphlet Laplacians capture different local topologies

While the standard Laplacian simply captures the direct neighbourhoods of nodes and can be used to cluster densely connected nodes together, the graphlet-based neighbourhood captured by our Graphlet Laplacian allows for clustering of nodes that strongly participate in a given graphlet of interest. Because different graphlets capture different local topologies around nodes in a network (e.g. G₃ involve paths while G₈ involves cliques), clusters obtained by using different Graphlet Laplacian are expected to possess different topological features, which we assess as follows.

To assess if two graphlet Laplacians, $L_{i}^{G}$ and $L_{j}^{G}$ ⁠, capture different topologies, we apply each Laplacian to cluster nodes in a network using Graphlet Laplacian-based spectral clustering. The resulting clusters are used to partition the network into two sets of sub-networks, by inducing the sub-networks from each clustering. $L_{i}^{G}$ and $L_{j}^{G}$ capture different topologies if the corresponding sets of sub-networks have significantly different topology, which we measure by the overlap of two distributions: the distribution of GCD-11 distances between the sub-networks produced from $L_{i}^{G}$ with the sub-networks produced from $L_{j}^{G}$ and distribution of GCD-11 distances between the sub-networks produced from $L_{i}^{G}$ ⁠. The two Graphlet Laplacians capture statistically significantly different topologies if the Wilcoxon–Mann–Whitney U-test (MWU) between the two distributions of distances is lower than or equal to 5% (see Fig. 2 for the case of $L_{0}^{G}$ and $L_{4}^{G}$ ⁠). For each type of model network, we perform this test ten times and report the least significant P-value for each pairwise comparison of Graphlet Laplacian-based sub-networks. We also considered the following non-graphlet based network distance measures: degree distribution distance, diameter distance and average clustering coefficient distance (see Section 2.6.2). In general and independent of the network distance measure used, clusters obtained from different Graphlet Laplacians are typically statistically significantly topologically different at the 5% significance level. This is true across all of our biological networks and most of our model networks, with some exceptions in geometric models which are known to have homogeneous structure. Thus, Graphlet Laplacians not only capture network cluster that have different topology, but can also be used to measure the structural homogeneity of a given network.

Fig. 2.

Comparison of topological distance distributions between sub-networks captured by two different Graphlet Laplacians in the human PPI network. The distribution of GCD-11 distances between the sub-networks from $L_{0}^{G}$ (in blue) is statistically significantly different from the distribution of GCD-11 distances between the sub-networks from $L_{0}^{G}$ and the sub-networks from $L_{4}^{G}$ (in red) with MWU P-values ≤5%. This means that $L_{0}^{G}$ and $L_{4}^{G}$ capture different topologies in the human PPI network

Open in new tab Download slide

We illustrate this by investigating how the parameters of the PSO/nPSO model networks influence the topological homogeneity of the networks generated, see Supplementary Figures S7–S9. At a low temperature (i.e. nodes are connected to nearby nodes) and low number of communities (i.e. the angle of each node is sampled from a univariate Gaussian), both types of networks are homogeneous. As temperature increases, newly added nodes are more uniformly connected in space (i.e. are more randomly connected), making the generated PSO and nPSO networks closer to ER networks, thus breaking the homogeneous structure. In nPSO networks, increasing the number of communities increases the homogeneity of the networks. This effect is stronger at lower temperatures.

3.2 Different Graphlet Laplacians capture different biological functions

In biological networks, genes having similar functions tend to be densely connected to each other (Hartwell et al., 1999), which is why spectral clustering based on the standard Laplacian matrix has been used to uncover functional regions in networks (Bu et al., 2003). Alternatively, graphlets have been used to show that functionally related genes tend to be similarly wired, independent of them being densely connected (Milenković and Pržulj, 2008). As graphlet Laplacians capture both types of information, they should also capture biological functions.

To informally visualize this, we perform spectral embedding. We focus on the embedding of the yeast GI network, for which we use 14 core biological process annotations defined by Costanzo et al. (2016). We illustrate the spectral embedding of the symmetrically normalized $L_{3}^{G}$ Graphlet Laplacian in Figure 3. The embeddings of the other Laplacian matrices of the yeast GI network can be found in the Supplementary Section S3. As seen in Figure 3, the spectral embedding of $L_{3}^{G}$ correctly groups and separates the biological processes of ‘nuclear cytoplasmic transport’, ‘metabolism/mitochondria’, ‘Golgi/endosome/vacuole sorting’ and ‘Chrom. seg./kinetoch./spindle/micro tub.’. In the Supplementary Material, we illustrate that Vicus and the Laplacian fail to find any grouping at all, placing all of the nodes in the same dense cluster. Embeddings based on $L_{2}^{P}$ and $L_{3}^{P}$ succeed in separating different genes into different clusters, but without grouping them in a biologically meaningful way.

Capturing biological functions with Graphlet Laplacian L3G. 2D spectral embedding of the yeast GI network using the Graphlet Laplacian for G3. Points represent genes and are colour-coded with 14 core biological process annotations defined by Costanzo et al. (2016)

Fig. 3.

Capturing biological functions with Graphlet Laplacian $L_{3}^{G}$ ⁠. 2D spectral embedding of the yeast GI network using the Graphlet Laplacian for G₃. Points represent genes and are colour-coded with 14 core biological process annotations defined by Costanzo et al. (2016)

Open in new tab Download slide

Next, we aim to quantify this result by measuring the difference in functions captured by different Graphlet Laplacians. We apply Graphlet Laplacian-based spectral clustering for each graphlet on our set of human molecular networks and assess the functional enrichments in terms of the percentage of clusters enriched and the total number of annotations enriched (Fig. 4). Additionally, we create a baseline to validate the statistical significance of our enrichment results. We perform the same experiment 100 times with randomized GO-annotations. We do this by swapping the sets of gene annotations in the molecular networks such that no gene has its original set of annotations.

Fig. 4.

Cluster quality. (A) For our set of human molecular networks (colour-coded), the percentage of clusters enriched in BP annotations, with clusters obtained based on spectral clustering using different Laplacian matrices (x-axis). (B) For our set of human molecular networks, the total number of enriched GO-BP annotations in clusters obtained based on spectral clustering using different Laplacian matrices (x-axis)

Open in new tab Download slide

First, we observe that clusterings based on all Graphlet Laplacians but $L_{4}^{G}$ tend to be of similar quality as those based on the standard Laplacian or Vicus, both in terms of percentage of clusters enriched as well as total number of annotations enriched. $L_{2}^{P} and L_{3}^{P}$ capture the lowest amount of functions in PPI networks, both in terms of percentage of clusters enriched and GO-BP annotations enriched. Secondly, in our randomized experiment with randomized GO-annotations, we consistently find 0% of the clusters to be enriched, regardless of the type of Laplacian matrix used. This shows that all Laplacian-based enrichments are statistically significant. We find similar results in yeast, see Supplementary Section S5. In the Supplementary Material, we additionally observe that for each network and annotation type, there is always at least one Graphlet Laplacian that shows a larger number of the total number of enriched annotations than Vicus. We conclude that Graphlet Laplacians are at least as biologically relevant as the standard Laplacian, k-path Laplacian and Vicus.

Having established that Graphlet Laplacian-based clusters capture biological functions, we quantify the overlap in their enriched functions. In the Supplementary Section S6, we calculate the Jaccard Index between the sets of enriched functions corresponding to each Graphlet Laplacian. For GO-BP enrichments in clusterings on the human PPI and COEX networks, the average Jaccard Index is 0.22 and 0.30, respectively, meaning that different Graphlet Laplacians capture different functions. To further demonstrate this point, we present the number of GO-BP functions that are enriched only in the clustering obtained by a particular Graphlet Laplacian in Figure 5. We observe that each type of Laplacian matrix shows a tendency to capture some distinct biological functions, indicating the link between the biological function and the topology of these diverse molecular networks. The same is observed for GO-MF and GO-CC annotations, both in yeast and human networks (see the Supplementary Section S7). Combining this observation with our previous results, we can conclude that Graphlet Laplacian-based spectral clustering allows for distinguishing different sets of similarly wired network components that are not only biologically relevant, but may also capture complementary biological functions.

Fig. 5.

GO-BP uniquely enriched. The number of annotations that are uniquely enriched in clusterings based on the indicated Laplacian matrix for each biological network (colour coded)

Open in new tab Download slide

3.3 Different Graphlet Laplacians capture complementary sets of pan-cancer-related genes

Laplacian-based approaches towards predicting cancer-related genes are based on guilt by association: genes which tend to be connected to frequently mutated genes are used as cancer gene predictions. Here, we show that by considering the different shapes (i.e. graphlets) by which genes can be connected to frequently somatically mutated genes, complementary cancer mechanisms can be captured.

We do this by diffusing (see Section 2.5) the gene mutation frequency scores (see Section 2.8.4) on the human PPI network based on different Graphlet Laplacian matrices. Network diffusion is a method underlying many of the different approaches of cancer gene prioritization (Cowen et al., 2017). We prioritize genes as potential cancer-related genes according the highest diffused score first. We measure the quality of these scores using the area under the precision-recall (PR) curve and the area under the receiver operator characteristic (ROC) curve. We assume a gene is correctly classified as a cancer-related gene if it is known to be a cancer driver gene in at least one type of cancer (see Section 2.8.4). We observe that accuracy is independent of the Graphlet Laplacian used and on par with the standard Laplacian, with an average area under the PR and ROC curves of 0.21 and 0.78 respectively. In terms of accuracy Graphlet Laplacian-based scores consistently outperform those based on k-path or Vicus, which achieve an average area under the PR and ROC curve of 0.17 and 0.74 and 0.14 and 0.73 respectively (see Supplementary Figs S17 and S18).

In Figure 6, we evaluate the overlap between the top hundred highest ranking cancer-related genes per Laplacian, measured using the Jaccard Index. We observe five distinct clusters of different Laplacian matrices capturing different sets of cancer-related genes. Importantly, diffusion based on three sets of Graphlet Laplacians (⁠ $L_{{2, 5, 7}}^{G}$ ⁠, $L_{{1, 3, 4, 6}}^{G}$ and $L_{{8}}^{G}$ ⁠) provide scores dissimilar to those achieved using the standard Laplacian (the average Jaccard Index of each cluster with the standard Laplacian-based scores being 0.79, 0.87, 0.65 respectively). Conversely, the highest scoring genes based on $L_{{2, 3}}^{P}$ prove to overlap greatly with those based on the standard Laplacian (the average Jaccard Index being 0.91). Vicus-based diffusion provides cancer-related gene scores dissimilar from all other Laplacian matrices, be it at lower accuracy, as shown above. Similar results are obtained applying graphlet-generalized diffusion on the human COEX network, as shown in Supplementary Sections S8 and S9. We conclude that Graphlet Laplacian-based diffusion can be used to find complementary sets of cancer-related genes.

Fig. 6.

Overlap of highest scoring cancer-related genes. Evaluating the overlap in the sets of the top 100 highest scoring genes based on different Laplacians measured by the Jaccard Index, with scores computed by performing network diffusion of mutation frequency scores on the human PPI network

Open in new tab Download slide

4 Conclusion

In this article, we introduce Graphlet Laplacians for simple networks to simultaneously capture graphlet-based topological information and neighbourhood membership information. We demonstrate that they can straightforwardly be plugged into current Laplacian-based network analysis methods widely used in systems biology, using spectral clustering, spectral embedding and network diffusion as example applications.

Through our generalized spectral embedding and spectral clustering methods on real and model networks, we show that different Graphlet Laplacians capture sub-networks having distinct local topologies and that are enriched in different, but complementary sets of biological annotations. Finally, we show that our generalized network diffusion of pan-cancer gene mutation scores resulted in complementary sets of cancer-related genes for gene prioritization dependent on the underlying graphlet. In all the tested applications, our Graphlet Laplacians perform as good as and often better than k-path and Vicus Laplacians, while being directly interpretable.

As indicated, Graphlet Laplacians can directly replace the traditional Laplacian matrix in state-of-the-art network analysis methods, allowing them to consider alternative ways of how nodes are connected. For instance, our Graphlet Laplacians could be used to extend embedding methods such as hyper-coalescent embedding (Muscoloni et al., 2017), which may result in more relevant community detections in biological networks and in more accurate analyses of the dynamics of cells’ biological processes. Furthermore, Laplacian matrices are used in data-integration frameworks to incorporate prior knowledge (e.g. via so-called graph regularizations in matrix factorization-based data integration). Thus, using our Graphlet Laplacians in such data-integration frameworks could result in biologically more accurate patient stratifications, biomarker discoveries and drug-target interaction predictions (Gligorijević et al., 2016).

Finally, the applications of Graphlet Laplacians are not limited to biology, as the generalized network-analysis tools are applicable in any discipline that uses network representations, including physics, social sciences and economy.

Funding

This work was supported by the European Research Council (ERC) Starting Independent Researcher [Grant 278212]; the European Research Council (ERC) Consolidator [Grant 770827]; the Serbian Ministry of Education and Science [Project III44006]; the Slovenian Research Agency [project J1-8155]; the Prostate Project and UCL Computer Science departmental funds.

Conflict of Interest: none declared.

References

Ashburner

et al. (

2000

)

Gene Ontology: tool for the unification of biology

Nat. Genetics

–

Month:	Total Views:
June 2019	30
July 2019	17
August 2019	10
September 2019	21
October 2019	22
November 2019	34
December 2019	18
January 2020	73
February 2020	28
March 2020	24
April 2020	21
May 2020	23
June 2020	6
July 2020	10
August 2020	11
September 2020	10
October 2020	19
November 2020	23
December 2020	7
January 2021	2
February 2021	11
March 2021	4
April 2021	48
May 2021	26
June 2021	26
July 2021	33
August 2021	19
September 2021	32
October 2021	20
November 2021	19
December 2021	32
January 2022	34
February 2022	41
March 2022	35
April 2022	39
May 2022	37
June 2022	17
July 2022	36
August 2022	54
September 2022	39

Article Contents

Graphlet Laplacians for topology-function and topology-disease relationships

Abstract

1 Introduction

1.1 Laplacian matrix definition

1.2 Laplacian matrix eigendecomposition

1.3 Matrix alternatives to the Laplacian

1.4 Problem

1.5 Contribution

2 Materials and methods

2.1 Graphlet Laplacian definition

2.2 Graphlet Laplacian properties

2.3 Spectral embedding

2.4 Spectral clustering

2.5 Network diffusion

2.6 Topological dissimilarity of networks

2.6.1 Graphlet correlation distance

2.6.2 Non-graphlet based network descriptors

2.7 Cluster enrichment analysis

2.8 Data

2.8.1 Real biological network data collection

2.8.2 Random model network generation

2.8.3 Biological annotations

2.8.4 Cancer gene annotations

3 Results and discussion

3.1 Graphlet Laplacians capture different local topologies

3.2 Different Graphlet Laplacians capture different biological functions

3.3 Different Graphlet Laplacians capture complementary sets of pan-cancer-related genes

4 Conclusion

Funding

References

Supplementary data

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Looking for your next opportunity?

This Feature Is Available To Subscribers Only