Graph convolutional networks for epigenetic state prediction using both sequence and 3D genome data

Table 1.

Datasets summary

Cell line	Train windows	Valid windows	Test windows	TFs	HMs	DNase I
GM12878	368 082	89 911	79 731	90	11	2
K562	457 609	106 777	117 815	150	12	2

Note: GM12878 contains 103 total chromatin profile labels, and K562 contains 164 total labels. We use the same chromosomes for training, validation, and testing for both datasets.

3.2 3D Genome data

We then use 3D genome data (also known as chromatin conformation data) from Hi-C contact maps to extract interaction evidence between the DNA windows. We use 1000 bp resolution intra-chromosome Hi-C data from Rao et al. (2014). For K562, the finest available resolution is 5000 bp, so we upsample to obtain 1000 bp resolution. We convert the Hi-C contact map for each chromosome into a graph whose nodes are 1000 bp DNA windows and whose edges represent contact between two windows. Since the full Hi-C contact map for each chromosome is too dense, we rank the contacts and use the top 500 000 as edges per chromosome (each chromosome maps to one Hi-C graph). Contacts are ranked using the SQRTVC normalization from Rao et al. (2014), which normalizes for the distance between two positions so that long-range contacts are included in the top 500k. Although we chose these specific DNA window and Hi-C resolution sizes, our model is applicable to any window size and Hi-C resolution.
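The edge-selection step can be illustrated with a minimal sketch. It assumes the normalized contacts are already available as three parallel arrays (row index, column index, score); the function name `top_k_contact_edges` and the toy scores are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def top_k_contact_edges(rows, cols, scores, k):
    """Keep the k highest-scoring contacts as undirected edges (illustrative)."""
    rows = np.asarray(rows)
    cols = np.asarray(cols)
    scores = np.asarray(scores, dtype=float)
    # Drop self-contacts, which carry no between-window information.
    keep = rows != cols
    rows, cols, scores = rows[keep], cols[keep], scores[keep]
    # Indices of the k largest scores (all contacts if fewer than k exist).
    k = min(k, scores.size)
    top = np.argsort(-scores, kind="stable")[:k]
    # Store each undirected edge once as (smaller index, larger index).
    lo = np.minimum(rows[top], cols[top])
    hi = np.maximum(rows[top], cols[top])
    return sorted(set(zip(lo.tolist(), hi.tolist())))

# Toy contact list over four windows; the paper uses k = 500 000 per chromosome.
edges = top_k_contact_edges(
    rows=[0, 0, 1, 2, 3],
    cols=[1, 3, 2, 2, 0],
    scores=[5.0, 9.0, 1.0, 7.0, 4.0],
    k=2,
)
```

In this toy input the self-contact (2, 2) is discarded before ranking, so the two retained edges are the contacts scored 9.0 and 5.0.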

4 Materials and methods

Our goal is to learn a model f which takes in a DNA subsequence window $x_{i}$ and predicts the probability of a set of epigenetic labels ${\hat{y}}_{i} = f (x_{i})$, where ${\hat{y}}_{i}$ is an L-dimensional vector. Our method, ChromeGCN, uses three submodules for f: f_CNN, f_GCN and f_Pred. The first module, f_CNN, models local sequence patterns from each window using a CNN. This module takes as input $x_{i}$ and outputs a vector representation $h_{i} = f_{CNN} (x_{i})$. The second module, f_GCN, models long-range 3D genome dependencies between windows using a GCN. This module takes as input all window vectors $h_{i}$ concatenated as H, as well as their Hi-C relationships represented by adjacency matrix A, and outputs refined representations of all windows $Z = f_{GCN} (H, A)$. The $z_{i}$ of each window now encodes both window sequence patterns and the relationships between windows. We can then predict the epigenetic labels using a classifier function on each $z_{i}$: ${\hat{y}}_{i} = f_{Pred} (z_{i})$. An overview of ChromeGCN is shown in Figure 1c. The following sections explain each submodule in detail.

4.1 Modeling local sequence representations using CNNs

Following recent successes in many epigenetic label prediction tasks (Alipanahi et al., 2015; Lanchantin et al., 2017; Quang and Xie, 2016; Zhou and Troyanskaya, 2015; Zhou et al., 2018), we learn a representation of each genomic window sequence $x_{i}$ using a CNN. CNNs have become the de facto standard for encoding short DNA windows because their convolutional filters effectively capture local sequence structure. Each learned kernel, or filter, in a CNN effectively learns a DNA ‘motif’, a short contiguous sequence representative of a particular output label (Alipanahi et al., 2015). Since many epigenetic processes are hypothesized to depend on motifs (Bailey et al., 2015), CNNs are a good choice for encoding DNA.

This module, f_CNN, takes an input genomic sequence window $x_{i}$ and outputs an embedding representation vector $h_{i}$. We represent window $x_{i}$ of length τ as a one-hot encoded matrix $X_{i} \in ℝ^{τ \times n_{in}}$, where n_in is 4, representing the base-pair characters A, C, G, and T. Convolution with filters (i.e. learned motifs) of length $k < τ$ takes an input data matrix $X_{i}$ of size $τ \times n_{in}$ and outputs a matrix $X'_{i}$ of size $τ \times n_{out}$, where n_out is the chosen dimension of the learned hidden representations:

$$X'_{i_{t, u}} = σ \left( \sum_{j = 1}^{n_{in}} \sum_{z = 1}^{k} W_{u, j, z} X_{i_{t + z - 1, j}} \right), \tag{1}$$

where $W \in ℝ^{n_{out} \times n_{in} \times k}$ are the trainable weights, and σ is a function enforcing element-wise nonlinearity.

Equation (1) can then be repeated for several layers where each successive layer uses a new W and n_in is replaced with n_out from the previous layer. In our implementation, we use six layers of convolution where each successive layer learns higher-order motifs of the window. After the convolutional layers, the output of the last layer is flattened into a vector and then linearly transformed into a lower-dimensional vector of size d, which we denote $h_{i}$. Succinctly, the CNN module computes the following: $h_{i} = f_{CNN} (x_{i})$ for each window.
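A single convolution layer of the form in Equation (1) can be sketched in NumPy as follows. This is an illustration only: the helper names `one_hot` and `conv1d_layer` are hypothetical, ReLU is assumed for σ, and zero-padding at the end of the window is assumed so that the output keeps length τ. The text does not specify these choices.

```python
import numpy as np

def one_hot(seq):
    """Encode a DNA string as a (tau, 4) matrix over A, C, G, T."""
    alphabet = "ACGT"
    X = np.zeros((len(seq), 4))
    for t, ch in enumerate(seq):
        if ch in alphabet:  # any other character stays an all-zero row
            X[t, alphabet.index(ch)] = 1.0
    return X

def conv1d_layer(X, W):
    """Apply Equation (1): X is (tau, n_in), W is (n_out, n_in, k)."""
    tau, n_in = X.shape
    n_out, _, k = W.shape
    # Assumed zero-padding at the end so the output has tau rows.
    Xp = np.vstack([X, np.zeros((k - 1, n_in))])
    out = np.zeros((tau, n_out))
    for t in range(tau):
        patch = Xp[t:t + k]  # rows t .. t+k-1, shape (k, n_in)
        # Sum over input channels j and filter positions z.
        out[t] = np.einsum("ujz,zj->u", W, patch)
    return np.maximum(out, 0.0)  # ReLU as the assumed nonlinearity

# One hand-built filter of length 2 that responds to the motif "AC".
W = np.zeros((1, 4, 2))
W[0, 0, 0] = 1.0  # 'A' at the first filter position
W[0, 1, 1] = 1.0  # 'C' at the second filter position
X_out = conv1d_layer(one_hot("GACT"), W)
```

The filter activates most strongly at the position where "AC" begins, which is the sense in which a learned kernel acts as a motif detector.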

4.2 Modeling long-range 3D genome relationships using GCNs

Although CNN models work well on independent local window sequences, they disregard known long-range relationships between windows that are influential in the chromatin profile. One option would be to extend the window size. However, due to the 3D shape of DNA, long-range contact dependencies may be located millions of base-pairs apart, making current convolutional models infeasible. In this section, we introduce the f_GCN module, a method to explicitly and efficiently model such long-range interactions using GCNs.

Known long-range relationships in the 3D genome are available in the form of Hi-C contact maps. A Hi-C map can be represented as an adjacency matrix A, where the non-zero elements indicate contacts in 3D space between two DNA windows. (In our experiments, we use a different adjacency matrix A for each chromosome, i.e. intra-chromosome Hi-C maps. However, A can in general represent all possible window interactions, including inter-chromosome contacts.) In our formulation, we represent sequence windows $x_{i}$ as nodes on a graph, and A gives the edges between the windows. We can then use a modified version of GCNs (Kipf and Welling, 2016) to update each $x_{i}$ with its neighboring windows $x_{j}, j \in N (i)$, where $N (i)$ denotes the neighbors of node i obtained from A. The GCN works by revising a window’s representation $h_{i}^{t}$, where $h_{i}^{0}$ is the output of the first module, f_CNN, and t denotes the GCN layer index. Specifically, each $h_{i}^{t}$ is revised using a parameterized summation of its neighbors, $h_{j}^{t}, j \in N (i)$:

$$h_{i}^{t + 1} = σ \left( \frac{1}{| N (i) |} \sum_{j \in N (i)} h_{j}^{t} W^{t} \right), \tag{2}$$

where $σ (\cdot)$ is a non-linear activation function such as tanh and $W^{t} \in ℝ^{d \times d}$ is a linear feature transformation matrix for GCN layer t. Importantly, using the summation in Equation (2), the representation of each DNA window $x_{i}$ is updated based on the representations of its neighbors (windows that interact with $x_{i}$ in the 3D genome). We can compute the simultaneous update of all windows by stacking all $h_{i}^{t}$ into a matrix $H^{t} \in ℝ^{N \times d}$, where N is the number of DNA windows and d is the dimension of each $h_{i}$. The simultaneous update can then be written as:

$$H^{t + 1} = σ (A' H^{t} W^{t}), \tag{3}$$

where $A' = {\hat{D}}^{- \frac{1}{2}} (A + I) {\hat{D}}^{- \frac{1}{2}}$ is the normalized adjacency matrix and $\hat{D}$ is the diagonal degree matrix of $(A + I)$.
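The normalization and the layer update in Equation (3) can be written in a few lines of NumPy. The sketch below uses dense arrays and a three-window path graph purely for illustration; the function names are hypothetical, and a chromosome-scale graph would in practice call for a sparse representation.

```python
import numpy as np

def normalize_adjacency(A):
    """Return A' = D^{-1/2} (A + I) D^{-1/2}, with D the degree matrix of A + I."""
    A_hat = A + np.eye(A.shape[0])
    deg = A_hat.sum(axis=1)  # every degree is >= 1 because of the self-loop
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(A_norm, H, W):
    """One update H^{t+1} = tanh(A' H^t W^t), as in Equation (3)."""
    return np.tanh(A_norm @ H @ W)

# Toy graph: window 1 is in contact with windows 0 and 2.
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
A_norm = normalize_adjacency(A)

rng = np.random.default_rng(0)
H0 = rng.normal(size=(3, 2))  # stand-in for CNN embeddings with d = 2
W0 = rng.normal(size=(2, 2))
H1 = gcn_layer(A_norm, H0, W0)
```

Adding the identity before normalizing means each window's own representation contributes to its update alongside those of its neighbors.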

In our experiments, we use a variant of the GCN with a gating function, which allows the model to use or ignore neighboring windows when updating each $h_{i}^{t}$:

$${\tilde{H}}^{t} = \tanh (A' H^{t} W_{z}^{t}) \tag{4}$$

$$g^{t} = sigmoid ({\tilde{H}}^{t} w_{g}^{t}) \tag{5}$$

$$H^{t + 1} = diag (g^{t}) {\tilde{H}}^{t} + diag (1 - g^{t}) H^{t} \tag{6}$$

where $W_{z}^{t}$ is a linear transformation matrix, $1$ is a vector of all ones, and $w_{g}^{t} \in ℝ^{d}$ is used to compute the gating vector. The gating vector $g^{t}$ allows the model to selectively choose between the neighborhood representation of nodes, ${\tilde{H}}^{t}$, and the independent representation, $H^{t}$. Equations (3)–(6) describe one layer update of the GCN window embeddings H. In our experiments, we use a two-layer gated GCN (i.e. t = 2), and we denote the output of the final layer as Z, where vector $z_{i}$ of Z represents the output of window i. In summary, the GCN module computes the following: $Z = f_{GCN} (H, A)$.
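Equations (4)–(6) translate almost line for line into array code. The following is a minimal sketch with a toy normalized adjacency matrix; `gated_gcn_layer` is a hypothetical name, and multiplying by diag(g) is implemented as row-wise scaling.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_gcn_layer(A_norm, H, W_z, w_g):
    """One gated update following Equations (4)-(6)."""
    H_tilde = np.tanh(A_norm @ H @ W_z)  # Eq. (4): neighborhood representation
    g = sigmoid(H_tilde @ w_g)           # Eq. (5): one gate value per window
    # Eq. (6): diag(g) H_tilde + diag(1 - g) H, written as row scaling.
    return g[:, None] * H_tilde + (1.0 - g)[:, None] * H

rng = np.random.default_rng(0)
N, d = 4, 3
A_norm = np.full((N, N), 1.0 / N)  # toy stand-in for A'
H = rng.normal(size=(N, d))
H_next = gated_gcn_layer(A_norm, H, rng.normal(size=(d, d)), rng.normal(size=d))

# Sanity case: with W_z = 0 the neighborhood term vanishes and every gate is 0.5,
# so the layer returns half of the independent representation.
H_half = gated_gcn_layer(A_norm, H, np.zeros((d, d)), rng.normal(size=d))
```

Because each gate is a scalar per window, a window with a gate near zero keeps its CNN-derived representation almost unchanged, while a gate near one replaces it with the aggregated neighborhood signal.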

4.3 Predicting label probabilities for each window

After Z is computed, we use a linear classifier layer, f_Pred, to classify each $z_{i}$ into its output space (a set of epigenetic labels): ${\hat{y}}_{i} = σ (z_{i}^{⊤} W + b)$. In summary, the prediction ${\hat{y}}_{i}$ for a particular input sample $x_{i}$ can be decomposed into three steps:

${\hat{y}}_{i} = f_{Pred} (z_{i})$

$Z = f_{GCN} (H, A)$

$h_{i} = f_{CNN} (x_{i})$

4.4 Identifying and interpreting important Hi-C edges via Hi-C saliency maps

One benefit of the ChromeGCN formulation is that we use explicit, or known, long-range window dependencies for chromatin profile prediction. As a result, we propose a method to identify the important dependencies for ChromeGCN’s predictions. We call our proposed method Hi-C saliency maps. Saliency maps were introduced by Simonyan et al. (2013) to understand the importance of each pixel in an input $x_{i}$ for the prediction of the image’s true class. We instead are trying to understand the importance of each edge in $A'$ for the prediction of the chromatin profile over all windows. The $A'$ saliency map is defined as the absolute value gradient of a true class prediction ${\hat{y}}_{i}^{ℓ}$ with respect to $A'$, where $ℓ$ is a true class for sample i. The absolute value gradient is then element-wise multiplied by $A'$ to zero out ‘non-edges’. Since there are N samples, and there can be multiple true labels $ℓ$ for a particular sample ${\hat{y}}_{i}$, we define the Hi-C saliency map, $S_{Hi-C}$, as the accumulated absolute gradient over all samples and true labels w.r.t. $A'$:

$$S_{Hi-C} = \sum_{i = 1}^{N} \sum_{ℓ \in {\hat{y}}_{i}} A' \circ \left| \frac{\partial {\hat{y}}_{i}^{ℓ}}{\partial A'} \right|, \tag{7}$$

where $\circ$ is the Hadamard product. Since the saliency map of all windows N is accumulated, we normalize $S_{Hi-C}$ across each row to a 0–1 range so that we can interpret the edges at each window.
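To make Equation (7) concrete, the sketch below evaluates it for a deliberately simplified stand-in model with one linear graph layer and a single label, $\hat{y} = sigmoid (A' H w)$, for which the gradient with respect to $A'$ has a closed form. This is not the two-layer gated model used in the paper, where an automatic differentiation framework would supply the gradient. The function name and the choice to normalize each row by its maximum are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hic_saliency_one_layer(A_norm, H, w, y_true):
    """Equation (7) for the toy model y_hat = sigmoid(A' H w), one label."""
    s = H @ w                    # per-window score before aggregation, shape (N,)
    y_hat = sigmoid(A_norm @ s)  # predictions, shape (N,)
    # For this model, d y_hat_i / d A'_ij = y_hat_i (1 - y_hat_i) s_j, and
    # entries of A' outside row i do not affect y_hat_i.
    grad = (y_hat * (1.0 - y_hat))[:, None] * s[None, :]
    # Hadamard product with A' zeroes non-edges; y_true keeps only true labels.
    S = A_norm * np.abs(grad) * y_true[:, None]
    # Assumed normalization: divide each row by its maximum (all-zero rows stay zero).
    row_max = S.max(axis=1, keepdims=True)
    return np.divide(S, row_max, out=np.zeros_like(S), where=row_max > 0)

A_norm = np.array([[0.5, 0.5, 0.0],
                   [0.3, 0.4, 0.3],
                   [0.0, 0.5, 0.5]])
rng = np.random.default_rng(0)
H = rng.normal(size=(3, 4))
w = rng.normal(size=4)
y_true = np.array([1.0, 0.0, 1.0])  # windows 0 and 2 carry the label
S = hic_saliency_one_layer(A_norm, H, w, y_true)
```

Row i of the result ranks the contacts of window i by how strongly they influence its prediction; rows for windows without the label are zero because only true classes enter the sum.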

While Hi-C contact maps tell us where the contacts are, Hi-C saliency maps show us how important each contact is for the chromatin profiles. We define Equation (7) to be over all labels, but we can easily visualize the Hi-C saliency, or important edges for one particular label. We show both the full Hi-C saliency across all labels, as well as for one specific label (YY1) in the experiments.

4.5 Model variations

To test the effectiveness of the long-range dependencies, we use the following ChromeGCN variations. Each variation uses the same model, with different edge dependencies in the form of A.

ChromeGCN_const: Instead of using Hi-C edges, we use a constant set of nearby neighbors according to the 1D sequential DNA representation. We define each window $x_{i}$’s neighbors to be the windows surrounding $x_{i}$ (7 on each side: $x_{i - 7}, \dots, x_{i + 7}$), which we denote as $A_{const}$, so that $Z = f_{GCN} (H, A_{const})$. This variant allows us to see whether the very long-range interactions from the normalized Hi-C maps are useful.

ChromeGCN$_{Hi-C}$: This variation uses the original Hi-C adjacency matrix, $A_{Hi-C}$, so that $Z = f_{GCN} (H, A_{Hi-C})$.

Raw Hi-C contacts are dominated by close neighboring contacts. However, by using the top 500k contacts after the SQRTVC normalization for the Hi-C graph, we reduce some of this locality bias. As a result, many of the edges connect windows that are far apart in 1D space. This allows us to decouple the effects of local neighboring contacts (constant) and long-range (normalized Hi-C) contacts.

ChromeGCN$_{const + Hi-C}$: Last, we use a combination of the constant neighborhood around each window and the Hi-C adjacency matrix, which integrates close and far windows for each window. This variation uses the following function: $Z = f_{GCN} (H, A_{const + Hi-C})$.
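The three adjacency variants can be sketched as below. The sketch assumes binary (unweighted) adjacency matrices and takes the combined graph to be the element-wise union of the two edge sets; both are assumptions, as are the function names. Dense arrays are used only to keep the example short.

```python
import numpy as np

def constant_adjacency(n_windows, radius=7):
    """A_const: link each window to the `radius` windows on either side."""
    idx = np.arange(n_windows)
    dist = np.abs(idx[:, None] - idx[None, :])
    return ((dist > 0) & (dist <= radius)).astype(float)

def combine_adjacency(A_const, A_hic):
    """A_const+Hi-C as the union of both edge sets (assumed binary)."""
    return np.maximum(A_const, A_hic)

n = 20
A_const = constant_adjacency(n)

# Toy Hi-C graph with a single long-range contact between windows 0 and 19.
A_hic = np.zeros((n, n))
A_hic[0, 19] = A_hic[19, 0] = 1.0

A_both = combine_adjacency(A_const, A_hic)
```

An interior window has 14 constant neighbors, while windows near the chromosome ends have fewer because the neighborhood is truncated at the boundary.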

4.6 Model details and training

To circumvent the GPU memory constraints of training end-to-end, we pre-train the f_CNN model by classifying each $h_{i}$ with the classification function ${\hat{y}}_{i} = f_{Pred} (h_{i})$. Once the pre-training converges on the training set, we use the resulting embeddings $h_{i}$ of each sample as fixed inputs to f_GCN. Although we pre-train f_CNN, ChromeGCN is still end-to-end differentiable, making it possible to use sequence visualization methods such as DeepLIFT (Shrikumar et al., 2017) for a particular window.

For all model predictions, we run both the forward sequence and its reverse complement through the model and average the two outputs. All DNA window inputs are encoded using a lookup table that maps each character A, C, G, T and N (unknown) to a d-dimensional vector. The output of the encoding is a $d \times τ$ matrix, where τ denotes sequence length (τ = 2000 in our experiments).
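Strand averaging is independent of the model itself and can be sketched with any function that maps a sequence to label probabilities. In the example below, `predict_both_strands` and `toy_predict` are hypothetical stand-ins used only to show the mechanics.

```python
import numpy as np

# Complement table; N (unknown) maps to itself.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]

def predict_both_strands(predict_fn, seq):
    """Average the predictions for a window and its reverse complement."""
    return 0.5 * (predict_fn(seq) + predict_fn(reverse_complement(seq)))

def toy_predict(seq):
    # Stand-in "model": a single score equal to the fraction of A bases.
    return np.array([seq.count("A") / len(seq)])

p = predict_both_strands(toy_predict, "AACG")
```

Here the forward strand "AACG" scores 0.5 and its reverse complement "CGTT" scores 0, so the averaged prediction is 0.25.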

All of our models are trained using stochastic gradient descent with a momentum of 0.9 and a learning rate of 0.25. The CNN model is trained using a batch size of 64, and the GCN and RNN models are trained using an entire chromosome as a batch (since each models the between-window dependencies of an entire chromosome at once). The CNN model projects each window to a vector of dimension 128. The GCN uses two layers, each with feature dimension 128.

ChromeGCN predicts the probabilities of all labels for each window: ${\hat{y}}_{i} \in ℝ^{L}$, where L is the total number of labels. For our loss function, we use the mean binary cross-entropy across all N samples and L labels:

$$L (y, \hat{y}) = \frac{1}{N} \sum_{i = 1}^{N} \frac{1}{L} \sum_{l = 1}^{L} - \left( y_{i}^{l} \log ({\hat{y}}_{i}^{l}) + (1 - y_{i}^{l}) \log (1 - {\hat{y}}_{i}^{l}) \right) \tag{8}$$
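The loss in Equation (8) is the binary cross-entropy averaged over every sample and label pair. A minimal NumPy sketch follows; the clipping constant `eps` is an added numerical safeguard and is not part of the equation.

```python
import numpy as np

def mean_bce(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over an (N, L) label matrix, as in Equation (8)."""
    # Clip probabilities away from 0 and 1 so the logarithms stay finite.
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    per_entry = -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))
    # Averaging over all N * L entries equals (1/N) sum_i (1/L) sum_l.
    return per_entry.mean()

y_true = np.array([[1.0, 0.0],
                   [0.0, 1.0]])
y_prob = np.full((2, 2), 0.5)
loss = mean_bce(y_true, y_prob)
```

With every probability at 0.5 the loss equals ln 2, the value expected from an uninformative predictor.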

5 Experiments and results

5.1 Baselines

We compare against the state-of-the-art chromatin profile prediction model from Zhou et al. (2018), referred to as the CNN baseline, as well as the recurrent model from Quang and Xie (2016). Since our model outputs labels for TFBS, HMs and accessibility, motif-based methods (Bailey et al., 2009) are not applicable. Since we have 368 082 training samples, kernel-based methods such as Ghandi et al. (2014) are not applicable either. Zhou and Troyanskaya (2015) compared their CNN to a modified version of Ghandi et al. (2014), which only used a small number of training samples, and the CNN model was significantly better. Our CNN baseline (Zhou et al., 2018) is an improved version of the model from Zhou and Troyanskaya (2015). Furthermore, the focus of our study is to show that state-of-the-art deep learning models are missing important long-range dependencies in the genome.

CNN: To illustrate the importance of the GCN, we compare against the outputs from the f_CNN module: ${\hat{y}}_{i} = f_{Pred} (h_{i})$. This is the six-layer CNN model from Zhou et al. (2018), in which we modify the last layer in order to extract a d-dimensional feature vector output. This is the same CNN that is pre-trained for ChromeGCN to produce each $h_{i}$.

DanQ: This model uses an RNN on top of CNN outputs within a window. It still uses local sequence window inputs, but models relationships between sequence patterns via an LSTM (Quang and Xie, 2016).

ChromeRNN: As a baseline to compare against using GCNs for long-range dependency modeling, we construct an RNN model on the window embeddings $h_{1}, h_{2}, \dots, h_{N}$. After pre-training the CNN module f_CNN, the RNN model takes in all window embeddings at once and models the sequential dependencies among windows: $(z_{1}, z_{2}, z_{3}, \dots, z_{N}) = f_{RNN} (h_{1}, h_{2}, h_{3}, \dots, h_{N})$. As with ChromeGCN, the RNN is shared across chromosomes, but does not cross chromosomes. In other words, the embeddings are updated one chromosome at a time, $f_{RNN} (h_{1}, h_{2}, \dots, h_{C})$, where C is the total number of windows for a chromosome. We note that this is different from the DanQ baseline (Quang and Xie, 2016), which uses an RNN within windows. ChromeRNN instead models dependencies between windows. In our experiments, we use an LSTM (Quang and Xie, 2016) with the same number of layers and hidden units as the GCN.

5.2 Prediction performance

To evaluate the methods, we use area under the ROC curve (AUROC), area under the precision-recall curve (AUPR), and the mean recall at a 50% false discovery rate (FDR) cutoff. Table 2 shows the mean metric results across all epigenetic labels for each cell line. Modeling long-range dependencies results in significant improvements over the baseline CNN model, which does not account for such long-range interactions. For instance, with respect to AUPR for GM12878, ChromeRNN improves upon the CNN from 0.350 to 0.384, and ChromeGCN outperforms ChromeRNN to achieve a mean AUPR of 0.395. Also, we see that ChromeRNN outperforms DanQ, indicating that using an RNN to model between-window dependencies is more important than modeling within-window features. Moreover, we can see that the ChromeGCN$_{Hi-C}$ models outperform ChromeRNN, indicating that not only the closely neighboring windows in 1D space contribute to the improvements, but also the close neighbors in 3D space, as indicated by the Hi-C maps used. ChromeGCN outperforms the baselines on the TF and DNase I labels, and ChromeRNN outperforms all other methods on the HM labels. This indicates that non-local modeling is particularly important for TF binding and accessibility. The performance results for each label type (TF, HM and DNase I) are shown in Tables 3 and 4. We also provide detailed plots of ROC and precision-recall curves in Supplementary Appendix Figures SA5–8.

Table 2.

Performance results

	GM12878			K562
	Mean AUROC	Mean AUPR	Mean recall at 50% FDR	Mean AUROC	Mean AUPR	Mean recall at 50% FDR
CNN (Zhou et al., 2018)	0.895	0.350	0.293	0.894	0.325	0.265
DanQ (Quang and Xie, 2016)	0.886	0.348	0.290	0.900	0.343	0.290
ChromeRNN	0.906	0.384	0.342	0.910	0.365	0.327
ChromeGCN_const	0.904	0.377	0.331	0.904	0.358	0.321
ChromeGCN$_{Hi-C}$	0.904	0.385	0.341	0.903	0.358	0.319
ChromeGCN$_{const + Hi-C}$	0.909	0.395	0.356	0.912	0.372	0.338

Note: For both cell lines, GM12878 and K562, we show the average across all labels for three different metrics. Our method, which uses a GCN to model long-range dependencies, improves performance over the baseline CNN model, which assumes all DNA segments are independent. Detailed results of all cell lines and label types are shown in the Supplementary Appendix. Best performing methods are shown in bold.

Table 3.

Performance on GM12878 for each label category

	TFs			HMs			DNase I
	Mean AUROC	Mean AUPR	Mean recall at 50% FDR	Mean AUROC	Mean AUPR	Mean recall at 50% FDR	Mean AUROC	Mean AUPR	Mean recall at 50% FDR
CNN (Zhou et al., 2018)	0.909	0.328	0.262	0.796	0.478	0.469	0.801	0.657	0.750
DanQ (Quang and Xie, 2016)	0.899	0.325	0.260	0.794	0.479	0.463	0.792	0.650	0.723
ChromeRNN	0.914	0.354	0.299	0.859	0.574	0.610	0.821	0.689	0.787
ChromeGCN_const	0.913	0.348	0.291	0.849	0.554	0.58	0.815	0.68	0.777
ChromeGCN$_{Hi-C}$	0.916	0.362	0.309	0.819	0.516	0.528	0.814	0.683	0.774
ChromeGCN$_{const + Hi-C}$	0.918	0.369	0.319	0.845	0.552	0.584	0.820	0.692	0.783

Note: Best performing methods are shown in bold.

Table 4.

Performance on K562 for each label category

	TFs			HMs			DNase I
	Mean AUROC	Mean AUPR	Mean recall at 50% FDR	Mean AUROC	Mean AUPR	Mean recall at 50% FDR	Mean AUROC	Mean AUPR	Mean recall at 50% FDR
CNN (Zhou et al., 2018)	0.905	0.314	0.256	0.78	0.410	0.317	0.792	0.625	0.683
DanQ (Quang and Xie, 2016)	0.909	0.331	0.279	0.796	0.438	0.365	0.797	0.638	0.693
ChromeRNN	0.917	0.349	0.305	0.848	0.520	0.528	0.817	0.664	0.742
ChromeGCN_const	0.911	0.342	0.301	0.84	0.507	0.507	0.810	0.657	0.729
ChromeGCN$_{Hi-C}$	0.912	0.346	0.306	0.807	0.454	0.412	0.812	0.657	0.730
ChromeGCN$_{const + Hi-C}$	0.919	0.358	0.320	0.837	0.495	0.484	0.821	0.67	0.752

Note: Best performing methods are shown in bold.

Furthermore, since ChromeGCN models the window relationships explicitly and not recurrently, we obtain a significant speedup at test time over the ChromeRNN baseline. ChromeGCN achieves a 6.3× speedup on the three GM12878 test chromosomes and 6.8× speedup at test time on the three K562 test chromosomes.

5.3 Analysis of using Hi-C data

Figure 2 shows a detailed comparison of ChromeGCN$_{Hi-C}$ versus the baseline CNN model across three different metrics for both GM12878 and K562. Each point represents a label, and the y-axis shows the absolute improvement of the ChromeGCN$_{Hi-C}$ model over the CNN. The labels are sorted on the x-axis by the average degree of the label’s positive samples (i.e. windows where the label is positive) on the Hi-C map. We can see that for all three metrics, the improvement of ChromeGCN over the CNN increases as the average degree of the labels increases. This indicates that ChromeGCN is important for labels that have many neighbors in the Hi-C graph (i.e. those that are frequently in contact with other segments in 3D space). Two of the TFs that obtain the highest performance increase (in the top 5) from using ChromeGCN over the CNN, CEBPB and STAT3, are validated by Ma et al. (2018), who show that these two TFs commonly co-occur with other TFs in 3D space when binding.
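The quantity on the x-axis, the average Hi-C degree of a label's positive windows, can be computed directly from the adjacency matrix and the label matrix. The sketch below uses a four-window path graph and two made-up labels; the function name and the toy data are illustrative only.

```python
import numpy as np

def average_degree_per_label(A, Y):
    """Mean graph degree over the positive windows of each label.

    A is an (N, N) adjacency matrix and Y an (N, L) binary label matrix.
    """
    degree = (A != 0).sum(axis=1)  # number of contacts per window, shape (N,)
    n_pos = Y.sum(axis=0)          # positive windows per label, shape (L,)
    total = degree @ Y             # summed degree over positives, shape (L,)
    # Labels with no positive window get an average degree of 0.
    return np.divide(total, n_pos, out=np.zeros(Y.shape[1]), where=n_pos > 0)

# Path graph over four windows: degrees are 1, 2, 2, 1.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
# Label 0 is positive at windows 1 and 2; label 1 only at window 0.
Y = np.array([[0, 1],
              [1, 0],
              [1, 0],
              [0, 0]], dtype=float)
avg_degree = average_degree_per_label(A, Y)
```

Sorting the labels by this value gives the x-axis ordering described above.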

Fig. 2.

Comparison of our ChromeGCN$_{Hi-C}$ method versus the baseline CNN (Zhou et al., 2018) for three metrics. Each point represents one chromatin profile label. The labels are sorted on the x-axis by the average degree of their positive windows. The y-axis indicates the absolute increase of ChromeGCN$_{Hi-C}$ over the CNN model. As the average degree increases, the improvement of the ChromeGCN$_{Hi-C}$ model over the CNN increases. Green points indicate that ChromeGCN$_{Hi-C}$ performed better and red points indicate that the CNN performed better. The blue line shows the linear trend line. ChromeGCN$_{Hi-C}$ is significantly better, as demonstrated by the P-values from a pairwise t-test.

The P-values shown are computed by a pairwise t-test across all labels. The ChromeGCN$_{Hi-C}$ model significantly outperforms the CNN model in all three metrics. Importantly, our results indicate that by using the long-range interactions given by Hi-C data, we can obtain improvements in modeling the chromatin state labeling, resulting in better classification accuracy.

5.4 Long-range interaction visualization

A significant merit of ChromeGCN is that by using known 3D genome relationships, we can find and visualize the critical relationships for chromatin profile prediction. To understand how important the Hi-C edges are for the predictions of ChromeGCN, we visualize the saliency map of $A'$, as explained in Section 4.4.

Figure 3 shows the Hi-C saliency map for Chromosome 8 in GM12878. Figure 3 (left) shows all 500k Hi-C contacts used for Chromosome 8. Windows are represented as points along the circle, with a total of 23 600 windows. Lines between the windows represent Hi-C edges, and the darkness of a line represents the saliency, or importance, of that edge for chromatin state prediction across all windows in Chromosome 8. Figure 3 (right) shows the saliency map for 250 windows (a total of 250 kbp of input) in Chromosome 8. Cell (i, j) tells us the importance of window column j for the prediction of the labels of window row i (Fig. 4).

Fig. 3.

Hi-C saliency map visualization. Left: Saliency map for all 500k edges in $A_{Hi-C}$ for GM12878 Chromosome 8 (total of 23 600 windows). The darker the line, the more important that edge was for predicting the correct chromatin profile, indicating that the Hi-C data were used by the GNN for that particular interaction. Right: Fine-grained analysis of the Chromosome 8 saliency map. This panel shows the normalized saliency map values for 250 windows (a total of 250 kbp of input) in Chromosome 8.

Although Figure 3 shows the Hi-C saliency map for all chromatin profile labels, we can also visualize the Hi-C saliency map for individual labels. The inner loop of Equation (7) changes to only use the label of interest. In Supplementary Appendix Figure SA9, we show the Hi-C saliency map for the YY1 TF on GM12878 Chromosome 8. This gives us insight into the important 3D contacts for YY1 binding.

6 Conclusion

In this work, we present ChromeGCN, a novel framework that combines both local sequence and long-range 3D genome data (via Hi-C data) for chromatin profile prediction. We show experimentally that ChromeGCN outperforms previous state-of-the-art methods that only use local sequence data. Additionally, we show that we can identify and visualize the important 3D genome dependencies using the proposed Hi-C saliency maps. We plan to further investigate the value of Hi-C saliency maps in future work.

In this work, we demonstrate ChromeGCN on two well-known cell types. However, it is important to be able to port our method to other cell types. Fortunately, our method is broadly generalizable to any cell type for which epigenetic labels and Hi-C data are available. Additionally, users can apply it to any cell type that has Hi-C-like data (e.g. HiChIP or ChIA-PET). This is an important distinction of our work: all the user needs is data that denote where the long-range relationships are in the genome. Furthermore, if a cell type does not have any training labels, we can use the genomic features and graph of a related cell type and perform transfer learning to predict on the new cell type. We plan to extend our work into this domain.

Although we demonstrate the importance of ChromeGCN on the task of chromatin profile prediction, ChromeGCN is a generic model for incorporating 3D genome structure into any genome sequence prediction task. We hope that this work encourages researchers to use known long-range relationships from 3D genome data when constructing machine learning models. We plan to release our data and code for greater visibility. ChromeGCN introduces an effective and efficient framework to model such relationships for better chromatin modeling, as well as an easy way to interpret important relationships.

Financial Support: This work was partly supported by the National Science Foundation under NSF CAREER award No. 1453580. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect those of the National Science Foundation.

Conflict of Interest: none declared.

References

Alipanahi et al. (2015) Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol., 831–838.

et al. (2014) Statistical confidence estimation for Hi-C data reveals regulatory chromatin contacts. Genome Res., 999–1011.

Bailey S.D. et al. (2015) Znf143 provides sequence specificity to secure chromatin interactions at gene promoters. Nat. Commun., 6186.

Bailey T.L. et al. (2009) Meme suite: tools for motif discovery and searching. Nucleic Acids Res., W202–W208.

Brackley C.A. et al. (2012) Facilitated diffusion on mobile DNA: configurational traps and sequence heterogeneity. Phys. Rev. Lett., 109, 168103.

ENCODE Project Consortium. (2004) The ENCODE (encyclopedia of DNA elements) project. Science, 306, 636–640.

Dai et al. (2016) Discriminative embeddings of latent variable models for structured data. In International Conference on Machine Learning, pp. 2702–2711.

Dai et al. (2019) Transformer-xl: attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.

Devlin et al. (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Ghandi et al. (2014) Enhanced regulatory sequence prediction using gapped k-mer features. PLoS Comput. Biol., e1003711.

Gilmer et al. (2017) Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212.

Hakim and Misteli (2012) Snapshot: chromosome conformation capture. Cell, 148, 1068.e1–2.

Hamilton W.L. et al. (2017) Representation learning on graphs: methods and applications. arXiv preprint arXiv:1709.05584.

Hassanzadeh H.R. and Wang M.D. (2016) Deeperbind: enhancing prediction of sequence specificities of DNA binding proteins. In 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 178–183. IEEE.

Hochreiter and Schmidhuber (1997) Long short-term memory. Neural Comput., 1735–1780.

Kelley D.R. et al. (2018) Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Res., 739–750.

Kipf T.N. and Welling (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.

Kundaje et al.; Roadmap Epigenomics Consortium. (2015) Integrative analysis of 111 reference human epigenomes. Nature, 518, 317–330.

Lanchantin et al. (2016) Deep motif: visualizing genomic sequence classifications. arXiv preprint arXiv:1605.01133.

Lanchantin et al. (2017) Deep motif dashboard: visualizing and understanding genomic sequences using deep neural networks. In Pacific Symposium on Biocomputing 2017, pp. 254–265. World Scientific.

et al. (2018) Canonical and single-cell Hi-C reveal distinct chromatin interaction sub-networks of mammalian transcription factors. Genome Biol., 174.

Mifsud et al. (2015) Mapping long-range promoter contacts in human cells with high-resolution capture Hi-C. Nat. Genet., 598–606.

Quang and Xie (2016) DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Res., e107.

Rao S.S. et al. (2014) A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell, 159, 1665–1680.

Scarselli et al. (2008) The graph neural network model. IEEE Trans. Neural Netw.

Schreiber et al. (2018) Nucleotide sequence and DNase I sensitivity are predictive of 3D chromatin architecture. bioRxiv, 103614.