Abstract

Motivation

Predictive models of DNA chromatin profile (i.e. epigenetic state), such as transcription factor binding, are essential for understanding regulatory processes and developing gene therapies. It is known that the 3D genome, or spatial structure of DNA, strongly influences the chromatin profile. Deep neural networks have achieved state-of-the-art performance on chromatin profile prediction by using short windows of DNA sequences independently. These methods, however, ignore long-range dependencies when predicting chromatin profiles because modeling the 3D genome is challenging.

Results

In this work, we introduce ChromeGCN, a graph convolutional network for chromatin profile prediction by fusing both local sequence and long-range 3D genome information. By incorporating the 3D genome, we relax the independent and identically distributed assumption of local windows for a better representation of DNA. ChromeGCN explicitly incorporates known long-range interactions into the modeling, allowing us to identify and interpret those important long-range dependencies in influencing chromatin profiles. We show experimentally that by fusing sequential and 3D genome data using ChromeGCN, we get a significant improvement over the state-of-the-art deep learning methods as indicated by three metrics. Importantly, we show that ChromeGCN is particularly useful for identifying epigenetic effects in those DNA windows that have a high degree of interactions with other DNA windows.

Availability and implementation

https://github.com/QData/ChromeGCN.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

The human genome includes over 3 billion base pairs (bp), each being described as A, C, G or T. Chromatin (DNA and its organizing proteins) is responsible for many regulatory processes such as controlling the expression of a certain gene. Active chromatin elements, such as transcription factor (TF) proteins binding at a particular location in DNA or histone modifications (HMs), constitute that location’s ‘chromatin profile’ (we use the terms ‘epigenetic state’ and ‘chromatin profile’ interchangeably). Understanding the chromatin profile of a local region of DNA is a step toward understanding how that region influences relevant gene regulation, since the chromatin profile is a direct factor in regulating expression. Since biological experiments are time-consuming and expensive, computational methods that can accurately simulate and predict the chromatin profile are crucial. Modeling the chromatin profile of each bp has been a long-standing challenge due to the sheer length and complexity of genomic DNA. Deep neural networks have shown state-of-the-art performance in extracting useful features from segments of DNA to predict the chromatin profile (e.g. whether a TF protein binds to that location or not; Alipanahi et al., 2015; Zhou and Troyanskaya, 2015). However, these methods heuristically divide DNA into local ‘windows’ (e.g. about 1000 bp long) and predict the states of each window independently, disregarding the effects of distant windows. Due to the spatial 3D organization of chromatin in a cell, distal DNA elements (potentially over 1 million bp away) have been shown to affect chromatin profiles (Ma et al., 2018; Mifsud et al., 2015; Rao et al., 2014).

Figure 1a illustrates the importance of using both sequence and 3D genome (i.e. chromosome conformation) data. This figure shows long-range dependencies between chromatin windows, where the colored shapes represent multiple TF proteins. TFs typically bind to specific sequence patterns in DNA known as motifs (Stormo, 2000). However, a TF may also bind to a DNA window due to the presence of other TFs nearby in the 3D space because they form a protein complex (Brackley et al., 2012; Ma et al., 2018). Such a case will result in motifs far away in the 1D genome coordinate space, but nearby in the 3D space. The corresponding dependencies between the chromatin windows are illustrated by the triangle, square and circle TFs. The two segments interacting in the middle of the diagram are very far in the 1D sequence representation (represented by the gray line), but very close in the 3D representation. Similarly, a TF may not bind to a segment with its motif present due to another interfering TF nearby in the 3D space. These types of interactions are lost in data-driven prediction models that only consider local DNA segments independently.

Fig. 1. (a) 3D Genome. The 3D shape of chromatin can lead DNA ‘windows’ (shown in gray boxes) far apart in the 1D genome space to be spatially close. These spatial interactions can influence chromatin profiles, such as TFs binding (as shown by the colored shapes). In most cases, the DNA sequence determines the chromatin profile. However, it can also be influenced by interactions, such as the formation of TF complexes shown in the middle. (b) Graph Representation of DNA. Using Hi-C data, we can represent subfigure (a) using a graph, where the lines between windows are the edges indicated by Hi-C data. (c) ChromeGCN. By using a GCN on top of convolutional outputs, the model considers the known dependencies between long-range DNA windows. The lines between windows correspond to edges in Hi-C data.


However, modeling these known long-range interactions between windows is difficult. Local sequence window-based prediction methods assume data samples are independent according to the commonly used independent and identically distributed (IID) assumption. Yet, the long-range dependencies existing in DNA make windows not IID.

Modeling long-range or non-local interactions has a long history in many areas such as natural language processing, where the label of one particular segment depends on the label of a segment far away. Recurrent neural networks (RNNs) such as long short-term memory networks (LSTMs) (Hochreiter and Schmidhuber, 1997) have been used to model non-local dependencies, where the model relies on the hidden state to remember the state of a token (e.g. a word) very far away. However, LSTMs are known to remember only a small number of tokens back, leading to rather ‘local’ relationship learning (Hochreiter and Schmidhuber, 1997; Vaswani et al., 2017). This drawback has led to an increasing interest in the explicit modeling of non-local dependencies via pairwise interaction models such as transformers (Dai et al., 2019; Devlin et al., 2018; Vaswani et al., 2017).

In a related line of work, graph convolutional networks (GCNs) have been proposed to model the pairwise dependencies of nodes in graph or 3D structured data such as citation networks and point clouds (Kipf and Welling, 2016; Scarselli et al., 2008; Wang et al., 2018; Zitnik et al., 2018). This direct modeling of edges allows the network to learn non-local relationships. Although typically viewed in its 1D sequential form, DNA can be represented as 3D genome structured data via Hi-C maps, as shown in Figure 1b. Hi-C maps are matrices that give the number of contacts between two segments of DNA, and normalized Hi-C maps tell us the likelihood of two locations interacting (Ay et al., 2014). Using Hi-C data, segments of DNA can be represented as nodes on a graph, with edges representing interactions between segments. Such interactions can be crucial in regulatory processes such as gene transcription (Rao et al., 2014). That is, Hi-C contacts are a direct reflection of how distant epigenetic elements interact. We hypothesize that accounting for such interactions will lead to improved chromatin profile prediction accuracy.

In this work, we propose ChromeGCN, a novel method that uses a fusion of both sequence and 3D genome data (in the form of Hi-C maps) to predict the chromatin profile of DNA segments. To the best of our knowledge, ChromeGCN is the first deep learning framework that successfully combines sequence and 3D genome data to model both local sequence features and long-range dependencies for chromatin profile prediction. ChromeGCN works by first representing each DNA window as a d-dimensional vector, computed by a convolutional neural network (CNN) on the local window sequence. We then revise each window vector using a GCN over all window relationships from Hi-C 3D genome data. We test ChromeGCN on datasets from two cell lines, where we compare against the previous state-of-the-art chromatin profile prediction methods. We demonstrate that ChromeGCN outperforms previous methods, especially for profile labels that are highly correlated with long-range chromatin interactions.

An important aspect of ChromeGCN is that it allows us to better understand Hi-C data in the context of the chromatin profile. Hi-C maps tell us where the contacts are in the genome, but they do not tell us which contacts are important for epigenetic labels. Using ChromeGCN, we propose Hi-C saliency maps to understand which Hi-C contacts are most important for chromatin profile labeling. Since ChromeGCN uses explicit long-range relationships from Hi-C data (as opposed to implicit long-range relationships using an RNN), we can easily identify the important relationships, which gives greater interpretability.

The main contributions of this article are:

  1. We propose ChromeGCN, a novel framework that incorporates both local sequence and long-range 3D genome data for chromatin profile prediction.

  2. We experimentally validate the importance of ChromeGCN on two cell lines from ENCODE, showing that modeling long-range genome dependencies is critically important.

  3. We introduce Hi-C saliency maps, a method to identify the important long-range interactions for chromatin profile prediction from Hi-C data.

2 Background and related work

2.1 Predicting chromatin profile using machine learning

Computational models for accurately predicting chromatin profile labels from DNA sequence have gained popularity in recent years due to the urgency of the task for many applications, for instance, predicting how epigenetic effects vary when variants in DNA occur. The importance of computational modeling arises from its low cost and high speed in comparison to biological lab experiments.

One class of methods for state prediction uses generative models in the form of position weight matrices (PWMs; Stormo, 2000). These methods construct motifs, or short contiguous sequences (often 8–20 bp in length), which are representative of a particular chromatin profile label such as a TF binding. A new sequence can then be classified according to how well it matches the motif. A significant drawback of using predefined motif features is that it is difficult to find the correct motifs for predicting unseen sequences (Alipanahi et al., 2015). Another class of methods uses string kernels (SKs; Ghandi et al., 2014; Singh et al., 2017), where a kernel function is built to capture the similarity between DNA segments according to substring patterns. However, these methods also rely on predefined feature engineering. Moreover, they do not scale to a large number of sequences (Zhou and Troyanskaya, 2015).

To overcome the issues of PWM and SK methods, researchers turned to automatic feature extraction using deep neural networks, which have outperformed both generative PWM and SK methods (Alipanahi et al., 2015; Zhou and Troyanskaya, 2015). CNNs were the first deep learning method to outperform previous methods. CNNs have been used extensively to learn features of DNA for sequence-based prediction (Alipanahi et al., 2015; Hassanzadeh and Wang, 2016; Lanchantin et al., 2016; Zhou and Troyanskaya, 2015; Zhou et al., 2018). The benefit of convolutional models is that they have an inductive bias for modeling translation-invariant features in DNA sequences. This allows CNNs to effectively learn the correct ‘motifs’ or kernels for chromatin labeling. There have since been several revisions to the original CNN models for marginally better feature extraction, such as adding a recurrent network on top of the CNN motif features (Lanchantin et al., 2017; Quang and Xie, 2016).

However, current state-of-the-art models only learn features from the sequences of individual local windows and not between windows (i.e. longer-range interactions). Since DNA interacts with itself in the form of long-range 3D contacts, the chromatin profile of a window can be affected by another distant window. Kelley et al. (2018) use longer-range dependencies (32 kbp), but the dependencies are modeled implicitly using dilated convolution across 128 bp windows. Accordingly, methods that account for explicit long-range 3D chromatin contacts are needed to model the true interactions in DNA.

2.2 DNA interactions via Hi-C maps

Hi-C experiments, and 3C experiments in general, are biological methods used to analyze the spatial organization of chromatin in a cell. These methods quantify the number of interactions between genomic loci. Two loci that are close in 3D space due to chromatin folding may be separated by up to millions of nucleotides in the sequential genome. Such interactions may result from biological functions, such as protein interactions (Hakim and Misteli, 2012). Supplementary Appendix Figure SA10 shows an example Hi-C map from the GM12878 cell line. Darker lines indicate more DNA-DNA interactions.

Since the first Hi-C maps were generated, many works have been introduced to analyze the maps. Ma et al. (2018) investigated the spatial relationships of co-localized TF binding sites (TFBS) within the 3D genome. They show that for certain TFs, there is a positive correlation of occupied binding sites with their spatial proximity in the 3D space. This is especially apparent for weak TFBS and at enhancer regions. Bailey et al. (2015) identified that the ZNF143 TF motif in the promoter regions provides sequence specificity for long-range promoter–enhancer interactions. Wong (2017) identified coupling DNA motif pairs on long-range chromatin interactions. Schreiber et al. (2018) use CNNs to predict Hi-C interactions from sequence inputs. None of the previous methods, however, use known Hi-C data to learn better feature representations of genomic sequences for chromatin profile prediction.

2.3 Graph convolutional networks

GCNs were recently introduced to model non-local or non-smooth data (Dai et al., 2016; Gilmer et al., 2017; Hamilton et al., 2017; Kipf and Welling, 2016; Scarselli et al., 2008; Veličković et al., 2017). For the task of node classification, GCNs can learn useful node representations which encode both node-level features and relationships between connected nodes. Essentially, GCNs learn node representations by encoding local graph structures and node attributes, and the whole framework can be trained in an end-to-end fashion. Because of their effectiveness in learning graph representations, they achieve state-of-the-art results in node classification. The main assumption is that the input samples (in our case, individual DNA windows) are not independent. By modeling the graph dependency between samples, we can obtain a better representation of each of the samples. Non-local neural networks (Wang et al., 2018), which were designed to model long-range interactions in video frames, are an instantiation of graph convolution.

3 Problem formulation and data processing

The objective of chromatin profile prediction (i.e. chromatin effect prediction) is to tag segments of DNA with the probability that a certain chromatin effect (aka chromatin profile label) is present. In our formulation, we define chromatin profile labels to include TF binding, HMs and accessibility (DNase I). This is known as a multi-label classification task, where multiple labels can be positive at once (different from multi-class tasks where only one label can be positive). Formally, given an input DNA window $x_i$ (a segment of length $\tau$), we want to predict $y_i^l \in \{0, 1\}$ for each label $l$, where $l$ ranges from 1 to $L$.

3.1 Sequence data

We derive epigenetic labels using ChIP-seq data from ENCODE (ENCODE Project Consortium, 2004). We use the cell lines GM12878 and K562, two of the most widely used cell lines from ENCODE and Roadmap (ENCODE Project Consortium, 2004; Kundaje et al., 2015). For each cell line, we use all windows that have at least one positive epigenetic ChIP-seq peak. We consider any peak from ENCODE to be a positive peak. We follow a setup similar to that of Zhou and Troyanskaya (2015), where we bin the DNA into 1000 bp windows. If any ChIP-seq peak overlaps with at least 100 bp of a particular window, we consider that a positive window for that chromatin label. We then extract the 2000 bp sequence surrounding the center of each window as the input features, as done in Zhou et al. (2018), since the motif for a particular signal may not be contained fully in the 1000 bp window. Although we use the 2000 bp sequence, we consider each window to be the original non-overlapping 1000 bp for notation purposes. An illustration of how sequences are extracted is shown in Supplementary Appendix Figure SA4a. Following Zhou and Troyanskaya (2015), we use chromosome 8 for testing and also add chromosomes 1 and 21. Chromosomes 3, 12 and 17 are used for validation, and all other chromosomes (excluding X and Y) are used for training. The datasets are summarized in Table 1.
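The labeling rule above can be illustrated with a minimal sketch. The function names and the half-open `(start, end)` peak format are assumptions made for illustration, as is the choice to clip the 2000 bp context at chromosome ends; the 1000 bp bins, the 100 bp overlap threshold and the 2000 bp context are taken from the text.

```python
WINDOW = 1000       # bin size in bp (from the text)
MIN_OVERLAP = 100   # minimum peak/window overlap in bp (from the text)
CONTEXT = 2000      # input sequence length in bp (from the text)


def label_windows(chrom_len, peaks):
    """Return the set of window indices that are positive for one label.

    `peaks` is a list of half-open (start, end) intervals in bp (assumed format).
    A window is positive if any peak overlaps it by at least MIN_OVERLAP bp.
    """
    positives = set()
    for start, end in peaks:
        first = start // WINDOW
        last = (end - 1) // WINDOW
        for w in range(first, last + 1):
            w_start = w * WINDOW
            w_end = min((w + 1) * WINDOW, chrom_len)
            overlap = min(end, w_end) - max(start, w_start)
            if overlap >= MIN_OVERLAP:
                positives.add(w)
    return positives


def context_bounds(window_index, chrom_len):
    """Half-open bounds of the CONTEXT-bp input centered on a window.

    Clipping at the chromosome ends is an assumption of this sketch.
    """
    center = window_index * WINDOW + WINDOW // 2
    start = max(0, center - CONTEXT // 2)
    end = min(chrom_len, center + CONTEXT // 2)
    return start, end
```

For example, a peak spanning 950–1200 bp overlaps window 0 by only 50 bp and window 1 by 200 bp, so only window 1 is labeled positive.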

Table 1.

Datasets summary

| Cell line | Train windows | Valid windows | Test windows | TFs | HMs | DNase I |
|---|---|---|---|---|---|---|
| GM12878 | 368 082 | 89 911 | 79 731 | 90 | 11 | 2 |
| K562 | 457 609 | 106 777 | 117 815 | 150 | 12 | 2 |

Note: GM12878 contains 103 total chromatin profile labels, and K562 contains 164 total labels. We use the same chromosomes for training, validation, and testing for both datasets.


3.2 3D Genome data

We then use 3D genome data, also known as chromatin conformation data, from Hi-C contact maps to extract interaction evidence between the DNA windows. We use 1000 bp resolution intra-chromosome Hi-C data from Rao et al. (2014). For K562, the finest available resolution is 5000 bp, so we upsample to get 1000 bp resolution. We convert the Hi-C contact map for each chromosome into a graph whose nodes are 1000 bp DNA windows and whose edges represent contact between two 1000 bp windows. Since the full Hi-C contact map for each chromosome is too dense, we rank the contacts and use the top 500 000 Hi-C contacts as edges per chromosome (each chromosome maps to a Hi-C graph). Contacts are ranked using the SQRTVC normalization from Rao et al. (2014), which normalizes for the distance between two positions so that long-range contacts are included in the top 500 000. While we chose these specific DNA window and Hi-C resolution sizes, we want to emphasize that our model is applicable to any window size and Hi-C resolution.
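The graph construction described above amounts to keeping the highest-ranked contacts as edges. The following is a minimal sketch under the assumption that normalized contact scores are already available as a dictionary keyed by window pairs; the function names and input format are hypothetical, and the normalization itself is not reproduced here.

```python
import heapq


def top_k_edges(contacts, k):
    """Keep the k highest-scoring contacts as graph edges.

    `contacts` maps (i, j) window-index pairs to a normalized contact score
    (assumed to be computed beforehand).
    """
    best = heapq.nlargest(k, contacts.items(), key=lambda item: item[1])
    return [edge for edge, _score in best]


def adjacency_lists(edges, n_windows):
    """Build undirected neighbor sets N(i) for every window."""
    neighbors = [set() for _ in range(n_windows)]
    for i, j in edges:
        neighbors[i].add(j)
        neighbors[j].add(i)
    return neighbors


# Toy scores for a chromosome with four windows (hypothetical values).
toy_contacts = {(0, 1): 5.0, (0, 3): 2.0, (1, 2): 0.5, (2, 3): 4.0}
toy_edges = top_k_edges(toy_contacts, k=2)
toy_neighbors = adjacency_lists(toy_edges, n_windows=4)
```

In the setting of the text, `k` would be 500 000 and the procedure would be applied once per chromosome.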

4 Materials and methods

Our goal is to learn a model $f$ that takes a DNA subsequence window $x_i$ and predicts the probability of a set of epigenetic labels, $\hat{y}_i = f(x_i)$, where $\hat{y}_i$ is an $L$-dimensional vector. Our method, ChromeGCN, uses three submodules for $f$: $f_{\text{CNN}}$, $f_{\text{GCN}}$ and $f_{\text{Pred}}$. The first module, $f_{\text{CNN}}$, models local sequence patterns from each window using a CNN. This module takes as input $x_i$ and outputs a vector representation $h_i = f_{\text{CNN}}(x_i)$. The second module, $f_{\text{GCN}}$, models long-range 3D genome dependencies between windows using a GCN. This module takes as input all window vectors $h_i$ concatenated as $H$, as well as their Hi-C relationships represented by the adjacency matrix $A$, and outputs refined representations of all windows, $Z = f_{\text{GCN}}(H, A)$. The $z_i$ of each window now encodes both window sequence patterns and the relationships between windows. We can then predict the epigenetic labels by applying a classifier function to each $z_i$: $\hat{y}_i = f_{\text{Pred}}(z_i)$. An overview of ChromeGCN is shown in Figure 1c. The following sections explain each submodule in detail.

4.1 Modeling local sequence representations using CNNs

Following the recent successes in many epigenetic label prediction tasks (Alipanahi et al., 2015; Lanchantin et al., 2017; Quang and Xie, 2016; Zhou and Troyanskaya, 2015; Zhou et al., 2018), we learn a representation of each genomic window sequence $x_i$ using a CNN. CNNs have become the de facto standard for encoding short DNA windows because they effectively capture local sequence structure. Each learned kernel, or filter, in a CNN effectively learns a DNA ‘motif’, or short contiguous sequence representative of a particular output label (Alipanahi et al., 2015). Since many epigenetic processes are hypothesized to be dependent on motifs (Bailey et al., 2015), CNNs are a good choice for encoding DNA.

This module, $f_{\text{CNN}}$, takes an input genomic sequence window $x_i$ and outputs an embedding vector $h_i$. We represent window $x_i$ of length $\tau$ as a one-hot encoded matrix $X_i \in \mathbb{R}^{\tau \times n_{in}}$, where $n_{in}$ is 4, representing the base-pair characters A, C, G and T. Convolution with filters (i.e. learned motifs) of length $k < \tau$ takes an input data matrix $X_i$ of size $\tau \times n_{in}$ and outputs a matrix $X'_i$ of size $\tau \times n_{out}$, where $n_{out}$ is the chosen dimension of the learned hidden representations:
$$X'_i = \sigma(W * X_i) \tag{1}$$
where $W \in \mathbb{R}^{n_{out} \times n_{in} \times k}$ are the trainable weights, $*$ denotes 1D convolution along the sequence dimension, and $\sigma$ is a function enforcing element-wise nonlinearity.

Equation (1) can then be repeated for several layers, where each successive layer uses a new $W$ and $n_{in}$ is replaced with $n_{out}$ from the previous layer. In our implementation, we use six layers of convolution, where each successive layer learns higher-order motifs of the window. After the convolutional layers, the output of the last layer is flattened into a vector and then linearly transformed into a lower-dimensional vector of size $d$, which we denote $h_i$. Succinctly, the CNN module computes $h_i = f_{\text{CNN}}(x_i)$ for each window.
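To make the one-hot input and a single convolutional filter concrete, the sketch below scans one filter along a short sequence. It is an illustration only: the filter is laid out as $k \times 4$ (one output channel), zero padding keeps the output length equal to $\tau$, ReLU is used as the nonlinearity, and unknown bases are encoded as all-zero rows. These choices and the function names are assumptions, not the configuration used in the experiments.

```python
ALPHABET = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a tau x 4 matrix; unknown bases become all-zero rows."""
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET]
            for base in seq.upper()]


def conv1d_same(x, weights, bias=0.0):
    """Scan one k x 4 filter along the sequence with zero padding, then apply ReLU.

    Returns tau activations, i.e. one column of the layer output.
    """
    k = len(weights)
    pad = k // 2
    tau = len(x)
    out = []
    for t in range(tau):
        total = bias
        for j in range(k):
            pos = t + j - pad
            if 0 <= pos < tau:  # positions outside the window count as zeros
                for c in range(4):
                    total += weights[j][c] * x[pos][c]
        out.append(max(0.0, total))
    return out


# A hand-written filter that responds to the dinucleotide "CG".
cg_filter = [[0.0, 1.0, 0.0, 0.0],   # position 1 prefers C
             [0.0, 0.0, 1.0, 0.0]]   # position 2 prefers G
activations = conv1d_same(one_hot("ACGTA"), cg_filter)
```

The activation peaks where the filter's preferred pattern occurs, which is the sense in which a learned filter acts as a motif detector.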

4.2 Modeling long-range 3D genome relationships using GCNs

Although CNN models work well on independent local window sequences, they disregard known long-range relationships between windows that are influential in the chromatin profile. One option would be to extend the window size. However, due to the 3D shape of DNA, long-range contact dependencies may be located millions of base-pairs apart, making current convolutional models infeasible. In this section, we introduce the fGCN module, a method to explicitly and efficiently model such long-range interactions using GCNs.

Known long-range relationships in the 3D genome are available in the form of Hi-C contact maps. A Hi-C map can be represented as an adjacency matrix $A$, where the non-zero elements indicate contacts in the 3D space between two DNA windows. (In our experiments, we use a different adjacency matrix $A$ for each chromosome, i.e. intra-chromosome Hi-C maps. However, in general we let $A$ represent all possible window interactions, including inter-chromosome maps.) In our formulation, we represent sequence windows $x_i$ as nodes on a graph, and $A$ gives the edges between the windows. We can then use a modified version of GCNs (Kipf and Welling, 2016) to update each $x_i$ with its neighboring windows $x_j, j \in \mathcal{N}(i)$, where $\mathcal{N}(i)$ denotes the neighbors of node $i$ obtained from $A$. The GCN works by revising a window’s representation $h_i^t$, where $h_i^0$ is the output of the first module, $f_{\text{CNN}}$, and $t$ denotes the GCN layer index. Specifically, each $h_i^t$ is revised using a parameterized summation of its neighbors, $h_j^t, j \in \mathcal{N}(i)$:
$$h_i^{t+1} = \sigma\Big(\sum_{j \in \mathcal{N}(i)} h_j^t W^t\Big) \tag{2}$$
where $\sigma(\cdot)$ is a non-linear activation function such as tanh and $W^t \in \mathbb{R}^{d \times d}$ is a linear feature transformation matrix for GCN layer $t$. Importantly, using the summation in Equation (2), the representation of each DNA window $x_i$ is updated based on the representation of its neighbors (windows that interact with $x_i$ in the 3D genome). We can compute the simultaneous update of all windows together by concatenating all $h_i^t$ into $H^t \in \mathbb{R}^{N \times d}$, where $N$ is the number of DNA windows and $d$ is the dimension of each $h_i^t$. The simultaneous update can then be written as:
$$H^{t+1} = \sigma(\tilde{A} H^t W^t) \tag{3}$$
where $\tilde{A} = \hat{D}^{-\frac{1}{2}} (A + I) \hat{D}^{-\frac{1}{2}}$ is the normalized adjacency matrix and $\hat{D}$ is the diagonal degree matrix of $(A + I)$. Adding $I$ lets each window retain its own representation, and the degree normalization rescales the neighbor sum of Equation (2).
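The normalized adjacency matrix and the simultaneous update of Equation (3) can be sketched in a few lines of NumPy. This is an illustrative dense-matrix version with tanh as the activation and random toy parameters; the toy graph, sizes and variable names are assumptions, and a real chromosome graph would call for sparse matrices.

```python
import numpy as np


def normalize_adjacency(A):
    """Return D^-1/2 (A + I) D^-1/2, where D is the diagonal degree matrix of A + I."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_layer(A_norm, H, W):
    """One simultaneous update of all window embeddings, with tanh as the activation."""
    return np.tanh(A_norm @ H @ W)


# Toy graph: window 1 is in contact with windows 0 and 2 (hypothetical data).
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
rng = np.random.default_rng(0)
H0 = rng.normal(size=(3, 4))          # N = 3 windows, d = 4 features
W0 = rng.normal(size=(4, 4)) * 0.1    # d x d feature transformation
A_norm = normalize_adjacency(A)
H1 = gcn_layer(A_norm, H0, W0)        # updated embeddings, still N x d
```

Because the self-loop is included in the degree, every window has degree at least one and the inverse square root is always defined.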
In our experiments, we use a variant of the GCN that uses a gating function, allowing the model to use or not use neighboring windows to update each $h_i^t$:
(4)
 
(5)
 
(6)
where $W_z^t$ is a linear transformation matrix, $\mathbf{1}$ is a vector of all 1s, and $w_g^t \in \mathbb{R}^d$ is used to compute the gating vector. The gating vector $g^t$ allows the model to selectively choose between the neighborhood representation of nodes, $\tilde{H}^t$, and the independent representation, $H^t$. Equations (3–6) describe one layer update of the GCN window embeddings $H$. In our experiments, we use a two-layer gated GCN (i.e. $t = 2$), and we denote the final output $H^t$ as $Z$, where row vector $z_i$ of $Z$ represents the output of window $i$. In summary, the GCN module computes the following: $Z = f_{\text{GCN}}(H, A)$.
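As an illustration of how such a gate can interpolate between the neighborhood and independent representations, the sketch below implements one plausible gated layer. Its exact form is an assumption made for this example: the neighborhood representation is taken as $\tanh(\tilde{A} H W_z)$, the gate is a sigmoid of that representation projected by $w_g$, and the gate is broadcast over features via an outer product with the all-ones vector. It should not be read as the precise parameterization used in the experiments.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_gcn_layer(A_norm, H, W_z, w_g):
    """One gated update (assumed form; see the accompanying text)."""
    H_tilde = np.tanh(A_norm @ H @ W_z)        # neighborhood representation, N x d
    g = sigmoid(H_tilde @ w_g)                 # one gate value per window, shape (N,)
    gate = np.outer(g, np.ones(H.shape[1]))    # g 1^T: same gate for all d features
    return gate * H_tilde + (1.0 - gate) * H   # mix neighborhood and independent views


# Toy example with hypothetical sizes and random parameters.
rng = np.random.default_rng(0)
N, d = 4, 3
A_hat = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 0],
                  [0, 0, 0, 0]], dtype=float) + np.eye(N)
d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
H = np.tanh(rng.normal(size=(N, d)))           # stand-in for CNN embeddings
Z = H
for _ in range(2):                             # two gated layers, as in the text
    Z = gated_gcn_layer(A_norm, Z, rng.normal(size=(d, d)), rng.normal(size=d))
```

With a gate near 0 a window keeps its sequence-only embedding, and with a gate near 1 it adopts the representation aggregated from its Hi-C neighbors.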

4.3 Predicting label probabilities for each window

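After $Z$ is computed, we use a linear classifier layer, $f_{\text{Pred}}$, to classify each $z_i$ into its output space (a set of epigenetic labels): $\hat{y}_i = \sigma(z_i W + b)$. In summary, the prediction $\hat{y}_i$ for a particular input sample $x_i$ can be decomposed into three steps:
$$h_i = f_{\text{CNN}}(x_i), \qquad Z = f_{\text{GCN}}(H, A), \qquad \hat{y}_i = f_{\text{Pred}}(z_i).$$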

4.4 Identifying and interpreting important Hi-C edges via Hi-C saliency maps

One benefit of the ChromeGCN formulation is that we use explicit, or known, long-range window dependencies for chromatin profile prediction. As a result, we propose a method to identify the important dependencies for ChromeGCN’s predictions. We call our proposed method Hi-C saliency maps. Saliency maps were introduced by Simonyan et al. (2013) to understand the importance of each pixel in an input $x_i$ for the prediction of the image’s true class. We instead are trying to understand the importance of each edge in $A$ for the prediction of the chromatin profile over all windows. The saliency map of $A$ is defined as the absolute value of the gradient of a true class prediction $\hat{y}_i^l$ with respect to $A$, where $l$ is a true class for sample $i$. The absolute gradient is then element-wise multiplied by $A$ to zero out ‘non-edges’. Since there are $N$ samples, and there can be multiple true labels for a particular sample $y_i$, we define the Hi-C saliency map, $S_{\text{Hi-C}}$, as the absolute gradient w.r.t. $A$ accumulated over all samples and true labels:
$$S_{\text{Hi-C}} = \sum_{i=1}^{N} \sum_{l \,:\, y_i^l = 1} \left| \frac{\partial \hat{y}_i^l}{\partial A} \right| \circ A \tag{7}$$
where $\circ$ is the Hadamard product. Since the saliency map is accumulated over all $N$ windows, we normalize $S_{\text{Hi-C}}$ across each row to a 0–1 range so that we can interpret the edges at each window.

While Hi-C contact maps tell us where the contacts are, Hi-C saliency maps show us how important each contact is for the chromatin profiles. We define Equation (7) to be over all labels, but we can easily visualize the Hi-C saliency, or important edges for one particular label. We show both the full Hi-C saliency across all labels, as well as for one specific label (YY1) in the experiments.
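The accumulation of absolute gradients over samples and true labels can be illustrated on a deliberately simplified model. The sketch below replaces ChromeGCN with a single linear graph layer followed by a sigmoid, $\hat{Y} = \text{sigmoid}(A H W)$, so that the gradient with respect to $A$ has a closed form; in the full model the gradient would instead be obtained by automatic differentiation through the gated GCN. The toy model, the division by the row maximum as the row-wise normalization, and all names are assumptions of this example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def predict(A, H, W):
    """Toy stand-in for the model: one linear graph layer plus a sigmoid (N x L)."""
    return sigmoid(A @ H @ W)


def hic_saliency(A, H, W, Y, normalize=True):
    """Accumulated |d y_hat_i^l / d A| over true labels, masked to existing edges.

    In this toy model, y_hat_i depends only on row i of A, so row i of the
    result collects the contribution of sample i.
    """
    P = predict(A, H, W)
    M = np.abs(H @ W)              # |d logit_i^l / d A_iq| = |(H W)_ql|
    G = Y * P * (1.0 - P)          # sigmoid derivative, kept for true labels only
    S = (G @ M.T) * A              # sum over labels, then zero out non-edges
    if normalize:
        row_max = S.max(axis=1, keepdims=True)
        S = np.divide(S, row_max, out=np.zeros_like(S), where=row_max > 0)
    return S


# Toy data (hypothetical): 4 windows, 3 features, 2 labels.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)
H = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 2))
Y = np.array([[1, 0], [0, 1], [1, 1], [0, 0]], dtype=float)
S = hic_saliency(A, H, W, Y)
```

Restricting the sum over labels to a single column of `Y` gives the saliency for one particular label.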

4.5 Model variations

To test the effectiveness of the long-range dependencies, we use the following ChromeGCN variations. Each variation uses the same model, with different edge dependencies in the form of A.

ChromeGCN$_{\text{const}}$: Instead of using Hi-C edges, we use a constant set of nearby neighbors according to the 1D sequential DNA representation. We define each window $x_i$’s neighbors to be the windows surrounding $x_i$ (7 on each side: $x_{i-7}, \ldots, x_{i+7}$), which we denote as $A_{\text{const}}$. This variation computes $Z = f_{\text{GCN}}(H, A_{\text{const}})$. It allows us to see whether the very long-range interactions from the normalized Hi-C maps are useful.

ChromeGCN$_{\text{Hi-C}}$: This variation uses the original Hi-C adjacency matrix, $A_{\text{Hi-C}}$, and computes $Z = f_{\text{GCN}}(H, A_{\text{Hi-C}})$.

Raw Hi-C contacts are predominantly close neighboring contacts. However, by using the top 500 000 contacts after the SQRTVC normalization for the Hi-C graph, we reduce some of this locality bias in the graph. As a result, many of the edges connect windows that are far apart in the 1D space. This allows us to decouple the effects of local neighboring contacts (constant) and long-range (normalized Hi-C) contacts.

ChromeGCN$_{\text{const+Hi-C}}$: Last, we use a combination of the constant neighborhood around each window and the Hi-C adjacency matrix, which integrates close and far windows for each window. This variation computes $Z = f_{\text{GCN}}(H, A_{\text{const+Hi-C}})$.
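The three edge sets differ only in how the graph is built. A minimal sketch of the constant neighborhood and of its combination with Hi-C edges is given below; representing graphs as sets of undirected `(i, j)` pairs with `i < j` and combining the two graphs by set union are assumptions made for this illustration.

```python
def constant_edges(n_windows, radius=7):
    """Edges linking each window to the `radius` windows on each side in 1D."""
    return {(i, j)
            for i in range(n_windows)
            for j in range(i + 1, min(i + radius, n_windows - 1) + 1)}


def combine_edges(const_edges, hic_edges):
    """Union of the constant neighborhood and the Hi-C contacts (assumed combination)."""
    return const_edges | {tuple(sorted(edge)) for edge in hic_edges}


# Toy chromosome with 5 windows and a radius of 2 for readability.
toy_const = constant_edges(5, radius=2)
toy_combined = combine_edges(toy_const, [(4, 0), (1, 0)])
```

Here the Hi-C contact between windows 0 and 4 adds a long-range edge that the constant neighborhood alone cannot provide, while the contact between windows 0 and 1 is already covered.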

4.6 Model details and training

To circumvent the GPU memory constraints of training end-to-end, we pre-train the $f_{\text{CNN}}$ model by classifying each $h_i$ with the classification function $\hat{y}_i = f_{\text{Pred}}(h_i)$. Once the pre-training converges on the training set, we use the resulting embeddings $h_i$ for each sample as fixed inputs to $f_{\text{GCN}}$. Although we pre-train $f_{\text{CNN}}$, ChromeGCN is still end-to-end differentiable, making it possible to use sequence visualization methods such as DeepLIFT (Shrikumar et al., 2017) for a particular window.

For all model predictions, we run the forward sequence and its reverse complement through the model simultaneously and average the two outputs. All DNA window inputs are encoded using a lookup table that maps each character A, C, G, T and N (unknown) to a d-dimensional vector. The output of the encoding is a $d \times \tau$ matrix, where $\tau$ denotes sequence length ($\tau = 2000$ in our experiments).
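The strand averaging described above can be sketched as follows. The `model` argument is a hypothetical callable that maps a DNA string to a list of label probabilities; it stands in for the trained network and is not part of the original implementation.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string; N stays N."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def predict_both_strands(model, seq):
    """Average the model outputs for the forward and reverse-complement strands."""
    forward = model(seq)
    reverse = model(reverse_complement(seq))
    return [(f + r) / 2.0 for f, r in zip(forward, reverse)]


def toy_model(seq):
    """Hypothetical stand-in model: fractions of A and G in the sequence."""
    return [seq.count("A") / len(seq), seq.count("G") / len(seq)]


averaged = predict_both_strands(toy_model, "AACG")
```

Averaging makes the prediction independent of which strand the window happens to be read from.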

All of our models are trained using stochastic gradient descent with momentum of 0.9 and a learning rate of 0.25. The CNN model is trained using a batch size of 64, and the GCN and RNN models are trained using an entire chromosome as a batch (since each is modeling the between window dependencies of an entire chromosome at once). The CNN model projects each window to a vector of dimension 128. The GCN uses two layers of feature dimension 128 at each layer.

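ChromeGCN predicts the probabilities of all labels for each window: $\hat{y}_i \in \mathbb{R}^L$, where $L$ is the total number of labels. For our loss function, we use the mean binary cross-entropy across all samples $N$ and labels $L$:
$$\mathcal{L} = -\frac{1}{NL} \sum_{i=1}^{N} \sum_{l=1}^{L} \Big[ y_i^l \log \hat{y}_i^l + (1 - y_i^l) \log\big(1 - \hat{y}_i^l\big) \Big] \tag{8}$$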
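For reference, the mean binary cross-entropy over all samples and labels can be computed as in the following minimal sketch. The clipping constant `eps`, added to avoid taking the logarithm of zero, and the function name are assumptions of this example.

```python
import math


def mean_bce(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy over N samples and L labels.

    `y_true` and `y_pred` are N x L nested lists of labels and probabilities.
    """
    total = 0.0
    count = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for y, p in zip(true_row, pred_row):
            p = min(max(p, eps), 1.0 - eps)   # keep log() finite
            total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
            count += 1
    return total / count


# Two windows, two labels each (hypothetical values).
loss = mean_bce([[1, 0], [0, 1]], [[0.9, 0.1], [0.2, 0.8]])
```

Each label contributes independently to the loss, which matches the multi-label setting in which several labels can be positive for the same window.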

5 Experiments and results

5.1 Baselines

We compare against the state-of-the-art chromatin profile prediction model from Zhou et al. (2018), referred to as the CNN baseline, as well as the recurrent model from Quang and Xie (2016). Since our model outputs labels for TFBS, HMs and accessibility, motif-based methods (Bailey et al., 2009) are not applicable. Since we have 368 082 training samples, kernel-based methods such as that of Ghandi et al. (2014) are not applicable either. Zhou and Troyanskaya (2015) compared their CNN to a modified version of Ghandi et al. (2014), which only used a small number of training samples, and the CNN model was significantly better. Our CNN baseline (Zhou et al., 2018) is an improved version of the model from Zhou and Troyanskaya (2015). Furthermore, the focus of our study is to show that state-of-the-art deep learning models are missing important long-range dependencies in the genome.

CNN: To illustrate the importance of the GCN, we compare against the outputs from the $f_{\text{CNN}}$ module: $\hat{y}_i = f_{\text{Pred}}(h_i)$. This is the six-layer CNN model from Zhou et al. (2018), where we modify the last layer in order to extract a d-dimensional feature vector output. This is the same CNN that is pre-trained for ChromeGCN to produce each $h_i$.

DanQ: This model uses an RNN on top of CNN outputs within a window. It still uses local sequence window inputs, but models relationships between sequence patterns via an LSTM (Quang and Xie, 2016).

ChromeRNN: As a baseline to compare against using GCNs for long-range dependency modeling, we construct an RNN model on the window embeddings $h_1, h_2, \ldots, h_N$. After pre-training the CNN module $f_{CNN}$, the RNN model takes in all window embeddings at once and models the sequential dependencies among windows: $(z_1, z_2, z_3, \ldots, z_N) = f_{RNN}(h_1, h_2, h_3, \ldots, h_N)$. As with ChromeGCN, the RNN is shared across chromosomes, but does not cross chromosomes. In other words, the embeddings are updated one chromosome at a time, $f_{RNN}(h_i, h_{i+1}, \ldots, h_C)$, where C is the total number of windows for a chromosome. We note that this is different from the DanQ baseline (Quang and Xie, 2016), which uses an RNN within windows; ChromeRNN instead models dependencies between windows. In our experiments, we use an LSTM (Quang and Xie, 2016) with the same number of layers and hidden units as the GCN.

5.2 Prediction performance

To evaluate the methods, we use area under the ROC curve (AUROC), area under the precision-recall curve (AUPR), and the mean recall at a 50% false discovery rate (FDR) cutoff. Table 2 shows the mean metric results across all epigenetic labels for each cell line. Modeling long-range dependencies results in significant improvements over the baseline CNN model, which does not account for such long-range interactions. For instance, with respect to AUPR for GM12878, ChromeRNN improves upon the CNN from 0.350 to 0.384, and ChromeGCN outperforms ChromeRNN to achieve a mean AUPR of 0.395. We also see that ChromeRNN outperforms DanQ, indicating that using an RNN to model between-window dependencies is more important than modeling within-window features. Moreover, we can see that the ChromeGCN_Hi-C models outperform ChromeRNN, indicating that not only the closely neighboring windows in the 1D space contribute to the improvements, but also the close neighbors in the 3D space, as indicated by the Hi-C maps used. ChromeGCN outperforms the baselines on the TF and DNase I labels, and ChromeRNN outperforms all other methods on the HM labels. This indicates that non-local modeling is particularly important for TF binding and accessibility. The performance results for each label type (TF, HM and DNase I) are shown in Tables 3 and 4. We also provide detailed plots of ROC and precision-recall curves in Supplementary Appendix Figures SA5–8.
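Of the three metrics, recall at a 50% FDR cutoff is the least standard, so a small sketch may help. It assumes one common reading of the metric: sweep a score threshold from high to low and report the largest recall reached while FP / (TP + FP) stays at or below 0.5. The function names are hypothetical, and the exact thresholding used in the paper's evaluation code may differ.

```python
def recall_at_fdr(y_true, scores, max_fdr=0.5):
    """Largest recall achievable while the false discovery rate <= max_fdr.

    y_true: binary labels (0 or 1) for one chromatin profile label.
    scores: predicted probabilities, same length as y_true.
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        return 0.0
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    tp = fp = 0
    best = 0.0
    for k, (score, label) in enumerate(ranked):
        if label:
            tp += 1
        else:
            fp += 1
        # Evaluate only after the last item of a run of tied scores,
        # so a threshold never splits windows with identical scores.
        if k + 1 < len(ranked) and ranked[k + 1][0] == score:
            continue
        if fp / (tp + fp) <= max_fdr:
            best = max(best, tp / n_pos)
    return best


def mean_recall_at_fdr(label_columns, score_columns, max_fdr=0.5):
    """Average the per-label metric across all labels, as in the tables."""
    values = [recall_at_fdr(y, s, max_fdr)
              for y, s in zip(label_columns, score_columns)]
    return sum(values) / len(values)
```

AUROC and AUPR are averaged across labels in the same way.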

Table 2.

Performance results

| Method | GM12878: mean AUROC | GM12878: mean AUPR | GM12878: mean recall at 50% FDR | K562: mean AUROC | K562: mean AUPR | K562: mean recall at 50% FDR |
| --- | --- | --- | --- | --- | --- | --- |
| CNN (Zhou et al., 2018) | 0.895 | 0.350 | 0.293 | 0.894 | 0.325 | 0.265 |
| DanQ (Quang and Xie, 2016) | 0.886 | 0.348 | 0.290 | 0.900 | 0.343 | 0.290 |
| ChromeRNN | 0.906 | 0.384 | 0.342 | 0.910 | 0.365 | 0.327 |
| ChromeGCN_const | 0.904 | 0.377 | 0.331 | 0.904 | 0.358 | 0.321 |
| ChromeGCN_Hi-C | 0.904 | 0.385 | 0.341 | 0.903 | 0.358 | 0.319 |
| ChromeGCN_const+Hi-C | **0.909** | **0.395** | **0.356** | **0.912** | **0.372** | **0.338** |

Note: For both cell lines, GM12878 and K562, we show the average across all labels for three different metrics. Our method, which uses a GCN to model long-range dependencies, improves performance over the baseline CNN model, which assumes that all DNA segments are independent. Detailed results for all cell lines and label types are shown in the Supplementary Appendix. Best performing methods are shown in bold.

Table 3.

Performance on GM12878 for each label category

| Method | TFs: mean AUROC | TFs: mean AUPR | TFs: mean recall at 50% FDR | HMs: mean AUROC | HMs: mean AUPR | HMs: mean recall at 50% FDR | DNase I: mean AUROC | DNase I: mean AUPR | DNase I: mean recall at 50% FDR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CNN (Zhou et al., 2018) | 0.909 | 0.328 | 0.262 | 0.796 | 0.478 | 0.469 | 0.801 | 0.657 | 0.750 |
| DanQ (Quang and Xie, 2016) | 0.899 | 0.325 | 0.260 | 0.794 | 0.479 | 0.463 | 0.792 | 0.650 | 0.723 |
| ChromeRNN | 0.914 | 0.354 | 0.299 | **0.859** | **0.574** | **0.610** | **0.821** | 0.689 | **0.787** |
| ChromeGCN_const | 0.913 | 0.348 | 0.291 | 0.849 | 0.554 | 0.58 | 0.815 | 0.68 | 0.777 |
| ChromeGCN_Hi-C | 0.916 | 0.362 | 0.309 | 0.819 | 0.516 | 0.528 | 0.814 | 0.683 | 0.774 |
| ChromeGCN_const+Hi-C | **0.918** | **0.369** | **0.319** | 0.845 | 0.552 | 0.584 | 0.820 | **0.692** | 0.783 |

Note: Best performing methods are shown in bold.

Table 4.

Performance on K562 for each label category

| Method | TFs: mean AUROC | TFs: mean AUPR | TFs: mean recall at 50% FDR | HMs: mean AUROC | HMs: mean AUPR | HMs: mean recall at 50% FDR | DNase I: mean AUROC | DNase I: mean AUPR | DNase I: mean recall at 50% FDR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CNN (Zhou et al., 2018) | 0.905 | 0.314 | 0.256 | 0.78 | 0.410 | 0.317 | 0.792 | 0.625 | 0.683 |
| DanQ (Quang and Xie, 2016) | 0.909 | 0.331 | 0.279 | 0.796 | 0.438 | 0.365 | 0.797 | 0.638 | 0.693 |
| ChromeRNN | 0.917 | 0.349 | 0.305 | **0.848** | **0.520** | **0.528** | 0.817 | 0.664 | 0.742 |
| ChromeGCN_const | 0.911 | 0.342 | 0.301 | 0.84 | 0.507 | 0.507 | 0.810 | 0.657 | 0.729 |
| ChromeGCN_Hi-C | 0.912 | 0.346 | 0.306 | 0.807 | 0.454 | 0.412 | 0.812 | 0.657 | 0.730 |
| ChromeGCN_const+Hi-C | **0.919** | **0.358** | **0.320** | 0.837 | 0.495 | 0.484 | **0.821** | **0.67** | **0.752** |

Note: Best performing methods are shown in bold.

Furthermore, since ChromeGCN models the window relationships explicitly and not recurrently, we obtain a significant speedup at test time over the ChromeRNN baseline. ChromeGCN achieves a 6.3× speedup on the three GM12878 test chromosomes and 6.8× speedup at test time on the three K562 test chromosomes.

5.3 Analysis of using Hi-C data

Figure 2 shows a detailed comparison of ChromeGCN_Hi-C versus the baseline CNN model across three different metrics for both GM12878 and K562. Each point represents a label, and the y-axis shows the absolute improvement of the ChromeGCN_Hi-C model over the CNN. The labels are sorted on the x-axis by the average degree of the label’s positive samples (i.e. windows where the label is positive) on the Hi-C map. For all three metrics, the improvement of ChromeGCN over the CNN increases as the average degree of the label increases. This indicates that ChromeGCN is most beneficial for labels that have many neighbors in the Hi-C graph (i.e. those that are frequently in contact with other segments in 3D space). Two of the TFs that obtain the highest performance increase (in the top 5) from using ChromeGCN over the CNN, CEBPB and STAT3, are corroborated by Ma et al. (2018), who show that these two TFs commonly co-occur with other TFs in 3D space when binding.
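The x-axis quantity, the average Hi-C degree of a label's positive windows, can be computed as in the sketch below. It assumes the Hi-C graph for one chromosome is given as a list of undirected (i, j) window-index pairs with each contact listed once; the function names and this input format are illustrative assumptions rather than the paper's data structures.

```python
def window_degrees(num_windows, edges):
    """Degree of each window in an undirected contact graph.

    edges: iterable of (i, j) window-index pairs, each contact listed once.
    """
    degrees = [0] * num_windows
    for i, j in edges:
        degrees[i] += 1
        if j != i:  # count a self-loop only once
            degrees[j] += 1
    return degrees


def average_positive_degree(degrees, labels):
    """Mean degree over the windows where the label is positive (1)."""
    positive = [d for d, y in zip(degrees, labels) if y == 1]
    return sum(positive) / len(positive) if positive else 0.0
```

Sorting the labels by this value and plotting each label's metric difference (ChromeGCN_Hi-C minus CNN) against the resulting rank reproduces the layout described for Figure 2.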

Fig. 2.

Comparison of our ChromeGCN_Hi-C method versus the baseline CNN (Zhou et al., 2018) for three metrics. Each point represents one chromatin profile label. The labels are sorted along the x-axis by the average degree of their positive windows. The y-axis indicates the absolute increase of ChromeGCN_Hi-C over the CNN model. As the average degree increases, the improvement of the ChromeGCN_Hi-C model over the CNN increases. Green points indicate that ChromeGCN_Hi-C performed better and red points indicate that the CNN performed better. The blue line shows the linear trend line. ChromeGCN_Hi-C is significantly better, as demonstrated by the P-values from a pairwise t-test.

The P-values shown are computed by a pairwise t-test across all labels. The ChromeGCN_Hi-C model significantly outperforms the CNN model in all three metrics. Importantly, our results indicate that by using the long-range interactions given by Hi-C data, we can improve the modeling of chromatin state labels, resulting in better classification accuracy.
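To make the significance test concrete, the sketch below computes the test statistic for a paired t-test over per-label scores, assuming that "pairwise t-test" here means a paired test in which each label contributes one (ChromeGCN_Hi-C, CNN) score pair. The function name is hypothetical. Converting the statistic to a P-value requires the Student t distribution with n − 1 degrees of freedom, which libraries such as SciPy provide (`scipy.stats.ttest_rel`).

```python
import math


def paired_t_statistic(scores_a, scores_b):
    """t statistic for the paired differences scores_a[k] - scores_b[k].

    Each position k is one label; the two lists hold the same metric
    (e.g. AUPR) for the two models being compared.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(variance / n)
```

A large positive statistic means the first model scores higher than the second consistently across labels, not just on average.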

5.4 Long-range interaction visualization

A significant merit of ChromeGCN is that by using known 3D genome relationships, we can find and visualize the critical relationships for chromatin profile prediction. To understand how important the Hi-C edges are for the predictions of ChromeGCN, we visualize the saliency map of A, as explained in Section 4.4.

Figure 3 shows the Hi-C saliency map for Chromosome 8 in GM12878. Figure 3 (left) shows all 500k Hi-C contacts used for Chromosome 8. Windows are represented as points along the circle, with a total of 23 600 windows. Lines between the windows represent Hi-C edges, and the darkness of a line represents the saliency, or importance, of that edge for chromatin state prediction across all windows in Chromosome 8. Figure 3 (right) shows the saliency map for 250 windows (total of 250 bp input) in Chromosome 8. Cell (i, j) tells us the importance of window j (column) for predicting the labels of window i (row) (Fig. 4).

Fig. 3.

Hi-C Saliency Map Visualization. Left: Saliency map for all 500k edges in A_Hi-C for GM12878 Chromosome 8 (total of 23 600 windows). The darker the line, the more important that edge was for predicting the correct chromatin profile, indicating that the Hi-C data were used by the GNN for that particular interaction. Right: Fine-grained analysis of the Chromosome 8 saliency map. This figure shows the normalized saliency map values for 250 windows (total of 250 bp input) in Chromosome 8.

Although Figure 3 shows the Hi-C saliency map for all chromatin profile labels, we can also visualize the Hi-C saliency map for individual labels. The inner loop of Equation (7) changes to only use the label of interest. In Supplementary Appendix Figure SA9, we show the Hi-C saliency map for the YY1 TF on GM12878 Chromosome 8. This gives us insight into the important 3D contacts for YY1 binding.
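The idea of an edge saliency map, and of restricting it to a single label, can be illustrated on a toy model. The sketch below is not ChromeGCN and does not reproduce Equation (7): it assumes a hypothetical one-layer model, ŷ[i][l] = sigmoid(Σ_j A[i][j]·X[j][l]), and defines the saliency of entry A[i][j] as the absolute gradient of the summed outputs with respect to that entry. All function names are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(adj, feats):
    """Toy model: y[i][l] = sigmoid(sum_j adj[i][j] * feats[j][l])."""
    n, num_labels = len(adj), len(feats[0])
    return [[sigmoid(sum(adj[i][j] * feats[j][l] for j in range(n)))
             for l in range(num_labels)] for i in range(n)]


def edge_saliency(adj, feats, labels=None):
    """|d(sum of selected outputs) / d adj[i][j]| for every pair (i, j).

    labels: label indices to include; None means all labels. Passing a
    single index gives a per-label map, analogous to the YY1 example.
    """
    n, num_labels = len(adj), len(feats[0])
    selected = list(range(num_labels)) if labels is None else list(labels)
    y = predict(adj, feats)
    saliency = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Only row i of the output depends on adj[i][j];
            # the derivative of sigmoid(z) is y * (1 - y).
            grad = sum(y[i][l] * (1.0 - y[i][l]) * feats[j][l] for l in selected)
            saliency[i][j] = abs(grad)
    return saliency
```

In a deep model the same quantity would normally be obtained by automatic differentiation rather than a hand-derived gradient, but the interpretation of cell (i, j) is the same: how strongly the prediction for window i responds to its connection to window j.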

6 Conclusion

In this work, we present ChromeGCN, a novel framework that combines both local sequence and long-range 3D genome data (via Hi-C data) for chromatin profile prediction. We show experimentally that ChromeGCN outperforms previous state-of-the-art methods that only use local sequence data. Additionally, we show that we can identify and visualize the important 3D genome dependencies using the proposed Hi-C saliency maps. We plan to further investigate the value of Hi-C saliency maps in future work.

In this work, we demonstrate ChromeGCN on two well-known cell types. However, it is important to be able to port our method to other cell types. Fortunately, our method is broadly generalizable to any cell type for which epigenetic and Hi-C data are available. Additionally, users can use any cell type that has Hi-C-like data (e.g. HiChIP or ChIA-PET). This is an important distinction of our work: all the user needs is data that denote where the long-range relationships are in the genome. Furthermore, if a cell type does not have any training labels, we can use the genomic features and graph of a related cell type and perform transfer learning to predict on the new cell type. We plan to extend our work into this domain.

Although we demonstrate the importance of ChromeGCN on the task of chromatin profile prediction, ChromeGCN is a generic model for incorporating 3D genome structure into any genome sequence prediction task. We hope that this work encourages researchers to use known long-range relationships from 3D genome data when constructing machine learning models. We plan to release our data and code for greater visibility. ChromeGCN introduces an effective and efficient framework to model such relationships for better chromatin modeling, as well as an easy way to interpret important relationships.

Financial Support: This work was partly supported by the National Science Foundation under NSF CAREER award No. 1453580. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect those of the National Science Foundation.

Conflict of Interest: none declared.

References

Alipanahi B. et al. (2015) Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol., 33, 831–838.

Ay F. et al. (2014) Statistical confidence estimation for Hi-C data reveals regulatory chromatin contacts. Genome Res., 24, 999–1011.

Bailey S.D. et al. (2015) Znf143 provides sequence specificity to secure chromatin interactions at gene promoters. Nat. Commun., 6, 6186.

Bailey T.L. et al. (2009) Meme suite: tools for motif discovery and searching. Nucleic Acids Res., 37, W202–W208.

Brackley C.A. et al. (2012) Facilitated diffusion on mobile DNA: configurational traps and sequence heterogeneity. Phys. Rev. Lett., 109, 168103.

ENCODE Project Consortium. (2004) The ENCODE (encyclopedia of DNA elements) project. Science, 306, 636–640.

Dai H. et al. (2016) Discriminative embeddings of latent variable models for structured data. In International Conference on Machine Learning, pp. 2702–2711.

Dai Z. et al. (2019) Transformer-xl: attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.

Devlin J. et al. (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Ghandi M. et al. (2014) Enhanced regulatory sequence prediction using gapped k-mer features. PLoS Comput. Biol., 10, e1003711.

Gilmer J. et al. (2017) Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212.

Hakim O., Misteli T. (2012) Snapshot: chromosome conformation capture. Cell, 148, 1068.e1–2.

Hamilton W.L. et al. (2017) Representation learning on graphs: methods and applications. arXiv preprint arXiv:1709.05584.

Hassanzadeh H.R., Wang M.D. (2016) Deeperbind: enhancing prediction of sequence specificities of DNA binding proteins. In 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 178–183. IEEE.

Hochreiter S., Schmidhuber J. (1997) Long short-term memory. Neural Comput., 9, 1735–1780.

Kelley D.R. et al. (2018) Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Res., 28, 739–750.

Kipf T.N., Welling M. (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.

Kundaje A. et al.; Roadmap Epigenomics Consortium. (2015) Integrative analysis of 111 reference human epigenomes. Nature, 518, 317–330.

Lanchantin J. et al. (2016) Deep motif: visualizing genomic sequence classifications. arXiv preprint arXiv:1605.01133.

Lanchantin J. et al. (2017) Deep motif dashboard: visualizing and understanding genomic sequences using deep neural networks. In Pacific Symposium on Biocomputing 2017, pp. 254–265. World Scientific.

Ma X. et al. (2018) Canonical and single-cell hi-c reveal distinct chromatin interaction sub-networks of mammalian transcription factors. Genome Biol., 19, 174.

Mifsud B. et al. (2015) Mapping long-range promoter contacts in human cells with high-resolution capture Hi-C. Nat. Genet., 47, 598–606.

Quang D., Xie X. (2016) DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Res., 44, e107.

Rao S.S. et al. (2014) A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell, 159, 1665–1680.

Scarselli F. et al. (2008) The graph neural network model. IEEE Trans. Neural Netw., 20, 61–80.

Schreiber J. et al. (2018) Nucleotide sequence and DNase I sensitivity are predictive of 3D chromatin architecture. bioRxiv, 103614.

Shrikumar A. et al. (2017) Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning. Vol. 70, pp. 3145–3153. JMLR.org.

Simonyan K. et al. (2013) Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034.

Singh R. et al. (2017) GaKCo: a fast gapped k-mer string kernel using counting. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 356–373. Springer.

Stormo G.D. (2000) DNA binding sites: representation and discovery. Bioinformatics, 16, 16–23.

Vaswani A. et al. (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.

Veličković P. et al. (2017) Graph attention networks. arXiv preprint arXiv:1710.10903.

Wang X. et al. (2018) Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794–7803.

Wikipedia contributors. (2019) Chromosome conformation capture - Wikipedia, the free encyclopedia (5 November 2019, date last accessed).

Wong K.-C. (2017) Motifhyades: expectation maximization for de novo DNA motif pair discovery on paired sequences. Bioinformatics, 33, 3028–3035.

Zhou J., Troyanskaya O.G. (2015) Predicting effects of noncoding variants with deep learning–based sequence model. Nat. Methods, 12, 931–934.

Zhou J. et al. (2018) Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk. Nat. Genet., 50, 1171–1179.

Zitnik M. et al. (2018) Modeling polypharmacy side effects with graph convolutional networks. Bioinformatics, 34, i457–i466.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
