Yutong Yu, Pengju Ding, Hongli Gao, Guozhu Liu, Fa Zhang, Bin Yu, Cooperation of local features and global representations by a dual-branch network for transcription factor binding sites prediction, Briefings in Bioinformatics, Volume 24, Issue 2, March 2023, bbad036, https://doi.org/10.1093/bib/bbad036
Abstract
Interactions between DNA and transcription factors (TFs) play an essential role in understanding transcriptional regulation mechanisms and gene expression. Owing to the large accumulation of training data and low expense, deep learning methods have shown huge potential in determining the specificity of TF-DNA interactions. Convolutional network-based and self-attention network-based methods have been proposed for transcription factor binding site (TFBS) prediction. Convolutional operations are efficient at extracting local features but tend to ignore global information, whereas self-attention mechanisms excel at capturing long-distance dependencies but struggle to attend to local feature details. To discover as comprehensive a set of features as possible for a given sequence, we propose a dual-branch model combining self-attention and convolution, dubbed DSAC, which fuses local features and global representations in an interactive way. In terms of features, convolution and self-attention contribute to feature extraction collaboratively, enhancing representation learning. In terms of structure, a lightweight but efficient network architecture is designed for the prediction; in particular, the dual-branch structure allows the convolution and the self-attention mechanism to be fully exploited to improve the predictive ability of our model. Experimental results on 165 ChIP-seq datasets show that DSAC clearly outperforms five other deep learning based methods and demonstrate that our model can effectively predict TFBSs from sequence features alone. The source code of DSAC is available at https://github.com/YuBinLab-QUST/DSAC/.
Introduction
Transcription factors (TFs) regulate the gene transcription process by specifically binding DNA fragments in regulatory regions, and these binding recognition sites are called transcription factor binding sites (TFBSs) [1]. Transcription is the main stage of gene regulation, influencing gene expression through the specific binding of TFs to DNA [2, 3]. Identification of TFBSs has gained significant importance in bioinformatics and attracted increasing attention from researchers, because it is closely associated with transcription and is key to understanding the transcriptional regulation mechanism. With the accumulation of large-scale DNA sequencing results and the development of high-throughput experimental techniques, intensive research has been carried out; one representative technique is chromatin immunoprecipitation sequencing (ChIP-seq) [4], which obtains unbiased binding sites directly from sequenced genomes by combining ChIP with ultra-high-throughput massively parallel sequencing. Even though ChIP-seq has achieved good performance, some problems remain. On the one hand, the large amount of data and the bias generated during the experiment mean that noise is always present [5, 6]. On the other hand, ChIP-seq assays require large amounts of experimental material that are difficult to obtain. For these reasons, new solutions need to be found.
Computational approaches have been applied to solve many bioinformatics and computational biology problems [7–9], owing to the fact that they do not require specific experimental materials. Conventional approaches, such as machine learning methods, have been developed to improve the prediction. Foundational work on predicting protein–DNA binding with computational methods involved the use of hidden Markov models (HMMs) [10], support vector machines (SVMs) [11] and random forests [12, 13]. However, a current limitation of all traditional machine learning algorithms is that, for the ever-increasing amount of DNA sequence data, improving prediction performance and fully utilizing the experimental data cannot be achieved at the same time.
Since the advent of deep learning, computer vision and natural language processing have made substantial advances [14–16]. In recent years, methods based on deep learning, especially convolutional neural networks (CNNs) [17], have been successfully applied to the prediction of TFBSs [18–22]. For example, DeepBind [23] utilizes CNNs to ascertain sequence specificities from experimental data, and DeepSEA [24] leverages CNNs to predict noncoding-variant effects de novo from sequence. In addition to these CNN-based methods, some networks combine CNNs with other algorithms; for example, DanQ [25] is a hybrid framework that combines CNNs with bi-directional long short-term memory (Bi-LSTM) units. Deep learning methods are innately proficient at retrieving the positional relationships between sequence signals while keeping computational complexity low, which not only improves prediction accuracy but also yields clear performance gains over classical machine learning, making the application of deep learning in this field increasingly extensive and flexible.
The recent progress of natural language processing has been driven in part by the self-attention mechanism [26–28]. The self-attention mechanism relies on the correlation information between different positions within a single sequence to model its features, and each position is computed as a weighted sum of all positions. Sequence models can thus be enhanced to effectively capture long-distance dependencies [29, 30], addressing a major shortcoming of convolutional and recurrent networks. Considering the complementary properties of CNNs and self-attention, better results can be expected by integrating the two modules [31–34]. Problems in natural language processing tasks can be cast as sequence problems, and consequently these methods are applicable to the identification of associations in DNA or RNA sequences. Ullah and Ben-Hur [35] proposed a self-attention based deep learning model, which mainly contains a CNN layer and a multi-head self-attention layer, to capture regulatory element interactions in genomic sequences. SAResNet, proposed by Shen et al. [36], combines the self-attention mechanism with a residual network structure, incorporating spatial and local information to improve the learning ability of the network. These remarkable works show that the self-attention layer is particularly useful for detecting potential motifs and can capture a global landscape of interactions between regulatory elements in a given sequence.
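As a brief reminder of how the weighted sum referenced above is formed, the standard scaled dot-product formulation commonly used in these works can be written as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$

where $Q$, $K$ and $V$ are the query, key and value projections of the input sequence and $d_k$ is the key dimension; the softmax weights determine how much each position contributes to every output position.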
In this work, we propose DSAC, a dual-branch network model that merges the global representations extracted by a self-attention branch and the local features extracted by a CNN branch for TF binding prediction. The performance of DSAC is evaluated on 165 ChIP-seq datasets, and the results demonstrate that interactive feature extraction at the early stage and the dual-branch architecture at the later stage can greatly enhance the representation learning of DNA sequences and make significant contributions to improving the model performance.
Materials and methods
Datasets
To evaluate the predictive ability of our model, we chose the same datasets that were used in the work conducted by Zhang et al. [37]. In detail, the 165 ChIP-seq datasets were collected from 690 ChIP-seq datasets produced by the Encyclopedia of DNA Elements (ENCODE) project [38], where positive sets consist of 101 bp DNA sequences that were experimentally confirmed to bind to a given DNA-binding protein and negative sets consist of shuffled positive sequences that maintain dinucleotide frequency. These datasets contain 29 TFs from various cell lines and a detailed description of them is shown in Supplementary Table S1.
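As an illustrative check of the property these negative sets preserve (not part of the original pipeline), the snippet below counts overlapping dinucleotide frequencies; a dinucleotide-preserving shuffle of a positive sequence should leave these frequencies unchanged. The function name and example sequence are hypothetical.

```python
from collections import Counter

def dinucleotide_frequencies(seq: str) -> dict:
    """Relative frequency of each overlapping dinucleotide in a DNA sequence."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    counts = Counter(pairs)
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}

# A properly shuffled negative sequence should report the same frequencies
# as the positive 101 bp sequence it was derived from.
example_positive = "ACGT" * 25 + "A"   # placeholder 101 bp sequence
print(dinucleotide_frequencies(example_positive))
```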
Performance evaluation metrics
Accuracy (ACC) [39] simply indicates the percentage of correct predictions over all TFBSs. ROC-AUC is the area under the receiver operating characteristic (ROC) curve and PR-AUC is the area under the precision-recall curve. Previous studies have focused on ROC-AUC as the measure of performance; although it is a good assessment of classification performance, it is relatively insensitive to false positive predictions when the sequence data are unbalanced and tends to favor the larger class, whereas PR-AUC provides a better assessment of false positives [40, 41]. Hence, we use ACC, ROC-AUC and PR-AUC as the evaluation metrics.
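For concreteness, a minimal sketch (not the authors' evaluation script) of computing these three metrics with scikit-learn is shown below; the labels and predicted probabilities are hypothetical, and PR-AUC is approximated by average precision.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score, average_precision_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                   # hypothetical labels
y_prob = np.array([0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.8, 0.3])   # hypothetical predicted probabilities

acc = accuracy_score(y_true, (y_prob >= 0.5).astype(int))  # ACC at a 0.5 threshold
roc_auc = roc_auc_score(y_true, y_prob)                    # area under the ROC curve
pr_auc = average_precision_score(y_true, y_prob)           # area under the precision-recall curve

print(f"ACC={acc:.3f}  ROC-AUC={roc_auc:.3f}  PR-AUC={pr_auc:.3f}")
```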
Model architecture
We describe the model in three parts: the first two parts describe the structure of the convolution module and the self-attention module in detail, and the last summarizes the overall architecture, showing how the two modules interact to jointly construct the model.
CNN module
Self-attention module

The illustrative diagram of DSAC. (A) Block diagram of the network architecture of DSAC. (B) Self-attention module.
In our work, two self-attention modules with dropout [48] were used in the first and final stages of the DSAC architecture, respectively. A dropout rate of 0.2 in the first stage and a dropout rate of 0.7 in the final stage were used to effectively alleviate overfitting. The model performed worse with the other two pairs of dropout values, (0.3, 0.7) and (0.4, 0.6), suggesting that dropout values of (0.2, 0.7) better balance the model's ability to extract information against the need to prevent overfitting.
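A minimal PyTorch sketch of a self-attention block with dropout is given below for illustration; the layer sizes, head count and module layout are assumptions rather than the exact DSAC implementation (see the GitHub repository for the released code). Only the dropout rates (0.2 for the first stage, 0.7 for the final stage) follow the text above.

```python
import torch
import torch.nn as nn

class SelfAttentionBlock(nn.Module):
    """Illustrative self-attention module with dropout (dimensions are assumptions)."""
    def __init__(self, embed_dim: int = 64, num_heads: int = 4, dropout: float = 0.2):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, sequence_length, embed_dim)
        attn_out, _ = self.attn(x, x, x)               # each position attends to all positions
        return self.norm(x + self.dropout(attn_out))   # residual connection + dropout

# Dropout rates taken from the text: 0.2 in the first stage, 0.7 in the final stage.
first_stage = SelfAttentionBlock(dropout=0.2)
final_stage = SelfAttentionBlock(dropout=0.7)
x = torch.randn(8, 101, 64)                  # hypothetical batch of 101-step embeddings
print(final_stage(first_stage(x)).shape)     # torch.Size([8, 101, 64])
```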
Model overview
From a global perspective, we use the global context obtained from the self-attention module as feature maps to strengthen the global perception capability of the CNN branch, while local features from the convolution layers are continually embedded to reinforce the self-attention branch's grasp of local details. In this way, the two branches interact flexibly. In addition, this dual-branch structure improves the generalization ability of the model and speeds up model training.
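The following sketch illustrates the kind of cross-branch exchange described above: the attention branch's global context is added to the CNN branch's feature maps, and the CNN branch's local features are fed back into the attention branch. All module definitions and dimensions here are illustrative assumptions rather than the released DSAC code.

```python
import torch
import torch.nn as nn

class DualBranchStage(nn.Module):
    """Illustrative dual-branch stage with cross-branch feature exchange (assumed layout)."""
    def __init__(self, channels: int = 64, num_heads: int = 4):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=7, padding=3)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, local_feat: torch.Tensor, global_feat: torch.Tensor):
        # local_feat:  (batch, channels, length)  -- CNN branch
        # global_feat: (batch, length, channels)  -- self-attention branch
        # Global context reinforces the CNN branch's perception of the whole sequence.
        conv_out = torch.relu(self.conv(local_feat + global_feat.transpose(1, 2)))
        # Local features reinforce the attention branch's grasp of local detail.
        attn_in = global_feat + local_feat.transpose(1, 2)
        attn_out, _ = self.attn(attn_in, attn_in, attn_in)
        return conv_out, attn_out

stage = DualBranchStage()
local = torch.randn(8, 64, 101)    # hypothetical CNN-branch feature maps
global_ = torch.randn(8, 101, 64)  # hypothetical attention-branch representations
local, global_ = stage(local, global_)
print(local.shape, global_.shape)  # torch.Size([8, 64, 101]) torch.Size([8, 101, 64])
```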
Feature representation and model implementation
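A minimal sketch of the standard one-hot representation used by most sequence-based TFBS models is given below, assuming the common four-channel A/C/G/T convention (this is an assumption for illustration, not necessarily the exact DSAC representation): each 101 bp sequence becomes a 4 x 101 matrix with one channel per nucleotide.

```python
import numpy as np

# One-hot encoding sketch (assuming the common 4-channel A/C/G/T convention).
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot_encode(seq: str) -> np.ndarray:
    """Encode a DNA sequence as a 4 x L matrix; ambiguous bases (e.g. N) stay all-zero."""
    mat = np.zeros((4, len(seq)), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        idx = BASE_INDEX.get(base)
        if idx is not None:
            mat[idx, i] = 1.0
    return mat

encoded = one_hot_encode("ACGTN" * 20 + "A")  # a hypothetical 101 bp sequence
print(encoded.shape)  # (4, 101)
```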
Results and discussion
Model variants
To specifically evaluate the performance and show the improvement of our model, we discuss it from two aspects: features and structure. For the feature analysis, the discussion mainly covers the local features extracted by the CNN module and the global representations extracted by the self-attention mechanism. Here, we designed a model without the CNN branch and a model without the self-attention branch to compare with the original model, which effectively reflects the influence of local and global features on the prediction performance. For the structure analysis, we designed two model variants. The first simply stacks the CNN module and the self-attention module linearly and is dubbed SACSAC, meaning that CNN modules and self-attention modules are interleaved and the result is finally fed into one classifier. The other variant, SACF, removes the two-classifier architecture from the original model structure: after the CNN branch and the self-attention branch have been executed, the refined feature maps obtained by the two branches from the sequence profiles are concatenated along the first dimension and reintegrated by a fully connected layer, and a sigmoid classifier then outputs the classification probabilities (a minimal sketch of this fusion head is given below, after Figure 2). These model variants were designed for comparison with the dual-branch model, thereby demonstrating the superiority of our model structure. A detailed description of the model variants with different structures is shown in Figure 2. All the model variants were trained on the same 165 ChIP-seq benchmark datasets as the proposed model DSAC, with the same hyperparameter settings in all experiments.

A brief illustration of the variant models: (A) the SACSAC model; (B) the SACF model.
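A minimal sketch of the SACF-style fusion head described above, with assumed feature dimensions: the refined features of the two branches are concatenated, reintegrated by a fully connected layer and passed to a sigmoid classifier.

```python
import torch
import torch.nn as nn

class SACFHead(nn.Module):
    """Illustrative SACF-style fusion head (feature dimensions are assumptions)."""
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.fc = nn.Linear(2 * feat_dim, feat_dim)   # reintegrate the concatenated features
        self.classifier = nn.Linear(feat_dim, 1)

    def forward(self, cnn_feat: torch.Tensor, attn_feat: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([cnn_feat, attn_feat], dim=1)  # concatenate the two branches
        fused = torch.relu(self.fc(fused))
        return torch.sigmoid(self.classifier(fused))     # binding probability

head = SACFHead()
prob = head(torch.randn(8, 128), torch.randn(8, 128))    # hypothetical branch features
print(prob.shape)  # torch.Size([8, 1])
```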
Analysis on features
Without the CNN branch, the local features extracted by the earlier convolution layers are embedded into the self-attention module, and the model places special emphasis on capturing global information about DNA sequences rather than local details. This comparison therefore indicates that an independent self-attention branch does not pay enough attention to local features, resulting in poorer final predictions. Without the self-attention branch, the stacking depth of the convolutions is insufficient and the initial self-attention mechanism alone cannot obtain sufficient global features; the model therefore lacks multi-level feature extraction, resulting in a decrease in performance. Figure 3 illustrates the performance of DSAC and the model variants under the average ACC, ROC-AUC and PR-AUC metrics, and it can be seen that our model outperforms the model without the CNN branch and the model without the self-attention branch on all of these metrics. Specifically, the average ROC-AUC of DSAC was 0.887, which was 0.8% higher than the model without the self-attention branch (0.879) and 2.1% higher than the model without the CNN branch (0.866). We also observed that, in terms of PR-AUC scores, the relative improvement of DSAC (0.891) over the model without the CNN branch (0.885) was 0.6%, and over the model without the self-attention branch (0.869) was 2.2%. It can also be seen from Figure 4 that DSAC clearly outperforms the variant models under the ROC-AUC and PR-AUC distributions. This indicates that our model extracts more informative features and is better at TFBS prediction. Furthermore, the detailed local features gradually supplied by the CNN branch and the global representations steadily learned by the self-attention branch are not isolated, but are applied to our model in a complementary and interactive manner, which inherits the advantages of the convolutional network and the self-attention mechanism.

Comparison of DSAC with the model without CNN branch and the model without self-attention branch on 165 ChIP-seq datasets under the average ACC, ROC-AUC and PR-AUC metrics.

The ROC-AUC distribution and PR-AUC distribution of DSAC and variant models. The white point and the thick black bar in the middle indicate the median and interquartile range, respectively. The black line represents the interval from the minimum non-outlier value to the maximum non-outlier value.
Analysis on structure
Regarding ways to connect the CNN module and the self-attention module, many researchers have proposed using self-attention to enhance convolutional networks. The significant defect of the convolution operation is that it only works on a local neighborhood, thus missing global information, whereas the self-attention module can capture more relational information in the network, which offers a way to optimize the performance of convolutional networks. Another approach is to improve self-attention networks with CNN modules: the self-attention mechanism captures global contextual information at each location and thereby loses the ability to focus on a specific region, and additional convolution modules make up for this shortcoming by adding local information to the self-attention network. In contrast, our model structure lets self-attention and convolution promote each other, making the best use of the advantages of both. From Figure 5, it is evident that DSAC is substantially better than the two model variants, SACSAC and SACF. The comparison with SACSAC shows that the dual-branch design of our model is better than the interleaved stacking of CNN and self-attention modules: because information cannot be effectively transitioned between modules, the simple stacked structure cannot make the CNN or the self-attention handle the results from the previous layer well, and naturally cannot fully exploit the complementarity of the two modules. SACF also adopts the dual-branch architecture, but the prediction performance obtained by concatenating and reintegrating the features is far worse than that obtained without concatenation and reintegration. It can also be seen from Figure 5 that DSAC performs better than SACF on the vast majority of datasets. This shows that both branches are capable of independent prediction, and that the SACF structure leaves the integrated features with a lot of redundant information, thus reducing prediction accuracy. In summary, the design of our model architecture enables the two branches to extract diverse feature information at the same time and then specialize in the features they obtain after receiving complementary information from each other, which greatly enhances the model's ability to capture DNA sequence information.

Comparison of DSAC with SACSAC and SACF on 165 ChIP-seq datasets under the ROC-AUC and PR-AUC metrics.
Comparison with other methods
To further evaluate the performance of DSAC, we compared it with several deep learning based models, including DeepBind [23], DanQ [25], CRPTS [51], DLBSS [52] and D-SSCA [37]. As in the previous experiments, the 165 ChIP-seq datasets were used to evaluate our model and all the competing methods. Figure 6 shows the comparison results (ACC, ROC-AUC and PR-AUC scores across all the test sets) between DSAC and the other predictors. For all the evaluation metrics calculated in our work, the maximum values of DSAC were higher than those of every model except DLBSS; even where the maximum values did not exceed DLBSS, the minimum values were substantially improved over all competing approaches, showing that our model achieves surprisingly good generalization ability. These results indicate that our model performs better than all the other methods, revealing that interactive feature extraction and the dual-branch network architecture contribute to improving the prediction performance of the model. From Table 1, we can clearly observe that DSAC achieved a statistically significant performance improvement in terms of ACC, ROC-AUC and PR-AUC scores. To be specific, the average ACC, ROC-AUC and PR-AUC scores of DSAC were 0.816, 0.887 and 0.891, respectively, which were 2.3%, 2.0% and 2.0% higher than those of the suboptimal model (0.793, 0.867 and 0.871, respectively).

Performance comparison between DSAC and the five competing methods on 165 ChIP-seq datasets under the ACC, ROC-AUC and PR-AUC metrics. The middle line inside each box indicates the median and the ends of the box are the upper and lower quartiles. The two lines outside the box are whiskers extending to the highest and lowest observations, and the diamond marks indicate outliers.
Among these competing methods, DeepBind and DLBSS are models based only on CNNs, while DanQ and CRPTS both use a hybrid framework that combines CNNs and BLSTMs; D-SSCA mainly uses convolution modules and attention mechanisms. Our model performs better than DeepBind and DLBSS, which shows that the addition of the self-attention module is effective and plays a crucial role in the performance of our model: a model in which convolution modules and self-attention mechanisms cooperate can extract more informative features than a model that merely stacks convolution modules. DeepBind and DLBSS perform worse than our model mainly because CNNs alone, with their limited receptive field, are not enough to extract long-term dependencies and obtain sufficient information. BLSTM is a variant of the recurrent neural network (RNN) that consists of two LSTMs, one taking the input in a forward direction and the other in a backward direction. BLSTMs can effectively increase the amount of information available to the network and capture long-term dependencies, like the self-attention mechanism. However, both DanQ and CRPTS perform worse than our model, which indicates that the self-attention module in our architecture has a strong ability to learn long-range dependencies in DNA sequences. On the other hand, it shows that the BLSTMs used in DanQ and CRPTS can capture global information but do not make good use of local information when combined with CNNs. In particular, for capturing long-range dependencies, the addition of the self-attention mechanism means our model needs neither recurrent networks, which are computationally expensive to train, nor convolution networks with large window sizes.
Table 1. Average ACC, ROC-AUC and PR-AUC scores of DSAC and the competing methods on the 165 ChIP-seq datasets

| Method | ACC | ROC-AUC | PR-AUC |
|---|---|---|---|
| DanQ | 0.782 | 0.849 | 0.855 |
| DeepBind | 0.785 | 0.853 | 0.858 |
| CRPTS | 0.793 | 0.862 | 0.867 |
| DLBSS | 0.793 | 0.865 | 0.871 |
| D-SSCA | 0.793 | 0.867 | 0.871 |
| DSAC | 0.816 | 0.887 | 0.891 |
In summary, to evaluate the performance improvement of DSAC, we compared it with five deep learning based models on the 165 ChIP-seq datasets. The results show that our model performs better than the other competing models and demonstrate the feasibility of a dual-branch architecture combining the self-attention mechanism and CNNs.
Conclusion
In this study, we have presented DSAC, a dual-branch combination network for TF binding prediction. Owing to the accumulation of large-scale training data and relatively low cost, deep learning frameworks have been extensively applied to bioinformatics and have shown promising potential to mine the complex relationships hidden in large-scale biological data [53]. Our work demonstrates one such method, which combines the global representations extracted by a self-attention branch and the local features extracted by a CNN branch, enhancing feature learning in an interactive fashion. In addition, our model adopts a dual-branch architecture, and the comparisons with the model based on interleaved stacking of convolutional blocks and self-attention modules and with the model that fuses the two branches' features show that our structure exploits the advantages of the CNN branch and the self-attention branch more effectively. Moreover, benchmarking experiments show that the performance of DSAC is clearly higher than that of other competing deep learning methods on the 165 ChIP-seq datasets.
Even though DSAC has achieved good performance, there are still some limitations, and the model can be further improved in the following aspects. First, the architecture of DSAC is limited to setting up independent paths for the CNN and self-attention modules. Although there is interaction between the two modules in our model, the internal relationship between them is not fully exploited; later research can start from the internal structure of the two modules to find better ways to connect them. Second, due to limited computing resources, we did not design comparative experiments with different hyperparameters, but only roughly determined them. Third, we only used DNA sequences to extract features; other information, such as DNA shape profiles, could be fused to improve the predictive power for TFBSs.
Key Points
This study proposes a dual-branch network, termed DSAC, which combines the self-attention mechanism and CNN modules to predict transcription factor binding sites.
The self-attention mechanism focuses on global representation learning while the CNN modules pay more attention to local feature details; the two cooperate well in our model, enhancing the representation learning of DNA sequences.
The dual-branch architecture allows better interaction between the self-attention mechanism and the CNN modules and also enhances the robustness of the model.
Benchmarking experiments show that the performance of DSAC is clearly higher than that of other competing deep learning methods on the 165 ChIP-seq datasets.
Funding
National Natural Science Foundation of China (62172248, 61932018); Natural Science Foundation of Shandong Province of China (ZR2021MF098).
Author Biographies
Yutong Yu is an undergraduate at the College of Information Science and Technology, Qingdao University of Science and Technology, China. Her research interests are bioinformatics and deep learning.
Pengju Ding is a master's student at the College of Information Science and Technology, Qingdao University of Science and Technology, China. Her research interests are bioinformatics and deep learning.
Hongli Gao is a master's student at the College of Mathematics and Physics, Qingdao University of Science and Technology, China. Her research interests are bioinformatics and deep learning.
Guozhu Liu is a professor at the College of Information Science and Technology, Qingdao University of Science and Technology, China. His research interests include artificial intelligence, machine learning and algorithms.
Fa Zhang is a professor at the School of Medical Technology, Beijing Institute of Technology, China. His research interests include bioinformatics, high-performance computing and machine learning.
Bin Yu is a professor at the College of Information Science and Technology, School of Data Science, Qingdao University of Science and Technology, China. His research interests include bioinformatics, artificial intelligence and biomedical image processing.