Abstract

A DNase I hypersensitive site (DHS) is a chromatin region that is hypersensitive to cleavage by the DNase I enzyme. DHSs are an important part of the noncoding genome and contain a variety of regulatory elements, such as promoters, enhancers and transcription factor-binding sites. Moreover, disease- and trait-associated loci are usually enriched in DHS regions. Therefore, the detection of DHS regions is of great significance. In this study, we develop a deep learning-based algorithm to identify whether an unknown sequence region is a potential DHS. The proposed method showed high prediction performance on both training datasets and independent datasets across different cell types and developmental stages, demonstrating that it is well suited to the identification of DHSs. Furthermore, for the convenience of wet-lab researchers, the user-friendly web server iDHS-Deep was established at http://lin-group.cn/server/iDHS-Deep/, with which users can easily distinguish DHSs from non-DHSs and obtain the corresponding developmental stage of a DHS.

INTRODUCTION

Some chromatin regions containing transcriptionally active genes are more than 100 times more sensitive to DNase I degradation than transcriptionally inactive regions. These chromatin regions are called DNase I hypersensitive sites (DHSs) [1] (Figure 1). DHSs typically mark compact [less than 250 base pairs (bp)] functional cis-regulatory elements (CREs) [2] within promoters, enhancers, inhibitors and insulators, and can therefore provide reliable signposts for high-precision delineation of regulatory DNA in complex genomes [3]. Furthermore, genetic variations in DHSs are associated with a variety of diseases and phenotypic traits [4]. Recent studies have shown that DHSs play pivotal roles in cancer [5], Alzheimer’s disease [6], coronary artery disease and some other common diseases [7]. For example, in breast cancer research, the most highly mutated DHSs were identified as driver distal regulatory elements that affect the expression of cancer genes [5]. In the study of He et al. [8], DHSs were highly correlated with active gene expression in MSB1 cells and can be considered as markers for identifying the CREs associated with chicken Marek’s disease. Therefore, precise identification of DHSs is of crucial importance not only for deciphering transcriptional regulation but also for understanding the mechanisms of complex diseases.

Figure 1. Schematic diagram of a DNase I hypersensitive site in the genome.

Several experimental techniques [9, 10] have been developed to study DHSs. In particular, DNase-seq profiling [11] has enabled remarkable progress in understanding DHSs at different developmental stages across different organs. To date, more than 290 million DHSs derived from different tissues/cell types at different developmental stages have been identified by various high-throughput methods [12]. Each tissue/cell type is represented by multiple distinct DHS profiles derived from different individuals.

Based on such abundant data resources, many computational algorithms for detecting DHSs have been developed. For the identification of DHSs in the human genome, Noble et al. [13] designed the SVM-RevcKmer model, in which RevcKmer was used to extract sequence information and the classifier was built with a support vector machine (SVM). Feng et al. [14] developed an SVM-based model using pseudo K-tuple nucleotide composition (PseKNC) to identify DHSs. Liu et al. [15] proposed an ensemble predictor named iDHS-EL, which combined three individual Random Forest (RF) classifiers with three different feature extraction methods (Kmer, RevKmer and PseDNC). Xu et al. [16] created a predictor named iDHSs-PseTNC using a deep sparse auto-encoder with pseudo trinucleotide composition (PseTNC). Manavalan et al. [17] used the RF algorithm to select an optimal feature set from the combination of nucleotide composition and physicochemical properties (PP) and built the SVM-based predictor DHSpred. Liang and Zhang [18] extracted features with the detrended moving-average cross-correlation coefficient descriptor based on the dinucleotide property matrix and used SVM to construct the classification model. Zhang et al. [19] used dinucleotide-based spatial autocorrelation to extract features and then used an Ensemble Bagged Tree as the classifier to build the iDHS-DSAMS predictor. Recently, they developed the iDHS-DXG model [20] based on XGBoost. For the prediction of DHSs in plant genomes (Arabidopsis and rice), Zhang et al. [21] combined RevcKmer and dinucleotide-based auto covariance to construct the feature space and built an SVM-based predictor called pDHS-SVM. They then proposed an ensemble predictor, pDHS-WE, based on five classes of features (Kmer, RevKmer, Mismatch, PseKNC and AC) [22]. Soon afterward, they developed an ensemble method based on extreme learning machines (ELM) and two kinds of sequence-based features (RevcKmer and PseDNC) [23]. Lately, they proposed the pDHS-DSET method for DHS prediction [24]. These methods are summarized in Table 1 in terms of species, year of publication, method name, feature extraction method, prediction algorithm, web-server availability and reference.

Table 1. A comprehensive list of published methods for DHS prediction

| Species | Year | Name | Feature extraction | Classifier | Web server | Ref |
|---|---|---|---|---|---|---|
| Homo sapiens | 2005 | SVM-RevcKmer | RevcKmer | SVM | No | [13] |
| Homo sapiens | 2014 | SVM-PseKNC | PseKNC | SVM | No | [14] |
| Homo sapiens | 2016 | iDHS-EL | Kmer, RevKmer, PseDNC | RF | Yes | [15] |
| Homo sapiens | 2017 | iDHSs-PseTNC | PseTNC | Deep sparse auto-encoder | No | [16] |
| Homo sapiens | 2018 | DHSpred | NC, PP | SVM | No | [17] |
| Homo sapiens | 2019 | iDHS-DMCAC | DPM | SVM | No | [18] |
| Homo sapiens | 2020 | iDHS-DSAMS | DSA | Ensemble bagged tree | No | [19] |
| Homo sapiens | 2020 | iDHS-DXG | Kmer, Mismatch, DSA | XGBoost | No | [20] |
| Arabidopsis, Rice | 2017 | pDHS-SVM | RevKmer, DAC | SVM | No | [21] |
| Arabidopsis, Rice | 2018 | pDHS-WE | Kmer, RevKmer, Mismatch, PseKNC, AC | RF | No | [22] |
| Arabidopsis, Rice | 2018 | pDHS-ELM | RevKmer, PseDNC | Extreme learning machine | No | [23] |
| Arabidopsis, Rice | 2019 | pDHS-DSET | Kmer, RevKmer, Mismatch, PseDNC | SVM | No | [24] |

Although the above-mentioned machine learning approaches [13–24] have taken an initial step toward the prediction of DHSs in human and plant genomes, there are no published methods for the identification of DHSs in the mouse genome, especially for organ-specific or developmental stage-specific DHSs. In view of this, we introduce iDHS-Deep, a deep learning approach for the determination of DHSs in various tissues and developmental stages in the mouse genome. iDHS-Deep showed strong performance in predicting DHSs in both 5-fold cross-validation and independent dataset validation. Finally, an open source ensemble tool was provided at http://lin-group.cn/server/iDHS-Deep/, which can predict whether an unknown sequence is a potential DHS at a certain developmental time point.

MATERIALS AND METHODS

Data collection and preprocessing

Breeze et al. [25] used DNase-seq to profile the regulatory landscape across the late embryonic and fetal stages of mouse development and created a comprehensive atlas comprising the DHSs of distinct tissues and developmental time points. Based on their study, we collected DHS sequences from multiple tissues and developmental stages in the mouse genome.

To obtain reliable datasets, we plotted the length distribution of DHSs and the distribution of distances between two adjacent DHSs in Figure 2A and B. Further analysis found that more than 94.87% of DHSs were less than 300 bp in length, and more than 63.44% of the distances between adjacent DHSs were less than 10 000 bp. Based on these investigations, we constructed positive and negative samples. To ensure the stability of the model, we only selected DHSs whose sequence length is between 50 and 300 bp as positive samples. For the construction of negative samples, the sequence fragments located between adjacent DHSs and longer than 10 000 bp were first selected as candidate negative samples. Subsequently, we took the center point $p$ of each selected sequence fragment and generated a list of coordinates on both flanks of $p$ according to the rule $p \pm 1000 \times n$ ($n = 0, 1, 2, \ldots$) until the distances from the two outermost coordinates to the corresponding ends of the sequence fragment were less than 2000 bp. Finally, the sequence fragments flanking these coordinates, with arbitrary lengths in the range of 50–300 bp, were extracted as non-DHS samples.
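The negative-sample rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function names `candidate_centers` and `sample_negatives` are hypothetical, the stopping rule is read as "keep a coordinate only while it is at least 2000 bp from both ends of the fragment", and each non-DHS window is centered on its coordinate.

```python
import random

def candidate_centers(start, end, step=1000, margin=2000):
    """Coordinates p ± step*n around the fragment midpoint p.

    Assumption: a coordinate is kept only while it stays at least
    `margin` bp away from both ends of the fragment [start, end).
    """
    p = (start + end) // 2
    centers = [p]
    n = 1
    while True:
        left, right = p - step * n, p + step * n
        if left - start < margin or end - right < margin:
            break
        centers.extend([left, right])
        n += 1
    return sorted(centers)

def sample_negatives(start, end, rng, min_len=50, max_len=300):
    """Draw one non-DHS window of random length (50-300 bp) per coordinate."""
    regions = []
    for center in candidate_centers(start, end):
        length = rng.randint(min_len, max_len)  # inclusive on both ends
        half = length // 2
        regions.append((center - half, center - half + length))
    return regions

# Hypothetical 12 000 bp gap between two adjacent DHSs.
negatives = sample_negatives(0, 12000, random.Random(0))
```

Because the windows are at most 300 bp long and the coordinates are 1000 bp apart, the sampled non-DHS regions within one fragment do not overlap.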

Figure 2. Information on DHSs in the mouse genome. The length distribution of DHSs (A). The distribution of distances between two adjacent DHSs (B). The proportion of cell types in the benchmark dataset (C). The proportion of different developmental stages in the benchmark dataset of each cell type (D). The proportion of different developmental stages in the benchmark dataset (E).

Because evaluating a model on redundant samples overestimates its performance, highly similar sequences must be excluded. Here, we used the CD-HIT program [26] to remove redundant samples with a sequence identity cut-off of 0.8. On the basis of the above steps, the benchmark datasets were obtained. The proportions of DHSs in different cell types and developmental stages are shown in Figure 2C and E, respectively. As can be seen from Figure 2D, the dataset for each cell type is composed of data from one or more developmental stages. Generally, an independent dataset should be established to objectively evaluate a proposed model. Therefore, we divided the final datasets into training datasets and independent datasets in a ratio of 7:3 [27]. Details about the data are listed in Supplementary Table S1 available online at http://bib.oxfordjournals.org/.
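As an illustration of the 7:3 division, the following minimal sketch shuffles a list of samples and cuts it at 70%. The function name `split_train_independent` and the fixed random seed are assumptions made here; the text does not state how the split was randomized or whether it was stratified by class.

```python
import random

def split_train_independent(samples, train_fraction=0.7, seed=0):
    """Shuffle the samples and split them into training and independent sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility (assumption)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical usage with ten sample identifiers.
train_set, independent_set = split_train_independent(range(10))
```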

Design of iDHS-Deep model and training strategy

Deep learning techniques provide new strategies for automatically detecting discriminative features with neural network models [28–34]. The Convolutional Neural Network (CNN) [35] and the Long Short-Term Memory (LSTM) network [36] are two of the most widely used deep neural networks [37], because they can capture motifs and domains and extract global sequence order information. Thus, we used CNN and LSTM to construct a prediction model, called iDHS-Deep, to predict DHSs. The model consists of three modules: an input module, a feature extraction module and a classification module. The network architecture is shown in Figure 3.

Figure 3. Visualization of the detailed architecture of iDHS-Deep.

In the input module, the input is a DHS or non-DHS sequence. Unlike the traditional binary coding strategy that has been widely used to encode DNA [38–40], we directly assigned a different positive integer value to each of the four bases. Thus, an arbitrary DNA sequence can be converted into a string of numbers. DNA sequences shorter than 300 bp were padded with 0 to a length of 300 using the Keras function pad_sequences. This generates a two-dimensional integer tensor with the shape (samples, sequence_length). An embedding layer then converts the two-dimensional integer tensor into a three-dimensional floating-point tensor with the shape (samples, sequence_length, embedding_dimensionality). Here, ‘samples’ is the total number of DHS and non-DHS sequences, ‘sequence_length’ is 300 and ‘embedding_dimensionality’ was set to 128. Finally, the three-dimensional floating-point tensor was used as the input of the feature extraction module.
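The tensor shapes in the input module can be illustrated with a small NumPy sketch. Several details are assumptions for illustration only: the specific base-to-integer mapping (A=1, C=2, G=3, T=4), padding at the end of the sequence rather than the start, mapping unknown bases to 0, and a random matrix standing in for the trainable embedding weights.

```python
import numpy as np

BASE_TO_INT = {"A": 1, "C": 2, "G": 3, "T": 4}  # assumed mapping; 0 is the padding value
MAX_LEN = 300
EMBEDDING_DIM = 128

def encode(seq, max_len=MAX_LEN):
    """Convert a DNA string to integers and zero-pad it to max_len (post-padding)."""
    ids = [BASE_TO_INT.get(base, 0) for base in seq.upper()[:max_len]]
    return ids + [0] * (max_len - len(ids))

def encode_batch(seqs):
    """Stack encoded sequences into a (samples, sequence_length) integer tensor."""
    return np.array([encode(s) for s in seqs], dtype=np.int64)

# Stand-in for a trainable embedding table: one row per integer code 0..4.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(5, EMBEDDING_DIM))

batch = encode_batch(["ACGT", "GGGTTTAACC"])  # shape (2, 300)
embedded = embedding_table[batch]             # shape (2, 300, 128)
```

Indexing the table with the integer tensor is what an embedding layer does in its forward pass; in a real model the table would be learned during training.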

The feature extraction module is responsible for finding effective features in DNA sequences [41]. Its basic architecture consists of five layers: a first convolutional layer, a first pooling layer, a second convolutional layer, a second pooling layer and one LSTM layer. Each convolutional layer was used for feature extraction, together with a rectifier operation (ReLU) that propagates positive outputs and eliminates negative outputs. Each max-pooling layer was then used to reduce dimensions and help extract higher-level features. In this module, the stacked convolution and pooling layers enabled the network to extract features from larger spatial ranges and potentially capture interactions between sequence motifs. The subsequent LSTM layer further captured short- and long-term dependencies in sequences and extracted context features from the pooled sequence patterns [42].
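To make the convolution, ReLU and max-pooling steps concrete, here is a minimal NumPy sketch of two stacked convolution/pooling stages applied to one embedded sequence. It is not the iDHS-Deep implementation: the kernel widths, filter counts and pool sizes are hypothetical, the weights are random, and the LSTM layer is omitted.

```python
import numpy as np

def conv1d_relu(x, kernels, bias):
    """Valid 1D convolution followed by ReLU.

    x: (length, channels); kernels: (width, channels, filters); bias: (filters,).
    """
    width = kernels.shape[0]
    # Collect every window of `width` positions: (out_len, width, channels).
    windows = np.stack([x[i:i + width] for i in range(x.shape[0] - width + 1)])
    out = np.tensordot(windows, kernels, axes=([1, 2], [0, 1])) + bias
    return np.maximum(out, 0.0)  # ReLU keeps positive outputs, zeroes negatives

def max_pool1d(x, size):
    """Non-overlapping max-pooling along the sequence axis."""
    usable = (x.shape[0] // size) * size
    return x[:usable].reshape(-1, size, x.shape[1]).max(axis=1)

rng = np.random.default_rng(0)
x = rng.normal(size=(300, 128))  # one embedded sequence
h = max_pool1d(conv1d_relu(x, rng.normal(size=(8, 128, 16)), np.zeros(16)), 2)
h = max_pool1d(conv1d_relu(h, rng.normal(size=(8, 16, 32)), np.zeros(32)), 2)
```

Each stage shortens the sequence axis while increasing the span of input positions that a single output value depends on, which is how the stacked layers reach larger spatial ranges.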

In the classification module, a fully connected layer [43] following the LSTM layer was applied to mine the deep hidden DNA sequence features and the global sequence order information. Finally, the output vector of the fully connected layer was used as the input of the softmax layer, which generated the corresponding classification probability for a query sequence.

Implementation and training of iDHS-Deep

In this study, we used Keras (2.2.2) [44] with the TensorFlow (1.2.1) backend [45] to implement iDHS-Deep. The output node uses the sigmoid function as its activation function, whereas all the other nodes use the rectified linear unit (ReLU) [46]. To avoid overfitting and internal covariate shift, the dropout technique was employed. The batch size was set to 128.

Cross-validation is generally used to evaluate the performance of a model [47]. In this procedure, the original dataset is partitioned into groups; one part is used as the training set and the other part is used as the validation set. The classifier is then trained on the training set and tested on the validation set to evaluate the performance of the model. In practice, the generalization ability of a machine learning model should be further tested on a new independent dataset. Therefore, we first used the 5-fold cross-validation strategy to examine the classification model. Once the model was determined, an independent dataset test was applied to further evaluate the model’s performance.
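The 5-fold partitioning described above can be sketched as an index generator. This is a generic illustration with a hypothetical function name (`k_fold_indices`) and an assumed fixed seed; it does not reproduce the exact fold assignment used in the study.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

# Every sample is used for validation exactly once across the five folds.
splits = list(k_fold_indices(23))
```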

In the cross-validation and independent dataset tests, four metrics were used to evaluate the performance of the proposed predictor: sensitivity, specificity, overall accuracy and the Matthews correlation coefficient [48–51]. In addition, the area under the receiver operating characteristic curve (AUROC) [52] was calculated to quantitatively and objectively evaluate the predictive ability of the model.
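For reference, the four threshold-based metrics and the AUROC can be computed from their standard definitions as in the sketch below. The function names are hypothetical, labels are assumed to be 1 for DHS and 0 for non-DHS, and the AUROC is computed with the rank-based (Mann-Whitney) formulation, which is quadratic in the number of samples and intended only for small examples.

```python
import math

def binary_metrics(y_true, y_pred):
    """Sensitivity (Sn), specificity (Sp), accuracy (Acc) and MCC from 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Sn": tp / (tp + fn) if tp + fn else 0.0,
        "Sp": tn / (tn + fp) if tn + fp else 0.0,
        "Acc": (tp + tn) / len(y_true),
        "MCC": (tp * tn - fp * fn) / denom if denom else 0.0,
    }

def auroc(y_true, scores):
    """Probability that a random positive scores higher than a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0]
metrics = binary_metrics(labels, [1, 1, 0, 0, 0, 1])
area = auroc(labels, [0.9, 0.8, 0.4, 0.3, 0.2, 0.6])
```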

RESULTS AND DISCUSSIONS

iDHS-Deep accurately predicts DNase I hypersensitive sites

As described in Materials and Methods, DHS classification models were established using 5-fold cross-validation on the training datasets of tissues and developmental stages. To show the performance of the models visually, the AUROC values are plotted in Figure 4A and B. The AUROCs yielded by the models based on different tissues and different developmental stages are 0.88–0.96 and 0.90–0.95, respectively. Thus, iDHS-Deep achieved good performance in detecting DHSs, indicating that the combined CNN-LSTM architecture is feasible and effective for classifying DHSs.

Figure 4. The 5-fold cross-validation, independent dataset validation and cross-cell validation used to analyze the robustness and reliability of the proposed models. Histograms show the performance of 5-fold cross-validation on the training datasets and of independent dataset validation for the DHS datasets of tissues (A) and developmental stages (B). Heat maps show the AUROC values in cross-cell type validation (C) and cross-developmental stage validation (D). Once a cell-specific model was established on its own training dataset (rows), it was validated on the data from the same cell type as well as on the independent data from the other datasets (columns).

To further assess the robustness and generalization of the models generated above, we examined their performance on independent datasets. The details of the evaluation are recorded in Supplementary Table S2 available online at http://bib.oxfordjournals.org/. Our models achieved AUROC values of 0.89–0.95 and 0.90–0.94 for tissues (Figure 4A) and developmental stages (Figure 4B), respectively. These results suggest that the proposed classification method iDHS-Deep is capable of identifying potential DHSs in different tissues and developmental stages.

Figure 5. Comparison of iDHS-Deep with other published methods.

To provide an accessible tool, we built a user-friendly online web server called iDHS-Deep based on the above-mentioned models, which can be accessed at http://lin-group.cn/server/iDHS-Deep. It is the first web server for the mouse genome that identifies whether an unknown sequence is a potential DHS and to which developmental stage this potential DHS belongs. The tool should be convenient for scholars without a background in computer science or mathematics.

Figure 6. The distribution of DHSs on known genes (A) and the distribution of GC content along DHS sequences for four developmental stages (B).

Cross-tissues/developmental stages validation

After obtaining robust classification models for the different tissues and developmental stages, one naturally wonders whether a model trained with the data from one tissue/developmental stage could recognize the DHS sequences in other tissues/developmental stages. To examine this potential relationship between different tissues and developmental stages, we further conducted cross-tissue/developmental stage validation. Based on the idea of transfer learning [53], we trained the model on one tissue or developmental stage and then predicted DHS sequences in other tissues or developmental stages. For ease of inspection, the AUROC values are represented as heat maps in Figure 4C and D to describe the prediction performance of cross-tissue and cross-developmental stage validation. The models in rows were tested on the datasets in columns.

We observed that all calculated AUROC values are greater than 0.81 and 0.87 for the cross-tissue validation (Figure 4C) and cross-developmental stage validation (Figure 4D), respectively. These results suggest that DHS sequences of one specific tissue or developmental stage can be accurately identified as potential DHSs by models constructed on data from other tissues or developmental stages. In particular, almost all models tend to achieve better results (AUROCs >0.89) on the heart-based dataset in cross-tissue validation. In addition, owing to the low heterogeneity among the forebrain, midbrain and hindbrain of mouse, it is expected that the cross-tissue examinations among them produce better performance (AUROCs >0.92). Furthermore, we found that most models achieve satisfactory AUROCs on their own datasets (Figure 4C and D), which further supports the robustness and stability of the proposed models.

Compared with published methods

To further assess the proposed method, we compared it with other published methods. To date, many excellent machine learning methods have been published for predicting DHSs in the human genome and plant genomes (Table 1). However, it should be noted that these published tools were not based on DHS datasets of the mouse genome. To provide a fair comparison, we rebuilt the models of some typical published methods [13, 15, 16, 19, 20] using the same assessment criteria and the benchmark dataset of neural tube. The corresponding performance of the models is recorded in Supplementary Table S3 available online at http://bib.oxfordjournals.org/. As shown in Figure 5, all the evaluation metrics suggest that iDHS-Deep is superior to the other models. This result indicates that our proposed method is powerful and reliable for DHS identification in the mouse genome.

Enrichment of DHSs within known genes

To study the pattern of DHSs in different regions of genes, we collected the known genes of mouse from UCSC [54]. The coordinates of DHSs were mapped within a region that spanned the known gene as well as 2 kb upstream and downstream of the transcription start site (TSS). The distribution of DHSs in this region is plotted on a 2D plane in Figure 6A. It can be easily observed that DHSs are significantly enriched within 500 bp downstream of the TSS (marked by the dark rectangle). That is to say, DHSs are enriched in regions near known genes, which is consistent with a previous study showing that DHSs in the human genome were enriched near or within known genes [55].

Enrichment of DHSs within GC-rich regions

The different developmental stages in mammals are orchestrated by genome-encoded regulatory elements that create a dynamic chromatin landscape. Therefore, it is conceivable that DHSs also change dynamically at the sequence level at different stages of development. Because the vast majority of DHSs are less than 300 bp in length (Figure 2A), we selected all DHSs with lengths less than 300 bp for further sequence analysis. The sequence fragments from 300 bp upstream to 400 bp downstream of the midpoint of these DHS sequences were then extracted. Next, a sliding window of length 100 bp was used to calculate the GC content [56] at each base position. Finally, the GC content of DHS sequences at different developmental stages is shown in Figure 6B, in which the dashed line is the average GC content of the mouse genome and the dark rectangle represents the DHS region.
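The sliding-window GC calculation can be sketched as follows. The function name `gc_profile` is hypothetical, and the sketch assumes that each value is reported for a full 100 bp window starting at a given position; the text does not specify how the window is aligned to each base.

```python
def gc_profile(seq, window=100):
    """GC fraction of every full window of `window` bases along the sequence."""
    seq = seq.upper()
    is_gc = [1 if base in "GC" else 0 for base in seq]
    if len(seq) < window:
        return []
    count = sum(is_gc[:window])
    profile = [count / window]
    # Slide by one base: add the entering base, drop the leaving base.
    for i in range(window, len(seq)):
        count += is_gc[i] - is_gc[i - window]
        profile.append(count / window)
    return profile

# Toy example with a 4 bp window on an 8 bp sequence.
toy_profile = gc_profile("GGCCAATT", window=4)
```

Averaging such profiles position by position over all DHSs of one developmental stage would give one curve of the kind shown in Figure 6B.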

First, we can see that DHSs are located in GC-rich regions, where the GC content exceeds the genome average. This is not unexpected, as GC-rich regions are likely to be sites for regulatory signals and are significantly more hypersensitive to DNase digestion than random coordinates. Second, there are differences in the GC content of the DHS regions among the four developmental stages. In particular, the GC content in the ESC and Early Fetal stages is higher than in the Late Fetal and Adult stages. Sequences with high GC content may have a higher degree of DNA methylation, which may reduce the levels of DNA transcription and gene expression. Therefore, gene expression levels change dynamically across the developmental stages of mouse.

CONCLUSIONS

The recognition of DHSs helps to identify the precise location of many different regulatory elements in specific, well-studied genes. Here, a CNN-LSTM-based method named iDHS-Deep was developed to identify DHSs in different tissues and developmental stages in the mouse genome using sequence-based features. The performance of the models was demonstrated using both 5-fold cross-validation and independent testing. The good prediction performance of the proposed method also demonstrates that CNN-LSTM can extract effective features for predicting DHSs in mouse. Based on the proposed model, we built a freely available web server to judge whether an unknown sequence is a potential DHS. Furthermore, DHS sequences are enriched within regions of known genes, especially near the gene 5′ UTR. The high GC content in DHSs suggests dynamic epigenetic regulation across different tissues and developmental stages. We hope this work can provide a theoretical guide for research on gene expression mechanisms and for experimental research in related fields.

Data availability

We provide the Python source code of iDHS-Deep model training, which is freely available at http://lin-group.cn/server/iDHS-Deep/download.html.

Author contributions

Conceptualization: H. Lin and F.-Y.D. Investigation: F.-Y.D., H. Lv and Z.-J.S. Coding: F.-Y.D., W.S. and Q.-L.H. Writing—Original Draft: F.-Y.D. and H. Lv. Writing—Review and Editing: H. Lin. Funding acquisition: H. Lin.

Key Points
  • We provide the first machine learning method to identify the DHSs in mouse genome with CNN-LSTM-based model.

  • Experiments based on cross-validation and independent datasets validation demonstrate the excellent superiority of iDHS-Deep.

  • We briefly discussed the distribution of DHSs within known genes and GC-rich regions.

  • A user-friendly web-server iDHS-Deep at http://lin-group.cn/server/iDHS-Deep/ was built to detect DHSs in different tissues and developmental stages in mouse genome.

Funding

National Natural Science Foundation of China (61772119); Sichuan Provincial Science Fund for Distinguished Young Scholars (2020JDJQ0012).

Declaration of interests

The authors declare that they have no competing interests.

Fu-Ying Dao is a PhD candidate at the Center for Informational Biology at University of Electronic Science and Technology of China. Her research interests include bioinformatics.

Hao Lv is a PhD candidate at the Center for Informational Biology at University of Electronic Science and Technology of China. His research interests include bioinformatics.

Wei Su is an MS candidate at the Center for Informational Biology at University of Electronic Science and Technology of China. His research interests include bioinformatics.

Zi-Jie Sun is an MS candidate at the Center for Informational Biology at University of Electronic Science and Technology of China. Her research interests are bioinformatics and machine learning.

Qin-Lai Huang is an MS candidate at the Center for Informational Biology at University of Electronic Science and Technology of China. His research interests are bioinformatics.

Hao Lin is a professor at the Center for Informational Biology at University of Electronic Science and Technology of China. His research is in the areas of bioinformatics and systems biology.

References

1. Elgin SC. DNAase I-hypersensitive sites of chromatin. Cell 1981;27:413–5.

2. Wittkopp PJ, Kalay G. Cis-regulatory elements: molecular mechanisms and evolutionary processes underlying divergence. Nat Rev Genet 2012;13:59–69.

3. Li H, Ta N, Long C, et al. The spatial binding model of the pioneer factor Oct4 with its target genes during cell reprogramming. Comput Struct Biotechnol J 2019;17:1226–33.

4. Meuleman W, Muratov A, Rynes E, et al. Index and biological spectrum of human DNase I hypersensitive sites. Nature 2020;584:244–51.

5. D'Antonio M, Weghorn D, D'Antonio-Chronowska A, et al. Identifying DNase I hypersensitive sites as driver distal regulatory elements in breast cancer. Nat Commun 2017;8:436.

6. Carrasquillo MM, Allen M, Burgess JD, et al. A candidate regulatory variant at the TREM gene cluster associates with decreased Alzheimer's disease risk and increased TREML1 and TREM2 brain gene expression. Alzheimers Dement 2017;13:663–73.

7. Mokry M, Harakalova M, Asselbergs FW, et al. Extensive association of common disease variants with regulatory sequence. PLoS One 2016;11:e0165893.

8. He Y, Carrillo JA, Luo J, et al. Genome-wide mapping of DNase I hypersensitive sites and association analysis with gene expression in MSB1 cells. Front Genet 2014;5:308.

9. Lu F, Liu Y, Inoue A, et al. Establishing chromatin regulatory landscape during mouse preimplantation development. Cell 2016;165:1375–88.

10. Morin A, Kwan T, Ge B, et al. Immunoseq: the identification of functionally relevant variants through targeted capture and sequencing of active regulatory regions in human immune cells. BMC Med Genomics 2016;9:59.

11. Song L, Crawford GE. DNase-seq: a high-resolution technique for mapping active gene regulatory elements across the genome from mammalian cells. Cold Spring Harb Protoc 2010;2010:pdb.prot5384.

12. Chen Y, Chen A. Unveiling the gene regulatory landscape in diseases through the identification of DNase I-hypersensitive sites. Biomed Rep 2019;11:87–97.

13.

Noble
WS
,
Kuehn
S
,
Thurman
R
, et al.
Predicting the in vivo signature of human gene regulatory sequences
.
Bioinformatics
2005
;
21
(
Suppl 1
):
i338
43
.

14.

Feng
P
,
Jiang
N
,
Liu
N
.
Prediction of DNase I hypersensitive sites by using pseudo nucleotide compositions
.
Scientific World Journal
2014
;
2014
:
740506
.

15.

Liu
B
,
Long
R
,
Chou
KC
.
iDHS-EL: identifying DNase I hypersensitive sites by fusing three different modes of pseudo nucleotide composition into an ensemble learning framework
.
Bioinformatics
2016
;
32
:
2411
8
.

16.

Xu
ZC
,
Jiang
SY
,
Qiu
WR
, et al.
iDHSs-PseTNC: identifying DNase I hypersensitive sites with pseuo trinucleotide component by deep sparse auto-encoder
.
Letters in Organic Chemistry
2017
;
14
:655–664.

17.

Manavalan
B
,
Shin
TH
,
Lee
G
.
DHSpred: support-vector-machine-based human DNase I hypersensitive sites prediction using the optimal features selected by random forest
.
Oncotarget
2018
;
9
:
1944
56
.

18.

Liang
Y
,
Zhang
S
.
iDHS-DMCAC: identifying DNase I hypersensitive sites with balanced dinucleotide-based detrending moving-average cross-correlation coefficient
.
SAR QSAR Environ Res
2019
;
30
:
429
45
.

19. Zhang S, Yu Q, He H, et al. iDHS-DSAMS: identifying DNase I hypersensitive sites based on the dinucleotide property matrix and ensemble bagged tree. Genomics 2020;112:1282–9.

20. Zhang S, Xue T. Use Chou's 5-steps rule to identify DNase I hypersensitive sites via dinucleotide property matrix and extreme gradient boosting. Mol Genet Genomics 2020;295:1431–42.

21. Zhang S, Zhou Z, Chen X, et al. pDHS-SVM: a prediction method for plant DNase I hypersensitive sites based on support vector machine. J Theor Biol 2017;426:126–33.

22. Zhang S, Zhuang W, Xu Z. Prediction of DNase I hypersensitive sites in plant genome using multiple modes of pseudo components. Anal Biochem 2018;549:149–56.

23. Zhang S, Chang M, Zhou Z, et al. pDHS-ELM: computational predictor for plant DNase I hypersensitive sites based on extreme learning machines. Mol Genet Genomics 2018;293:1035–49.

24. Zhang S, Lin J, Su L, et al. pDHS-DSET: prediction of DNase I hypersensitive sites in plant genome using DS evidence theory. Anal Biochem 2019;564-565:54–63.

25. Breeze CE, Lazar J, Mercer T, et al. Atlas and developmental dynamics of mouse DNase I hypersensitive sites. bioRxiv 2020; doi: .

26. Fu L, Niu B, Zhu Z, et al. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 2012;28:3150–2.

27. Dao FY, Lv H, Zulfiqar H, et al. A computational platform to identify origins of replication sites in eukaryotes. Brief Bioinform 2020; doi: .

28. Li FY, Chen JX, Leier A, et al. DeepCleave: a deep learning predictor for caspase and matrix metalloprotease substrates and cleavage sites. Bioinformatics 2020;36:1057–65.

29. Si D, Moritz SA, Pfab J, et al. Deep learning to predict protein backbone structure from high-resolution Cryo-EM density maps. Sci Rep 2020;10:4282.

30. Stephenson N, Shane E, Chase J, et al. Survey of machine learning techniques in drug discovery. Curr Drug Metab 2019;20:185–93.

31. Cao RZ, Bhattacharya D, Hou J, et al. DeepQA: improving the estimation of single protein model quality with deep belief networks. BMC Bioinformatics 2016;17:495.

32. Dao FY, Lv H, Zhang D, et al. DeepYY1: a deep learning approach to identify YY1-mediated chromatin loops. Brief Bioinform 2020; doi: .

33. Wang J, Wang H, Wang X, et al. Predicting drug-target interactions via FM-DNN learning. Current Bioinformatics 2020;15:68–76.

34. Zou Q. Latest machine learning techniques for biomedicine and bioinformatics. Current Bioinformatics 2019;14:176–7.

35. Valueva MV, Nagornov NN, Lyakhov PA, et al. Application of the residue number system to reduce hardware costs of the convolutional neural network implementation. Mathematics and Computers in Simulation 2020;177:232–43.

36. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput 1997;9:1735–80.

37. Schmidhuber J. Deep learning in neural networks: an overview. Neural Netw 2015;61:85–117.

38. Chen Z, Zhao P, Li F, et al. iLearn: an integrated platform and meta-learner for feature engineering, machine-learning analysis and modeling of DNA, RNA and protein sequence data. Brief Bioinform 2020;21:1047–57.

39. Liu B, Gao X, Zhang HY. BioSeq-Analysis2.0: an updated platform for analyzing DNA, RNA and protein sequences at sequence level and residue level based on machine learning approaches. Nucleic Acids Res 2019;47:e127.

40. Lv H, Dao FY, Zhang D, et al. iDNA-MS: an integrated computational tool for detecting DNA modification sites in multiple genomes. iScience 2020;23:100991.

41. Liu D, Li G, Zuo Y. Function determinants of TET proteins: the arrangements of sequence motifs with specific codes. Brief Bioinform 2019;20:1826–35.

42. Donahue J, Hendricks LA, Rohrbach M, et al. Long-term recurrent convolutional networks for visual recognition and description. IEEE Trans Pattern Anal Mach Intell 2017;39:677–91.

43. Schwing AG, Urtasun R. Fully connected deep structured networks. arXiv preprint 2015; arXiv:1503.02351.

44. Chollet F. Keras: Deep learning library for theano and tensorflow. [Online] 2015; Available: https://keras.io/.

45. Girija SS. Tensorflow: large-scale machine learning on heterogeneous distributed systems. Software available from tensorflow. arXiv preprint 2016; arXiv:1603.04467.

46. Agarap AF. Deep learning using rectified linear units (relu). arXiv preprint 2018; arXiv:1803.08375.

47. Stone M. Cross-validatory choice and assessment of statistical predictions. J R Stat Soc B Methodol 1974;36:111–33.

48. Liu B, Han L, Liu X, et al. Computational prediction of Sigma-54 promoters in bacterial genomes by integrating motif finding and machine learning strategies. IEEE/ACM Trans Comput Biol Bioinform 2019;16:1211–8.

49. Charoenkwan P, Nantasenamat C, Hasan MM, et al. iTTCA-Hybrid: improved and robust identification of tumor T cell antigens by utilizing hybrid feature representation. Anal Biochem 2020;599:113747.

50. Liu KW, Chen W. iMRM: a platform for simultaneously identifying multiple kinds of RNA modifications. Bioinformatics 2020;36:3336–42.

51. Manavalan B, Subramaniyam S, Shin TH, et al. Machine-learning-based prediction of cell-penetrating peptides and their uptake efficiency with improved accuracy. J Proteome Res 2018;17:2715–26.

52. Cao R, Lopez-de-Ullibarri I. ROC curves for the statistical analysis of microarray data. Methods Mol Biol 2019;1986:245–53.

53. Mazo C, Bernal J, Trujillo M, et al. Transfer learning for classification of cardiovascular tissues in histological images. Comput Methods Programs Biomed 2018;165:69–76.

54. Fujita PA, Rhead B, Zweig AS, et al. The UCSC genome browser database: update 2011. Nucleic Acids Res 2011;39:D876–82.

55. Crawford GE, Holt IE, Mullikin JC, et al. Identifying gene regulatory elements by genome-wide recovery of DNase hypersensitive sites. Proc Natl Acad Sci U S A 2004;101:992–7.

56. Han X, Wang R, Zhou Y, et al. Mapping the mouse cell atlas by microwell-Seq. Cell 2018;173:1307.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Supplementary data