Abstract

Meiotic recombination, which is initiated by double-strand DNA breaks, is one of the most important driving forces of biological evolution and plays important roles in genome diversity. This review first provides a comprehensive survey of the 15 computational methods developed for identifying recombination hotspots in Saccharomyces cerevisiae. These methods are discussed and compared in terms of underlying algorithms, extracted features, predictive capability and practical utility. Subsequently, a more objective benchmark data set was constructed to develop a new predictor, iRSpot-Pse6NC2.0 (http://lin-group.cn/server/iRSpot-Pse6NC2.0). To assess generalization ability, we compared iRSpot-Pse6NC2.0 with existing methods on chromosome XVI of S. cerevisiae. The independent data set test showed that the new predictor outperforms existing tools in the identification of recombination hotspots. We expect iRSpot-Pse6NC2.0 to become a useful tool for identifying recombination hotspots.

Introduction

Meiotic recombination, which is initiated by double-strand DNA breaks (DSBs), is one of the most important driving forces of biological evolution [1, 2]. Genetic recombination generates new combinations of alleles in each generation of diploid organisms [3, 4]. Thus, recombination has important roles in genome diversity and evolution. For example, it helps homologous chromosomes pair and alters genome structure by disrupting the linkage of sequence polymorphisms on the same DNA molecule [5–8].

Recombination does not occur randomly on chromosomes. Some genomic regions in which meiotic DSBs occur at relatively high frequencies are called recombination hotspots, while the other regions with a lower frequency of DSBs occurrence are referred to as recombination coldspots. Requirements to form DSBs include open chromatin structure, presence of certain histone modifications and binding of sequence-specific transcription factors at some loci [9, 10]. Formation of DNA DSBs also requires the covalent binding of topoisomerase-related Spo11 protein to small DNA fragments [11].

Investigating recombination events and identifying hotspots are important for understanding the mechanism of recombination initiation in Saccharomyces cerevisiae. Traditionally, the discovery of recombination hotspots has relied mainly on experimental methods. Gerton et al. [1] used DNA microarrays as a single-gene resolution method to estimate DSB formation adjacent to each open reading frame (ORF) in S. cerevisiae. Pan et al. [11] uncovered a genome-wide DSB map of unprecedented resolution and sensitivity by sequencing the small DNA fragments covalently bound to Spo11 protein. However, experimental methods are labour-intensive and costly. In the post-genomic age, rapidly accumulating genome data are critical for both basic research and applications. Machine learning-based approaches are generally robust and effective in dealing with biological problems and may play a complementary role to wet-lab experimental techniques. Therefore, it is important to apply machine learning algorithms to recognize recombination hotspots.

During the past few years, a number of methods have been developed to identify recombination hotspots. Zhou et al. [12] first established a support vector machine (SVM)-based model to discriminate recombination hotspots from coldspots in S. cerevisiae by using codon composition. Subsequently, Jiang et al. [13] developed a new model based on gapped dinucleotide composition and random forest (RF) to predict meiotic recombination hotspots and coldspots in S. cerevisiae. Later on, Liu et al. [14] proposed an increment of diversity combined with quadratic discriminant (IDQD) model to predict recombination hotspots. In 2013, Chen et al. [15] designed a new DNA sample descriptor called pseudo dinucleotide composition (PseDNC) to improve the prediction accuracy [16]. Inspired by the concept of PseDNC, Li et al. [17] and Qiu et al. [18] developed similar models to predict recombination hotspots. In 2014, Zhang and Liu [19] presented a model for predicting recombination hotspots based on sequence and nucleosome occupancy in yeast. Subsequently, Liu et al. [20] incorporated feature weights into the prediction model to identify recombination hotspots. To improve performance, Dong et al. [21] combined the Z curve with PseDNC to distinguish recombination hotspots from coldspots. By fusing different modes of pseudo K-tuple nucleotide composition (PseKNC) and of dinucleotide-based auto-cross covariance into a clustering-based ensemble classifier, another tool called iRSpot-EL was established [22]. In 2016, Kabir and Hayat [23] proposed a genetic algorithm-based ensemble model named iRSpot-GAEnsC for the discrimination between recombination hotspots and coldspots. In 2018, Zhang et al. constructed two models, iRSpot-ADPM [24] and iRSpot-PDI [25], to recognize recombination hotspots. Maruf and Shatabda [26] built a tool named iRSpot-SF for identifying recombination hotspots. More recently, Yang et al. [27] designed a powerful predictor called iRSpot-Pse6NC to identify recombination hotspots by incorporating the key hexamer compositions.

Although the models mentioned above reported good performance on recombination hotspot identification, they were trained and tested on different benchmark data sets, which prevents users from selecting the optimal model for their predictions. Moreover, as more and more sequencing data accumulate, it is natural to ask whether objective and stringent data are available to train a more powerful model. Thus, in this review, we provide a comprehensive survey of the most up-to-date progress in large-scale computational studies of recombination hotspot prediction. In total, we summarize the 15 methods listed in Table 1 with respect to the following aspects: benchmark data set construction, feature extraction, prediction algorithm, performance evaluation strategy and software usability.

Table 1

A comprehensive list of methods for predicting recombination hotspots in S. cerevisiae

| Method | Web server | Data set | Feature extraction | Feature selection | Algorithm | Evaluation strategy |
|---|---|---|---|---|---|---|
| SVM-FCU | No | S1 | FCU | None | SVM-RBF | 10-fold CV |
| RF-DYMHC | Yes | S1, S2 | Gapped dinucleotide composition features | None | RF | 5-fold CV and Jackknife validation |
| IDQD-4-mer | No | S1, S2, S3 | 4-mer | None | IDQD | 5-fold CV |
| iRSpot-PseDNC | Yes | S1, S2, S3 | PseDNC | None | SVM-RBF | 5-fold CV and Jackknife validation |
| SVM-NACPseDNC | No | S1 | Nucleic acid composition (NAC), n-tier NAC and PseDNC | None | SVM-linear kernel | Jackknife validation |
| iRSpot-TNCPseAAC | Yes | S1, S2, S3 | TNC and PseAAC | None | SVM-RBF | Jackknife validation |
| Weighted features | No | S1 | k-mer, physical properties | None | Weighted features | 5-fold CV |
| iRSpot-EL | Yes | S2 | PseKNC, DACC | F-score | SVM-RBF | 5-fold CV |
| IDQD-Nu-occ | No | S1 | 4-mer, Nu-occ | None | IDQD | 5-fold CV |
| iRSpot-GAEnsC | No | S1 | TNC, DNC | None | ANN + SVM + RF | Jackknife validation |
| HcsPredictor | Yes | S1 | PseDNC+Z | None | LDM | 5-fold CV and Jackknife validation |
| iRSpot-ADPM | No | S1, S2 | ADPM | Wrapper SVM | SVM-RBF | 5-fold CV and Jackknife validation |
| iRSpot-PDI | No | S1, S2 | PDI | Wrapper SVM | SVM-RBF | Jackknife validation |
| iRSpot-SF | Yes | S2 | Gapped dinucleotide, TF-IDF, K-mer, Reverse K-mer | SVM-RFE | SVM-RBF | 5-fold CV and Jackknife validation |
| iRSpot-Pse6NC | Yes | S3 | Pse6NC | Binomial distribution | SVM-RBF | 5-fold CV |
Table 2

The recombination hotspots data sets of S. cerevisiae

| Benchmark data set | Data sources | Equal length | Positive subset | Negative subset |
|---|---|---|---|---|
| S1 [15] | Gerton et al. [1] | No | 490 | 591 |
| S2 [23] | Gerton et al. [1] | No | 478 | 572 |
| S3 [28] | Gerton et al. [1] | Yes | 490 | 591 |
| S4 (This paper) | Pan et al. [10] | No | 3212 | 3202 |

Despite the great progress made in the above research, some problems remain. With the development of experimental technology, the resolution and sensitivity of genome-wide DSB maps have been greatly improved. However, the data sets used by the previous models have not been updated accordingly. In the previous data sets, because of the limitations of the experimental methods, only DSBs adjacent to ORFs could be found. Recombination hotspots are not limited to ORFs, so models trained on such data sets have limited applicability [11]. Furthermore, most models did not address the problem of the sliding window size. They can only predict whether an input sequence contains a recombination hotspot, but cannot predict where the hotspot is located [27]. When a genome is scanned with a model, the question of how to set the window becomes even more important.

To solve these problems, a new data set with high resolution and sensitivity was built to train a new predictor, in which the recombination hotspots are no longer limited to ORF [11]. The new model was examined on the chromosome XVI of S. cerevisiae. We believe that our work represents an important advancement in accelerating the discovery of recombination hotspots. We anticipate that our review will be helpful for future development of computational methods to efficiently and accurately identify recombination hotspots.

Methods and materials

The benchmark data sets

In 2000, Gerton et al. proposed a DNA microarray method to estimate DSB formation adjacent to each ORF in S. cerevisiae at single-gene resolution. To do so, they measured the hybridization ratio of a DSB-enriched probe (P2) to a total genomic probe (P1) [1]. The relative strength of the recombination rate was estimated by the P2/P1 hybridization ratio. If the P2/P1 hybridization ratio ranked in the top 12.5% in more than five of seven experiments, an ORF was classified as ‘hot’, and if it ranked in the bottom 12.5%, it was classified as ‘cold’. Accordingly, the authors found 177 hotspots and 40 coldspots. In another study [13], based on the data of Gerton et al., Jiang et al. reconstructed the data set by defining ORFs with a relative hybridization ratio ≥1.5 as hotspots and those with a relative hybridization ratio <0.82 as coldspots. Thus, Jiang et al. [13] obtained 490 hotspots and 591 coldspots, which formed the training data set. This data set is called S1 and is available in Supplementary Material S1.

In Liu et al.’s study [22], to avoid redundancy and reduce homology bias, sequences with similarity >75% were removed by using CD-HIT [28]. This yielded 478 hotspots and 572 coldspots, which form data set S2, available in Supplementary Material S2.

The sequences in Supplementary Materials S1 and S2 have different lengths, which makes it difficult to establish a widely useful model because users do not know what length a query DNA sequence should have. Thus, an equal-length data set was rebuilt [27]. The new data set, called S3, contains 490 recombination hotspots and 591 recombination coldspots, each 131 bp long, and is available in Supplementary Material S3.

In 2011, Pan et al. [11] uncovered a genome-wide DSB map with unprecedented resolution and sensitivity by sequencing the DNA fragments covalently bound to Spo11 protein. Based on these data, we constructed a new data set. To avoid redundancy and reduce homology bias [29], sequences with >75% similarity were removed using the CD-HIT procedure [28]. We obtained 3480 sequences with recombination hotspots and 3471 sequences without recombination hotspots in S. cerevisiae. Subsequently, we divided this new data set into two parts: a training data set containing sequences from chromosomes I to XV and an independent test data set containing sequences from chromosome XVI. The training data set can be formulated as follows:
$$\mathrm{S}={\mathrm{S}}^{+}\cup {\mathrm{S}}^{-}\tag{1}$$
where |$\mathrm{S}$| is the benchmark data set S4, |${\mathrm{S}}^{+}$| is the positive subset containing 3212 true recombination hotspot sequences and |${\mathrm{S}}^{-}$| is the negative subset containing 3202 non-hotspot sequences from chromosomes I–XV of S. cerevisiae, which are available in Supplementary Material S4. All four benchmark data sets are summarized in Table 2.

The independent test data set contains the remaining 268 true recombination hotspot sequences and 280 non-hotspot sequences from chromosome XVI of S. cerevisiae, which are available in Supplementary Material S5.
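The chromosome-based partition described above can be sketched in a few lines. This is only an illustration: the `Sample` record, its field names and the toy sequences are hypothetical, since the file format of the supplementary data is not specified here.

```python
from collections import namedtuple

# Hypothetical record type; the field names are assumptions for illustration.
Sample = namedtuple("Sample", ["chrom", "sequence", "is_hotspot"])

def split_by_chromosome(samples, test_chrom="XVI"):
    """Return (training, independent_test) lists split on chromosome name."""
    train = [s for s in samples if s.chrom != test_chrom]
    test = [s for s in samples if s.chrom == test_chrom]
    return train, test

# Toy records standing in for real hotspot / non-hotspot sequences.
samples = [
    Sample("I", "ACGTAC", True),
    Sample("XV", "TTGACA", False),
    Sample("XVI", "GGCATC", True),
    Sample("XVI", "ATATAT", False),
]
train, test = split_by_chromosome(samples)
```

Holding out a whole chromosome, rather than a random subset, keeps the test sequences physically separate from the training sequences.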

Feature extraction

How to represent a biological sequence as a vector is one of the most challenging problems in computational biology [30], because nearly all existing machine learning algorithms were developed to handle vectors rather than sequence samples [31]. To capture the key information of recombination hotspots, two types of feature encoding algorithms have been proposed, namely ORF-based methods and non-ORF-based methods.

ORF-based approaches

Codon use frequency

Codon use frequency (FCU) is the fusion of both codon usage bias and amino acid composition (AAC) signals [12, 32]. Each ORF was represented by a 61-dimensional vector with respect to the 61 sense codons. FCU value of the j-th codon for the i-th amino acid was calculated as follows:
$${\mathrm{FCU}}_{ij}=\frac{{\mathrm{obs}}_{ij}}{\mathrm{Total}}\tag{2}$$
where |${\mathrm{obs}}_{ij}$| is the observed number of the j-th codon for the i-th amino acid and Total is the total number of codons in the ORF.
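As a minimal sketch of the FCU encoding, the function below counts in-frame codons and divides by the total. Two choices are assumptions rather than details taken from the original method: the 61 sense codons are listed alphabetically, and stop codons are excluded from the total.

```python
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}
# The 61 sense codons in alphabetical order (the ordering is an assumption).
SENSE_CODONS = [c for c in ("".join(p) for p in product("ACGT", repeat=3))
                if c not in STOP_CODONS]

def codon_usage_frequency(orf):
    """Return a 61-dimensional vector of codon counts divided by the total."""
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3            # ignore a trailing partial codon
    codons = [orf[i:i + 3] for i in range(0, usable, 3)]
    sense = set(SENSE_CODONS)
    codons = [c for c in codons if c in sense]  # assumption: stop codons not counted
    total = len(codons)
    if total == 0:
        return [0.0] * len(SENSE_CODONS)
    return [codons.count(c) / total for c in SENSE_CODONS]

fcu = codon_usage_frequency("ATGGCTGCTTAA")     # codons: ATG, GCT, GCT, TAA
```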

Phase-specific PseDNC

Phase-specific PseDNC (psPseDNC) can reflect composition bias among three codon positions [21]. psPseDNC is inspired by the idea of the Z curve theory. Dong et al. [21] improved the PseDNC method to give psPseDNC. The following equations adopt a similar form to PseDNC:
(3)
where |${\theta}_{1-2,\lambda }$|, |${\theta}_{2-3,\lambda }$| and |${\theta}_{3-1,\lambda }$| are phase-specific order-correlated factors that reflect the sequence-order correlation. The feature vector of each DNA sequence is calculated as in PseDNC, with these three factors incorporated. Because a DNA sequence has three codon phases, it is represented by a |$(16+1)\times 3$|-dimensional vector.

Amino acid composition

When the sequence is an ORF, AAC can be used to extract features [18, 33, 34]. Every three bases encode one amino acid. AAC is the frequency of each type of amino acid occurring in the translated peptide sequence and can be calculated as follows:
$$\mathrm{AAC}(r)=\frac{N_{\mathrm{r}}}{N}\tag{4}$$

where |${N}_{\mathrm{r}}$| is the number of the amino acid type |$r$| and |$N$| is the length of the sequence.

Trinucleotide composition and pseudo amino acid composition (TNCPseAAC)

TNCPseAAC is a feature extraction technique by combining trinucleotide composition [35] and pseudo amino acid composition (PseAAC) [36]. For a given gene sequence, the TNCPseAAC feature vector is represented by the following:
(5)
where
(6)
where |$\omega$| is the weight factor; |${f}_u$| (|$1\le u\le 64$|) are the occurrence frequencies of the 64 trinucleotides in the DNA sequence, which reflect the short-range sequence-order effect; and |${\theta}_{u-64}$| are the correlation factors obtained via the PseAAC of the translated protein chain. |${\theta}_j$| is given by the following formula:
(7)
where |${\theta}_j(j=1,2,3,\dots, \lambda )$|is called the j-th tier correlation factor that reflects the sequence-order correlation between all the j-th most contiguous residues along a protein chain. In this study, the correlation function is given by the following:
(8)

where |${H}_n({A}_j)\ (n=1,2,\dots, 6)$| are the standardized values of six physicochemical properties of amino acids: hydrophobicity, hydrophilicity, side-chain mass, pK1 (α-COOH), pK2 (NH3) and pI. These values can be obtained from Ref. [18].

Non-ORF-based methods

Gapped dinucleotide composition features

The gapped dinucleotide composition uses gaps with k intervening bases between every dinucleotide within a sequence [13]. It encapsulates the composition and local order information of any two interval dinucleotides within a gene sequence rather than the adjacent dinucleotide composition. It can be defined as follows:
$${f}_{(k)}^i=\frac{{o}_{(k)}^i}{{n}_{(k)}}\tag{9}$$
where |${o}_{(k)}^i$| is the total number of occurrences of the i-th dinucleotide with k intervening bases and |${n}_{(k)}$| is the total number of all dinucleotides with k intervening bases. In this study, each sequence is represented by a (|${4}^2\times 2$|)-dimensional feature vector reflecting the dinucleotide compositions with intervals of k = 0 and k = 1.
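The gapped dinucleotide encoding can be sketched as follows. The function name and the choice to skip pairs containing non-ACGT characters are assumptions made for illustration.

```python
from itertools import product

DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]

def gapped_dinucleotide_composition(seq, gaps=(0, 1)):
    """Concatenate the 16 gapped-dinucleotide frequencies for each gap k."""
    seq = seq.upper()
    features = []
    for k in gaps:
        counts = dict.fromkeys(DINUCLEOTIDES, 0)
        total = 0
        # Pair base i with the base k + 1 positions downstream.
        for i in range(len(seq) - k - 1):
            pair = seq[i] + seq[i + k + 1]
            if pair in counts:              # skip pairs with ambiguous bases
                counts[pair] += 1
                total += 1
        features.extend(counts[d] / total if total else 0.0
                        for d in DINUCLEOTIDES)
    return features

gdc = gapped_dinucleotide_composition("ACGT")   # k=0: AC, CG, GT; k=1: AG, CT
```

With the default gaps of 0 and 1, the result has 32 entries, matching the dimension stated above.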

K-mer composition

The frequency of k-mer is the ratio of the number of occurrences of the i-th k-mer in the gene sequence to the number of occurrences of all k-mer in this sequence through a window of width k sliding across the gene sequence with a step of 1 [17, 20]. We calculated the frequency of k-mer in the gene sequence according to the following equation:
$${f}_i=\frac{{n}_i}{L-k+1}\tag{10}$$
where |${n}_i$| denotes the number of occurrences of the i-th k-mer and L denotes the length of the gene sequence. When k = 2, this method is called dinucleotide composition (DNC), and when k = 3, it is called trinucleotide composition (TNC). In this study, k can be any integer from 1 to 8. Accordingly, each sequence can be represented as a |${4}^k$|-dimensional vector.
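A minimal sketch of k-mer composition: a window of width k slides over the sequence with step 1, and each count is divided by the number of windows, L - k + 1. The alphabetical ordering of k-mers is an assumption.

```python
from itertools import product

def kmer_composition(seq, k):
    """Return the 4**k vector of k-mer frequencies (alphabetical k-mer order)."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    windows = len(seq) - k + 1                  # L - k + 1 sliding windows
    for i in range(max(windows, 0)):
        word = seq[i:i + k]
        if word in counts:                      # ignore windows with ambiguous bases
            counts[word] += 1
    return [counts[w] / windows if windows > 0 else 0.0 for w in kmers]

dnc = kmer_composition("ACGTAC", 2)             # windows: AC, CG, GT, TA, AC
```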

Reverse complement composition of k-mer

The reverse complement composition of k-mer is the normalized count of k-mer and its reverse complements present in the DNA sequence [26, 37]. The difference of this feature with k-mer composition is that the reverse complement composition of k-mer counts all the occurrences of the k-mer and its reverse complement.
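One way to sketch this feature is to map each k-mer and its reverse complement to a single canonical key before counting. Choosing the lexicographically smaller of the two as the key is an assumption made for illustration.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(word):
    """Return the reverse complement of a DNA word."""
    return word.translate(COMPLEMENT)[::-1]

def revcomp_kmer_composition(seq, k):
    """Normalized counts of k-mers merged with their reverse complements."""
    seq = seq.upper()
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    # Canonical key: the smaller of a k-mer and its reverse complement.
    canon = sorted({min(w, reverse_complement(w)) for w in all_kmers})
    counts = dict.fromkeys(canon, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        key = min(word, reverse_complement(word))
        if key in counts:                       # skips words with ambiguous bases
            counts[key] += 1
            total += 1
    return canon, [counts[c] / total if total else 0.0 for c in canon]

keys, rc = revcomp_kmer_composition("ACGT", 2)  # AC, CG, GT; GT merges into AC
```

For k = 2 the merged vector has 10 entries instead of 16, because four dinucleotides are their own reverse complement and the remaining twelve collapse into six pairs.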

Term frequency–inverse document frequency of k-mer

Term frequency (TF) is the normalized frequency, or composition, of a string in a given sequence [26]. Inverse document frequency (IDF), in contrast, gives greater weight to k-mers that are sparsely distributed across the sequences. It is formally defined as below:
(11)
where |$\mid S\mid$| is the number of sequences in the data set S and |$d({s}_i,D)$| is a function that returns 1 if and only if the k-mer is present in the sequence D, and otherwise returns 0. TF-IDF is the multiplication of TF and IDF for each of the k-mer.
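The sketch below illustrates TF-IDF for k-mers. It assumes the common form of IDF, the natural logarithm of the number of sequences divided by the number of sequences containing the k-mer, which may differ in detail (log base, smoothing) from the definition used in [26]. K-mers absent from every sequence are given an IDF of 0 here, which is also an assumption.

```python
import math
from itertools import product

def kmer_counts(seq, k):
    """Count overlapping k-mers in one sequence."""
    counts = {}
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        counts[word] = counts.get(word, 0) + 1
    return counts

def tf_idf(sequences, k):
    """Return (kmers, vectors): one TF-IDF vector per input sequence."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    per_seq = [kmer_counts(s.upper(), k) for s in sequences]
    n_seq = len(sequences)
    idf = {}
    for w in kmers:
        doc_freq = sum(1 for c in per_seq if w in c)   # sequences containing w
        idf[w] = math.log(n_seq / doc_freq) if doc_freq else 0.0
    vectors = []
    for counts in per_seq:
        total = sum(counts.values())
        vectors.append([(counts.get(w, 0) / total if total else 0.0) * idf[w]
                        for w in kmers])
    return kmers, vectors

kmers, vectors = tf_idf(["AAAA", "AACC"], k=2)
```

In this toy input, AA occurs in both sequences and therefore receives zero weight, whereas AC and CC occur only in the second sequence and are weighted up.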

Pseudo K-tuple nucleotide composition

PseKNC method not only reflects the local sequence information of the nucleotide sequence but also describes the global sequence information of the nucleotide sequence [16, 17]. For a given gene sequence, the PseKNC feature vector is represented by:
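$$D={\left[{d}_1\ {d}_2\ \dots\ {d}_{4^k}\ {d}_{4^k+1}\ \dots\ {d}_{4^k+\lambda}\right]}^{\mathrm{T}}\tag{12}$$
where
$${d}_u=\begin{cases}\dfrac{{f}_u}{\sum_{i=1}^{4^k}{f}_i+\omega \sum_{j=1}^{\lambda }{\theta}_j}, & 1\le u\le {4}^k\\[2ex] \dfrac{\omega\,{\theta}_{u-{4}^k}}{\sum_{i=1}^{4^k}{f}_i+\omega \sum_{j=1}^{\lambda }{\theta}_j}, & {4}^k<u\le {4}^k+\lambda \end{cases}\tag{13}$$
where u is an integer, |${f}_u$| (|$1\le u\le {4}^k$|) represents the normalized occurrence frequency of the u-th k-mer, |$\lambda$| is the total number of counted ranks (or tiers) of correlations along a DNA sequence and |$\omega$| is the weight factor. |${\theta}_j$| |$(j=1,2,3,\dots, \lambda )$| is the j-tier correlation factor, which measures the sequence-order correlation between all the j-th most contiguous k-mers along the sequence. For a sequence of length L, |${\theta}_j$| is defined as below:
$${\theta}_j=\frac{1}{L-j-k+1}\sum_{i=1}^{L-j-k+1}\varTheta \left({R}_i{R}_{i+1}\dots {R}_{i+k-1},\ {R}_{i+j}{R}_{i+j+1}\dots {R}_{i+j+k-1}\right)\tag{14}$$
Accordingly, |${\theta}_{\lambda }$| is the |$\lambda$|-tier correlation factor, the highest tier considered. The correlation function is given by the following:
$$\varTheta \left({R}_i{R}_{i+1}\dots {R}_{i+k-1},\ {R}_{i+j}{R}_{i+j+1}\dots {R}_{i+j+k-1}\right)=\frac{1}{\mu}\sum_{u=1}^{\mu }{\left[{P}_u({R}_i{R}_{i+1}\dots {R}_{i+k-1})-{P}_u({R}_{i+j}{R}_{i+j+1}\dots {R}_{i+j+k-1})\right]}^2\tag{15}$$
where |$\mu$| is the number of local DNA structural properties considered, |${P}_u({R}_i{R}_{i+1}\dots {R}_{i+k-1})$| is the numerical value of the u-th (u = 1, 2, …, |$\mu$|) DNA local property for the oligonucleotide |${R}_i{R}_{i+1}\dots {R}_{i+k-1}$| at position i and |${P}_u({R}_{i+j}{R}_{i+j+1}\dots {R}_{i+j+k-1})$| is the corresponding value for the oligonucleotide |${R}_{i+j}{R}_{i+j+1}\dots {R}_{i+j+k-1}$| at position i + j. These values are standardized by the following formula:
$${P}_u({R}_i{R}_{i+1}\dots {R}_{i+k-1})=\frac{{P}_u^0({R}_i{R}_{i+1}\dots {R}_{i+k-1})-\left\langle {P}_u^0\right\rangle }{\mathrm{SD}\left({P}_u^0\right)}\tag{16}$$
where < > represents the mean, SD represents the standard deviation and |${P}_u^0$| denotes the original physicochemical property value of the oligonucleotide, which can be obtained from [15]. Here k is the number of neighbouring nucleic acid residues, and PseDNC is the special case of PseKNC with k = 2.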
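The PseKNC construction can be sketched as below: k-mer frequencies are followed by λ correlation factors, and all components share one normalizing denominator. The property table `toy_property` is purely hypothetical (real values come from the cited physicochemical tables). The sketch also assumes that the input contains only A, C, G and T and that the population standard deviation is used for standardization.

```python
from itertools import product
from statistics import mean, pstdev

def standardize(raw):
    """Z-score one property over all k-mers: (value - mean) / SD."""
    mu, sd = mean(raw.values()), pstdev(raw.values())
    return {w: (v - mu) / sd for w, v in raw.items()}

def pseknc(seq, k, lam, weight, properties):
    """Return the (4**k + lam)-dimensional PseKNC vector.

    properties: list of dicts mapping every k-mer to a standardized value.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    words = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    freqs = [words.count(w) / len(words) for w in kmers]
    thetas = []
    for j in range(1, lam + 1):
        pairs = list(zip(words, words[j:]))     # k-mers at positions i and i + j
        theta = sum(
            sum((p[a] - p[b]) ** 2 for p in properties) / len(properties)
            for a, b in pairs
        ) / len(pairs)
        thetas.append(theta)
    denom = sum(freqs) + weight * sum(thetas)
    return [f / denom for f in freqs] + [weight * t / denom for t in thetas]

# Hypothetical property values, for illustration only.
kmers2 = ["".join(p) for p in product("ACGT", repeat=2)]
toy_property = standardize({w: float(i) for i, w in enumerate(kmers2)})
vec = pseknc("ACGTACGTAC", k=2, lam=3, weight=0.5, properties=[toy_property])
```

Because every component is divided by the same denominator, the components of the vector sum to 1, which is a convenient sanity check.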

Weighted features

To quantify the importance of different features in prediction and incorporate different kinds of features in a proper way, Liu et al. [20] weighted the features using a variance-based feature selection method. First, samples were represented by possibly important features, such as k-mer distribution in sequences and sequence-order-related information. Second, weights of different features were calculated according to their ability to distinguish positive samples from negative samples. Third, values of the features were normalized via a standard conversion. Finally, samples were represented by weighted features (WF) by multiplying the normalized features with their corresponding weights.

Physical properties of DNA sequences

Because nucleosome occupancy plays an important role in DSB formation, its influence on recombination spot prediction should be investigated. Fifteen physical properties of DNA sequences (Supplementary Material S6) have been used to describe nucleosomes and applied in recombination spot prediction [20]. Here, the physical properties were used as features to identify recombination hotspots by the following formula:
(25)
where |${P}_m(i,i+1)$| and |${P}_m(i+n,i+n+1)$| denote the values of the m-th property listed in Supplementary Material S6 at dinucleotide position |$(i,i+1)$| and |$(i+n,i+n+1)$|⁠, respectively, and |$L$| is sequence length. |${h}_{n,m}$| is called the n-th tier correlation factor that reflects the sequence-order correlation with respect to the m-th parameter between all the n-th most contiguous dinucleotide along a DNA sequence.

Associated dinucleotide product model (ADPM)

ADPM is an algorithm that uses an associated dinucleotide product model to define Chou’s pseudo components [24].

For a given DNA sequence D, the ADPM feature vector is represented by:
(17)
where
(18)
where |$k$| and |$t$| index the physical and chemical properties used in this study (|$k,t=1,2,\dots, 15$|) and |$g$| is an accumulated distance factor that determines the degree of separation between dinucleotides along the sequence. |${\varTheta}_{i,g,k,t}$| is called the product operator, which is defined by the following mathematical expression:
(19)

|${\varTheta}_{i,g,k,t}$| reflects the associated influence of dinucleotide pairs at different positions in the course of genomic evolution. |${P}_{i,k}$| denotes the k-th property value of the i-th dinucleotide pair, which is composed of two adjacent nucleotides in the sequence. The given sequence can thus be converted into a feature vector with a uniform size of 15 × 15 × |$g$|.

Property diversity information (PDI)

PDI is an improved method of ADPM [25]. PDI reflects the property differentiation between the dinucleotide pairs in |$g$| interval along the sequence and the forward and backward order information. For a given DNA sequence D, the PDI feature vector is represented by the following:
(20)
where
(21)
The parameters in this formula are the same as in Equation (18). |${\varTheta}_{i,g,k,t}$|, called the diversity function, is defined by the following mathematical equations:
(22)
 
(23)
 
(24)
where |${\theta}_1$| describes the differentiation between the i-th and the (i + g)-th dinucleotide according to properties t and k. Similarly, |${\theta}_2$| expresses the differentiation between the i-th and the (i − g)-th dinucleotide according to properties t and k. |${\varTheta}_{i,g,k,t}$| reflects the diversity information of dinucleotide pairs on different DNA properties at distance g.

Dinucleotide-based auto-cross covariance

Dinucleotide-based auto-cross covariance (DACC) approach is a very special PseKNC mode, which is a combination of dinucleotide-based auto covariance and dinucleotide-based cross covariance [20].

Feature selection methods

As the dimension of the feature vector rises rapidly, three issues arise: first, a large feature set contains redundant or irrelevant information; second, over-fitting lowers the generalization ability of the prediction model; third, the curse of dimensionality makes computation difficult. It is therefore necessary to employ a feature selection process to mitigate these problems [38–40]. Three main approaches to feature selection have been developed: filter, wrapper and embedded. Filters rely on the general characteristics of the training data and carry out feature selection as a pre-processing step, independently of the induction algorithm. Wrappers optimize a predictor as part of the selection process. Embedded methods perform feature selection during training and are usually specific to a given learning machine. The methods reviewed here draw on all three approaches.

Binomial distribution

To judge whether the occurrence of a particular feature is a stochastic event, the binomial distribution was used to measure the occurrence probabilities of all features in positive and negative samples [27]. A small P value indicates that the occurrence of the feature is unlikely to be a random event [41], so the P value serves as a measure of confidence in the feature. Incremental feature selection (IFS) was then used to find the feature subset that produces the maximum accuracy.
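A minimal sketch of binomial feature ranking follows. It assumes one common formulation: for a feature observed `n_j` times in total and `n_ij` times in class i, the P value is the upper tail of a binomial distribution whose success probability `q_i` is the share of all feature occurrences that fall in class i. The function names and toy counts are hypothetical.

```python
from math import comb

def binomial_p_value(n_ij, n_j, q_i):
    """P(X >= n_ij) for X ~ Binomial(n_j, q_i)."""
    return sum(comb(n_j, m) * q_i ** m * (1 - q_i) ** (n_j - m)
               for m in range(n_ij, n_j + 1))

def rank_features(counts_pos, counts_neg):
    """Rank feature indices from most to least class-specific (smallest P first)."""
    total_pos, total_neg = sum(counts_pos), sum(counts_neg)
    q_pos = total_pos / (total_pos + total_neg)
    q_neg = 1 - q_pos
    scored = []
    for j, (a, b) in enumerate(zip(counts_pos, counts_neg)):
        # Take the smaller of the two class-wise P values as the feature's score.
        p = min(binomial_p_value(a, a + b, q_pos),
                binomial_p_value(b, a + b, q_neg))
        scored.append((p, j))
    return [j for _, j in sorted(scored)]

# Toy counts of three features in positive and negative samples.
ranking = rank_features([9, 5, 1], [1, 5, 9])
```

The ranked list can then be passed to IFS, which adds features in this order and keeps the prefix with the highest accuracy.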

F-score

F-score is based on the idea that the differences within classes should be smaller than the differences between classes [42]. The larger the F-score of a feature, the greater the separation between the positive and negative classes on that feature, which indicates that the feature has stronger discriminative power and a greater effect on the prediction result. Implementation details can be found in a recent study [20].
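As an illustration, the widely used two-class form of the F-score divides the between-class separation of the class means by the sum of the within-class sample variances. The sketch below assumes this form; the cited study may differ in detail.

```python
from statistics import mean, variance

def f_score(pos_values, neg_values):
    """F-score of one feature from its values in the two classes."""
    overall = mean(pos_values + neg_values)
    between = ((mean(pos_values) - overall) ** 2
               + (mean(neg_values) - overall) ** 2)
    within = variance(pos_values) + variance(neg_values)   # sample variances
    return between / within if within else float("inf")

# A well-separated feature versus a noisy one (toy values).
separated = f_score([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
noisy = f_score([1.0, 5.0, 3.0], [2.0, 6.0, 4.0])
```

Features are then sorted by decreasing F-score, and the top-ranked ones are retained.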

Wrapper SVM

In this study, a wrapper feature selection process with SVM classifier is performed. Specifically, sequential forward selection strategy is adopted, in which features are sequentially added one by one to an empty candidate set until the addition of further features does not increase the criterion [25]. The criterion is defined by the overall accuracy obtained from SVM with n-fold cross-validation (CV). Finally, an optimal feature subset was constructed.
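The sequential forward selection loop can be sketched independently of the classifier. In the sketch below, `evaluate` stands for the criterion (in the reviewed work, the SVM cross-validation accuracy); the `toy_accuracy` function is a hypothetical stand-in so that the example runs without a machine learning library.

```python
def sequential_forward_selection(n_features, evaluate):
    """Greedily add the feature that most improves evaluate(subset)."""
    selected, best_score = [], float("-inf")
    remaining = set(range(n_features))
    while remaining:
        score, feature = max((evaluate(selected + [f]), f)
                             for f in sorted(remaining))
        if score <= best_score:     # stop when no candidate improves the criterion
            break
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score

def toy_accuracy(subset):
    """Hypothetical criterion: features 0 and 2 help, every feature has a small cost."""
    gains = {0: 0.30, 2: 0.15}
    return 0.5 + sum(gains.get(f, 0.0) for f in subset) - 0.01 * len(subset)

selected, best = sequential_forward_selection(4, toy_accuracy)
```

In practice, `evaluate` would train and cross-validate the classifier on the candidate subset, which makes wrapper methods considerably more expensive than filters.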

Recursive feature elimination for support vector machines

Recursive Feature Elimination for Support Vector Machines (SVM-RFE) was applied to select a group of important features [43]. This embedded method performs feature selection by iteratively training an SVM classifier with the current set of features and removing the least important feature indicated by the SVM [17, 26].

Model training

Support vector machine

SVM is a supervised machine learning algorithm that has been successfully applied in the field of bioinformatics [44–50]. The basic idea of SVM is to transform the data into a high-dimensional feature space and then determine the optimal separating hyperplane. The choice of kernel function plays a crucial role in performance. Commonly used kernel functions are the linear, polynomial, radial basis function (RBF) and sigmoid kernels. Several software packages have been developed for the implementation of SVM, such as LIBSVM, mySVM and SVMLight [51, 52]. In the SVM operation engine, the grid search method was applied to optimize the regularization parameter C and the kernel parameter γ in the following ranges:
(25)

Random forest

The RF algorithm is also a popular machine learning algorithm and has been successfully employed in dealing with various biological prediction problems [53–55]. RF is a classifier that contains multiple decision trees and adopts bagging together with a random feature selection strategy. In bagging, each tree is trained on a bootstrap sample of the training data, and predictions are made by the majority vote of the trees. RF is a further development of bagging: instead of using all features, RF randomly selects a subset of features as split candidates at each node when growing a tree. To assess prediction performance, RF performs a type of cross-validation in parallel with the training step by using the so-called out-of-bag samples [56]. The RF algorithm was implemented with the randomForest R package [13].

Recently, Ke et al. [57] proposed LightGBM, a gradient boosting decision tree algorithm that, unlike RF, builds its trees sequentially. It uses gradient-based one-side sampling to retain a relatively small number of samples according to their gradient values and exclusive feature bundling to reduce the number of features.

Increment of diversity

In bioinformatics, the increment of diversity (ID) method is widely applied to the prediction of sequences, structures and functions [58, 59]. Given the two diversity sources |$X=\{{x}_1,{x}_2,\dots, {x}_n\}$| and |$Y=\{{y}_1,{y}_2,\dots, {y}_n\}$|⁠, ID is defined as follows:
(26)
 
(27)
 
(28)
where |$N={\sum}_{i=1}^d{n}_i,M={\sum}_{i=1}^d{m}_i,{n}_i$| and |${m}_i$| indicate the absolute frequency of the i-th state in diversity source X and Y, respectively. |$D(X+Y)$| is the diversity measure of the mixed source |$X+Y$|⁠:|$\{{x}_1+{y}_1,{x}_2+{y}_2,\dots, {x}_n+{y}_n\}$|⁠. Diversity measure is a description of total uncertainty of information in a state space, and the ID quantitatively describes the similarity level of two diversity sources. The smaller the ID, the higher the similarity between the two corresponding diversity sources.
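A minimal Python sketch of the calculation, assuming the commonly used definitions D(X) = N log N - Σ n_i log n_i and ID(X, Y) = D(X + Y) - D(X) - D(Y). The base-2 logarithm and the function names are choices made for this example, not details taken from the cited studies.

```python
import math
from typing import Sequence


def diversity(counts: Sequence[float]) -> float:
    """D(X) = N*log2(N) - sum(n_i*log2(n_i)), with 0*log(0) taken as 0."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return total * math.log2(total) - sum(n * math.log2(n) for n in counts if n > 0)


def increment_of_diversity(x: Sequence[float], y: Sequence[float]) -> float:
    """ID(X, Y) = D(X + Y) - D(X) - D(Y) for two equally long count vectors."""
    if len(x) != len(y):
        raise ValueError("both sources must have the same number of states")
    mixed = [a + b for a, b in zip(x, y)]
    return diversity(mixed) - diversity(x) - diversity(y)


same = increment_of_diversity([4, 4], [4, 4])       # identical distributions
different = increment_of_diversity([8, 0], [0, 8])  # non-overlapping distributions
print(same, different)
```

Under these assumptions the identical sources give an ID of zero and the non-overlapping sources give a large positive ID, which matches the statement that a smaller ID means higher similarity.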

Quadratic discriminant

In Liu’s study [60], a quadratic discriminant (QD) function based on Mahalanobis distance was used to predict recombination hotspots. The Mahalanobis distance takes the correlations between variables into account, so that different patterns can be identified and analysed. It is a useful measure of the difference between an unknown sample set and a known one. The combination of the ID and QD algorithms led to a powerful classifier called IDQD.
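To make the distance concrete, the sketch below computes the squared Mahalanobis distance of a two-feature sample from a class, using the closed-form inverse of the 2 × 2 covariance matrix. It is an illustration only: the two-feature restriction, the toy points and the function names are assumptions, and a full quadratic discriminant typically adds covariance-determinant and prior terms that are omitted here.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def mean_and_covariance(points: Sequence[Point]):
    """Sample mean and 2x2 sample covariance (sxx, sxy, syy) of two-feature points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def squared_mahalanobis(x: Point, mean: Point, cov) -> float:
    """(x - mean)^T S^-1 (x - mean), using the closed-form inverse of a 2x2 matrix."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy  # must be non-zero for the inverse to exist
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det


# Toy class of four samples; not real hotspot data.
hot = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
mean_hot, cov_hot = mean_and_covariance(hot)
print(squared_mahalanobis((3.0, 1.0), mean_hot, cov_hot))
```

A sample would then be assigned to the class for which the discriminant built on this distance is most favourable.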

Large margin distribution machine

The margin distribution has a crucial influence on the performance of classifiers. The large margin distribution machine (LDM) is inspired by the observation that generalization performance can be improved by optimizing the margin distribution, that is, by maximizing the margin mean and minimizing the margin variance simultaneously [21, 61]. Using LDM for classification generally requires two steps: first, mapping the feature vector to a high-dimensional space; second, finding the hyperplane that maximizes the margin mean while minimizing the margin variance, thereby separating the samples. There are two solvers for the LDM: the dual coordinate descent method and the averaged stochastic gradient descent method. In our study, we use the dual coordinate descent method.

Artificial neural network

An artificial neural network (ANN) abstracts the neuron network of the human brain from the perspective of information processing, establishes a simple model and forms different networks according to different connection methods. Over the past decade or so, research on ANNs has deepened and made great progress, and ANNs have been successfully applied in the field of bioinformatics [62, 63]. In one study [23], a variety of ANN methods were applied, such as the probabilistic neural network, generalized regression neural network, feed-forward neural network, fitting network and pattern recognition network.

K-nearest neighbour (KNN)

KNN is a widely used algorithm in the field of pattern recognition [23]. The flexibility of KNN is of great advantage for gene function classification problems, in which the class boundaries are inherently vague and many classes cannot be described by a simple model. The main idea of the classical KNN is the following: design a set of numerical features to describe each data point and select a metric, e.g. Euclidean distance, to measure the similarity of data points based on all features. Then, for a target point, find its K closest points in the training samples according to the similarity metric and assign it to a class by majority vote of its neighbours.
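The classical procedure just described can be written as a short Python sketch. The training points, labels and the default K = 3 are toy assumptions for illustration.

```python
import math
from collections import Counter
from typing import Sequence, Tuple


def knn_predict(
    train: Sequence[Tuple[Sequence[float], str]],
    query: Sequence[float],
    k: int = 3,
) -> str:
    """Label `query` by majority vote among its k nearest training points."""
    # Euclidean distance is used as the similarity metric.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Toy two-feature training set; not real sequence descriptors.
train = [
    ([0.0, 0.0], "cold"),
    ([0.1, 0.2], "cold"),
    ([1.0, 1.0], "hot"),
    ([0.9, 1.1], "hot"),
    ([1.2, 0.8], "hot"),
]
print(knn_predict(train, [1.0, 0.9]))
print(knn_predict(train, [0.05, 0.1]))
```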

Table 3

Comparison with state-of-the-art predictors on data set S1

Methods | 5-fold cross-validation | Jackknife validation
Methods | Sn(%) | Sp(%) | Acc(%) | MCC | Sn(%) | Sp(%) | Acc(%) | MCC
RF-DYMHC | 80.59 | 84.26 | 82.05 | 0.638
IDQD-4-mer | 79.40 | 81.00 | 80.30 | 0.603
iRSpot-PseDNC | 81.63 | 88.14 | 85.19 | 0.692 | 73.06 | 89.49 | 82.04 | 0.638
SVM-NACPseDNC | 76.12 | 90.69 | 84.09 | 0.680
iRSpot-TNCPseAAC | 84.17 | 79.59 | 83.72 | 0.671
Weighted-Features | 86.10 | 94.30 | 90.70 | 0.812
IDQD-Nu-occ | 80.00 | 82.90 | 81.60 | 0.629
iRSpot-GAEnsC | 80.08 | 88.07 | 84.46 | 0.690
HcsPredictor | 77.82 | 89.78 | 84.36 | 0.685 | 78.78 | 90.69 | 85.29 | 0.704
ANN-AAC
iRSpot-ADPM | 75.51 | 90.52 | 83.72 | 0.673
iRSpot-PDI | 71.63 | 93.91 | 83.81 | 0.681

Performance evaluation

In order to measure the prediction quality, the following set of four metrics was used: sensitivity (Sn), specificity (Sp), overall accuracy (Acc) and Matthews correlation coefficient (MCC) [64–69].

Sn and Sp indicate the ability to correctly identify recombination hotspots and non-hotspots, respectively. Acc is the overall accuracy of discriminating between recombination hotspots and non-hotspots. MCC provides an intuitive measure of the quality of a binary classification. These measures are defined as follows:
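|$\mathrm{Sn}=1-\frac{N_{-}^{+}}{N^{+}}$|
|$\mathrm{Sp}=1-\frac{N_{+}^{-}}{N^{-}}$|
|$\mathrm{Acc}=1-\frac{N_{-}^{+}+N_{+}^{-}}{N^{+}+N^{-}}$|
|$\mathrm{MCC}=\frac{1-\left(\frac{N_{-}^{+}}{N^{+}}+\frac{N_{+}^{-}}{N^{-}}\right)}{\sqrt{\left(1+\frac{N_{+}^{-}-N_{-}^{+}}{N^{+}}\right)\left(1+\frac{N_{-}^{+}-N_{+}^{-}}{N^{-}}\right)}}$|
(29)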
where |${N}^{+}$| is the total number of positive samples investigated, while |${N}_{-}^{+}$| is the number of positive samples incorrectly predicted as negative; |${N}^{-}$| is the total number of negative samples investigated, while |${N}_{+}^{-}$| is the number of negative samples incorrectly predicted as positive.
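A small Python sketch of these four metrics, written in terms of the four counts defined above. It assumes the usual forms Sn = 1 - N(+,-)/N(+), Sp = 1 - N(-,+)/N(-), Acc = 1 - (N(+,-) + N(-,+))/(N(+) + N(-)) and the matching expression for MCC; the function and argument names are hypothetical.

```python
import math


def evaluate(n_pos: int, n_neg: int, false_neg: int, false_pos: int) -> dict:
    """Sn, Sp, Acc and MCC from the sample totals and the two error counts.

    n_pos / n_neg: total positive / negative samples.
    false_neg: positives predicted as negative; false_pos: negatives predicted as positive.
    """
    sn = 1 - false_neg / n_pos
    sp = 1 - false_pos / n_neg
    acc = 1 - (false_neg + false_pos) / (n_pos + n_neg)
    # The denominator is zero when every sample is assigned to a single class.
    denom = math.sqrt(
        (1 + (false_pos - false_neg) / n_pos) * (1 + (false_neg - false_pos) / n_neg)
    )
    mcc = (1 - (false_neg / n_pos + false_pos / n_neg)) / denom
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}


# Toy counts: 100 hotspots with 20 missed, 100 coldspots with 10 misclassified.
metrics = evaluate(n_pos=100, n_neg=100, false_neg=20, false_pos=10)
print(metrics)
```

For these toy counts the MCC agrees with the value obtained from the confusion-matrix form (TP × TN - FP × FN) divided by the square root of the product of the four marginal totals.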

Receiver operating characteristic curve

The receiver operating characteristic (ROC) curve [70–74] was also used to measure the predictive power of the current method; its vertical coordinate is the true positive rate (⁠|$\mathrm{Sensitivity}$|⁠) and its horizontal coordinate is the false positive rate (⁠|$1-\mathrm{Specificity}$|⁠). An area under the ROC curve (AUC) of 0.5 is equivalent to random prediction, while a value of 1 represents a perfect prediction.
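The AUC can also be computed without drawing the curve, because it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below uses this pairwise formulation; the scores and labels are toy values, and the quadratic loop is chosen for clarity rather than speed.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(positives) * len(negatives))


print(roc_auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # perfectly separated toy scores
print(roc_auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))  # one of four pairs is misordered
```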

Table 4

Comparison with state-of-the-art predictors on data set S2

Methods | 5-fold cross-validation | Jackknife cross-validation
Methods | Sn(%) | Sp(%) | Acc(%) | MCC | Sn(%) | Sp(%) | Acc(%) | MCC
iRSpot-EL | 75.29 | 88.81 | 82.65 | 0.651
RF-DYMHC | 73.01 | 86.56 | 80.40 | 0.605
IDQD-4-mer | 79.52 | 81.82 | 80.77 | 0.616
iRSpot-PseDNC | 71.75 | 85.84 | 79.33 | 0.583
iRSpot-TNCPseAAC | 76.56 | 70.99 | 73.52 | 0.474
iRSpot-ADPM | 77.19 | 90.73 | 84.57 | 0.690 | 76.36 | 90.56 | 84.10 | 0.681
iRSpot-PDI | 71.84 | 92.95 | 83.16 | 0.666
iRSpot-SF | 84.57 | 75.76 | 84.58 | 0.694 | 75.31 | 91.08 | 83.91 | 0.677

Model validation method

Cross-validation is a commonly used statistical analysis method that can be used to objectively evaluate the performance of a classification model [75, 76]. There are three commonly used cross-validation methods, namely independent data set test, n-fold cross-validation test and jackknife cross-validation test. To objectively evaluate the performance of different models, we tested them on the independent test data sets in this review.
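The relationship between the n-fold and jackknife tests can be made explicit with a short sketch that only generates the index splits. The function names are hypothetical, and no shuffling or class stratification is performed here, although both are common in practice.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists; every sample is tested exactly once."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must lie between 2 and n_samples")
    for fold in range(k):
        test = list(range(fold, n_samples, k))  # every k-th sample goes to this fold
        held_out = set(test)
        train = [i for i in range(n_samples) if i not in held_out]
        yield train, test


def jackknife_indices(n_samples: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Leave-one-out splits: the jackknife test is k-fold with k equal to n_samples."""
    return k_fold_indices(n_samples, n_samples)


folds = list(k_fold_indices(10, 5))
print([test for _, test in folds])
print([test for _, test in jackknife_indices(4)])
```

An independent data set test differs from both in that the test samples are never used for training in any round.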

Results and discussion

Comparison with state-of-the-art predictors on four data sets

Considering the fact that the existing methods were trained and tested based on different data, we evaluated them on data sets S1, S2, S3 and S4, respectively.

We first evaluated the 12 predictors that were trained on data set S1 and list the results in Table 3. The results are based on the 5-fold cross-validation test and the jackknife test. The Weighted-Features (WF) model performs best in the 5-fold cross-validation test, with the best Sn, Sp, Acc and MCC. WF was developed by applying QD analysis to weighted features. This result demonstrates that weighting the importance of features is effective for improving the accuracy of the model. In the jackknife test, the best result is obtained by HcsPredictor. We also noticed that iRSpot-GAEnsC, which integrates seven algorithms, did not improve the prediction accuracy. Therefore, the integration approach should be chosen cautiously.

Eight predictors are based on data set S2. As shown in Table 4, iRSpot-ADPM is the best predictor in 5-fold cross-validation. iRSpot-ADPM used the weights calculated by the SVM method to reduce the number of features from 1575 to 85. An SVM with a Gaussian radial basis function kernel (SVM-RBF) was then evaluated by 5-fold cross-validation and obtained an accuracy of 84.57%. Compared with the accuracy of 83.14% obtained by iRSpot-ADPM1575, iRSpot-ADPM employed fewer features but obtained higher accuracy. Feature filtering is therefore very helpful for building a model with a subset of non-redundant features.

Data set S3 is an equal-length (isometric) data set derived from data set S1. Because its sequences share one length, that length can be used directly as a window to scan an entire sequence. A predictor of this kind is more widely applicable, which is very important for discovering new recombination hotspots. As shown in Table 5, on data set S3 the predictor iRSpot-Pse6NC achieved the best results.

Table 5

Comparison with state-of-the-art predictors on data set S3 in 5-fold cross-validation

Methods | Sn(%) | Sp(%) | Acc(%) | MCC
iRSpot-Pse6NC | 75.71 | 91.03 | 84.08 | 0.6805
iRSpot-PseDNC | 62.34 | 90.52 | 77.92 | 0.5585
iRSpot-TNCPseAAC | 61.02 | 89.51 | 76.60 | 0.5334
IDQD-4-mer | 69.59 | 75.09 | 72.59 | 0.4469

Based on data set S4, which contains both ORF and non-ORF recombination hotspots, we constructed a predictor called iRSpot-Pse6NC2.0 by incorporating the key hexamer features into the general PseKNC via the binomial distribution feature selection approach. We sorted the features according to the scores obtained from the binomial distribution. Using overall accuracy as the vertical coordinate and the number of features as the horizontal coordinate, we plotted the IFS curve shown in Figure 1. The curve peaks at 77.61%, at a horizontal coordinate of 1000. This result (77.61%) is higher than that obtained using all features (74.85%). Accordingly, the top 1000 hexamers were selected to form the optimal feature subset used to train the prediction model.

Figure 1

IFS curve for predicting recombination hotspots based on the 5-fold cross-validation. An IFS peak of 77.61% was observed when using the top 1000 hexamers to perform prediction
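The procedure behind such an IFS curve can be sketched as a loop that grows the feature subset along the ranking and records the score at each size. In this illustration the evaluation step is a hypothetical callback (in practice it would return a cross-validated accuracy), and the feature names and toy scoring rule are assumptions.

```python
from typing import Callable, List, Sequence, Tuple


def incremental_feature_selection(
    ranked_features: Sequence[str],
    evaluate: Callable[[List[str]], float],
    step: int = 1,
) -> Tuple[List[str], float, List[Tuple[int, float]]]:
    """Grow the subset along the ranking and keep the size with the best score."""
    curve = []
    best_subset, best_score = [], float("-inf")
    for size in range(step, len(ranked_features) + 1, step):
        subset = list(ranked_features[:size])
        score = evaluate(subset)  # e.g. 5-fold cross-validated accuracy
        curve.append((size, score))
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score, curve


# Toy ranking in which only the first three features carry signal.
ranked = ["f1", "f2", "f3", "f4", "f5", "f6"]
informative = {"f1", "f2", "f3"}


def toy_accuracy(subset: List[str]) -> float:
    hits = sum(name in informative for name in subset)
    return 0.5 + 0.1 * hits - 0.02 * (len(subset) - hits)


best_subset, best_score, curve = incremental_feature_selection(ranked, toy_accuracy)
print(best_subset, curve)
```

Plotting the `curve` pairs gives the IFS curve, and `best_subset` corresponds to its peak.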

To further investigate the performance of iRSpot-Pse6NC2.0, we drew the ROC curve in Figure 2. The AUC reaches 0.833, indicating that the proposed method is promising and has the potential to become a useful high-throughput tool for predicting recombination hotspots.

Figure 2

The ROC curve for identifying recombination hotspots by using 1000 optimal hexamers. The AUC of 0.833 was obtained in 5-fold cross-validation. The diagonal dot line denotes a random guess with the AUC of 0.5

Comparison with physical properties

In S. cerevisiae, both the Set1 and Spo11 proteins affect the formation of DSBs through specific binding of DNA sequences [77, 78]. As a sequence feature, the hexamer can reflect the influence of these two proteins on the formation of DSBs. Because nucleosome occupancy plays an important role in DSB formation, we further investigated its influence on recombination spot prediction by adding features describing the physical properties of nucleosomes [20]. As can be seen from Table 6, the nucleosome features alone produced an accuracy of 74.74%, demonstrating that nucleosomes could influence recombination. However, combining the nucleosome features with the optimal hexamer composition reduced the accuracy to 74.20%, suggesting that there is information redundancy. In fact, many studies have shown that hexamer composition is an important feature that can describe DNA motifs and can be used to identify a variety of DNA regulatory elements [77, 79]. Our results are consistent with this property: the maximum accuracy of 77.61% was obtained with the optimal hexamers.

Table 6

Comparison with physical properties on data set S4 (5-fold cross-validation)

Feature | Algorithm | Sn(%) | Sp(%) | Acc(%)
Optimal hexamers | SVM | 75.24 | 78.01 | 77.611
Optimal hexamers | LightGBM | 77.24 | 75.80 | 76.531
Physical properties | SVM | 72.91 | 74.89 | 74.742
Optimal hexamers & Physical properties | SVM | 73.59 | 74.79 | 74.197

We also analysed the top 50 hexamer features that had the greatest impact on the results and found that the AT content of these hexamers was relatively high. In S. cerevisiae, although recombination hotspots and GC content at longer ranges are positively correlated, studies have found that most recombination spots lie in intergenic regions, which tend to be more AT-rich than their surrounding regions [11].
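As an illustration of the quantities involved, the sketch below computes the relative frequency of overlapping hexamers in a sequence and the AT content of a single hexamer. The function names and the toy sequences are assumptions; this is not the feature extraction code of iRSpot-Pse6NC2.0.

```python
from collections import Counter
from typing import Dict


def kmer_frequencies(sequence: str, k: int = 6) -> Dict[str, float]:
    """Relative frequency of every overlapping k-mer (hexamers when k = 6)."""
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def at_content(kmer: str) -> float:
    """Fraction of A and T bases in a k-mer."""
    return sum(base in "AT" for base in kmer.upper()) / len(kmer)


print(kmer_frequencies("ATATATGC"))
print(at_content("ATATGC"))
```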

Availability of online web server

For the convenience of experimental scientists, an online service is usually built so that users only need to submit data in FASTA format to obtain a prediction. Among the 15 existing predictors, only seven provide an online service. After checking them, as shown in Table 7, we found that only four web servers can be used at present.

Performance comparison on the yeast chromosome XVI

If the input sequence is too long, two problems arise. One is that a good prediction cannot be achieved; the other is that it cannot be determined whether the sequence contains multiple recombination hotspots. To solve these problems, some predictors introduce the sliding window method, which scans the input sequence with a sliding window and then predicts each sub-sequence. There are two ways of implementing the sliding window. One is used by iRSpot-Pse6NC, in which the model is built from an equal-length data set; when a prediction is made, the input sequence is divided into sub-sequences of the same length as those in the training data set. The other is used by iRSpot-EL, in which the input sequence is segmented according to a window length set by the user, and the generated sub-sequences are then predicted. Both methods have their own advantages. iRSpot-Pse6NC determines the most appropriate window length, while iRSpot-EL retains the complete experimental data as the training set. In the newly constructed predictor iRSpot-Pse6NC2.0, we combined these two advantages: we retained the complete experimental data to construct the training set and took the average sequence length (189 bp) as the length of the sliding window.
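A minimal sketch of the segmentation step, using the 189 bp window mentioned above as the default. The step size is left as a parameter, and dropping a trailing fragment shorter than the window is an assumption of this example rather than documented server behaviour; the toy sequence is not real chromosome data.

```python
from typing import Iterator, Tuple


def sliding_windows(
    sequence: str, window: int = 189, step: int = 189
) -> Iterator[Tuple[int, str]]:
    """Yield (start, sub-sequence) for every full-length window."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    for start in range(0, len(sequence) - window + 1, step):
        yield start, sequence[start:start + window]


toy_chromosome = "ACGT" * 100  # 400 bp toy sequence
windows = list(sliding_windows(toy_chromosome))
print([(start, len(sub)) for start, sub in windows])
```

Each sub-sequence would then be passed to the trained model, so that several hotspots within one long input can be reported separately.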

Then, we used the yeast chromosome XVI to compare the performance of the predictors. We submitted the yeast chromosome XVI in FASTA format to the web servers listed in Table 7 and iRSpot-Pse6NC2.0 to obtain corresponding prediction results.

The predictors iRSpot-PseDNC and iRSpot-TNCPseAAC could not predict chromosome XVI directly, because the whole chromosome sequence is too long. The predictor iRSpot-EL could not be run with the preset parameters. The predictor iRSpot-Pse6NC segmented the chromosome with a window length of 131 bp to generate sub-sequences and then returned a result for each sub-sequence. iRSpot-Pse6NC2.0 segmented the chromosome with a window length of 189 bp and a step of 189 bp. We then analysed the prediction results of iRSpot-Pse6NC and iRSpot-Pse6NC2.0. As shown in Figure 3, the average result of iRSpot-Pse6NC2.0 (Figure 3A) is higher than that of iRSpot-Pse6NC (Figure 3B), indicating that retaining the complete experimental data as the training set is important for accurately predicting recombination hotspot regions.

Figure 3

The prediction accuracy of (A) iRSpot-Pse6NC2.0 and (B) iRSpot-Pse6NC on the chromosome XVI.

In order to provide a detailed analysis, we sorted the prediction probabilities of the 268 experimentally confirmed recombination hotspots in descending order. We then used each prediction probability as a threshold and counted the sub-sequences whose prediction probability exceeded that threshold; the detailed results are shown in Figure 4. From Figures 3A and 4A, iRSpot-Pse6NC2.0 correctly identified 220 experimentally verified recombination hotspots, an accuracy of 82.09% (220/268), and produced 1111 additional predictions. As indicated in Figures 3B and 4B, iRSpot-Pse6NC correctly identified 109 experimentally verified recombination hotspots, an accuracy of 40.67% (109/268), and produced 1084 additional predictions. These results demonstrate that iRSpot-Pse6NC2.0 is more credible, reliable and robust for finding potential recombination hotspots.

Figure 4

Prediction results of (A) iRSpot-Pse6NC2.0 and (B) iRSpot-Pse6NC on the chromosome XVI.

Performance analysis on the independent data set

We further compared the prediction performance of our proposed iRSpot-Pse6NC2.0 with that of iRSpot-PseDNC and iRSpot-TNCPseAAC on the independent data set. The comparative results are listed in Table 8. The accuracy of iRSpot-PseDNC and iRSpot-TNCPseAAC was 56.16% and 54.48%, respectively, which is lower than the 77.43% obtained by iRSpot-Pse6NC2.0, indicating that iRSpot-Pse6NC2.0 is better than iRSpot-TNCPseAAC and iRSpot-PseDNC. This may be because iRSpot-TNCPseAAC and iRSpot-PseDNC were trained on ORF data, while the data set of iRSpot-Pse6NC2.0 was constructed from data derived from the entire chromosome. The hexamer feature used in iRSpot-Pse6NC2.0 was obtained directly from DNA sequences. Thus, the model is more flexible.

Table 8

Comparison with state-of-the-art predictors on independent data set S4

Methods | Sn(%) | Sp(%) | Acc(%)
iRSpot-PseDNC | 76.86 | 35.44 | 56.16
iRSpot-TNCPseAAC | 56.34 | 52.61 | 54.48
iRSpot-Pse6NC2.0 | 77.23 | 77.61 | 77.43

Conclusion

Meiotic recombination is one of the most important driving forces of evolution and has important roles in genome diversity. Investigations of recombination events, especially the identification of recombination hotspots, are significant for understanding the mechanism of recombination initiation. Given the drawbacks of experimental methods, computational methods can help identify recombination hotspots more efficiently. In this work, we have introduced and comprehensively evaluated the currently available tools for the prediction of recombination hotspots in S. cerevisiae. These computational methods were discussed and compared in terms of underlying algorithms, extracted features, predictive capability and practical utility. Subsequently, a new predictor called iRSpot-Pse6NC2.0 was constructed based on a more objective benchmark data set. To obtain a more objective performance evaluation, we constructed an independent test data set to benchmark all tools. The results demonstrated that the new predictor iRSpot-Pse6NC2.0 is superior to existing tools in the identification of recombination hotspots. We anticipate that iRSpot-Pse6NC2.0, available at http://lin-group.cn/server/iRSpot-Pse6NC2.0, will become a powerful tool for identifying recombination hotspots.

It should be noted that the proposed model iRSpot-Pse6NC2.0 was constructed from data for a single-celled organism (S. cerevisiae). Due to species specificity, it may not be suitable for higher eukaryotes. However, the model can be used for other fungal genomes. In fact, species specificity is an important factor affecting the prediction performance of proposed models, so more and more bioinformatics models have been constructed for species-specific prediction [80, 81]. In the future, as data for higher eukaryotes accumulate, we will update our model so that it is suitable for more species.

Key Points

  • A total of 15 computational methods developed for identifying recombination hotspots were comprehensively summarized and discussed.

  • Several comparisons were performed to investigate the predictive capability of published methods.

  • A high-quality data set was built to train and test a new prediction model for identifying recombination hotspots.

  • A user-friendly web server was developed to recognize recombination hotspots.

Funding

This work was supported by the National Natural Science Foundation of China (61772119, 31771471, 61861036) and the Science Strength Promotion Programme of UESTC.

Hui Yang is a PhD candidate at the Center for Informational Biology, University of Electronic Science and Technology of China. Her research interests are bioinformatics, machine learning, RNA modification site and recombination spots.

Wuritu Yang is an associate professor at the State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, Inner Mongolia University. His research is in the areas of bioinformatics.

Fu-Ying Dao is a PhD candidate at the Center for Informational Biology, University of Electronic Science and Technology of China. Her research interests are bioinformatics, machine learning and DNA replication regulation.

Hao Lv is a PhD candidate at the Center for Informational Biology, University of Electronic Science and Technology of China. Her research interests are bioinformatics, machine learning and DNA and RNA modification site.

Hui Ding is an associate professor at the Center for Informational Biology, University of Electronic Science and Technology of China. Her research is in the areas of computational system biology.

Wei Chen is a professor at the Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine. His research is in the areas of bioinformatics, computational epigenetics and epitranscriptome.

Hao Lin is a professor at the Center for Informational Biology, University of Electronic Science and Technology of China. His research is in the areas of bioinformatics and system biology.

References

1.

Gerton
 
JL
,
DeRisi
 
J
,
Shroff
 
R
, et al.  
Global mapping of meiotic recombination hotspots and coldspots in the yeast Saccharomyces cerevisiae
.
Proc Natl Acad Sci U S A
 
2000
;
97
:
11383
90
.

2.

Keeney
 
S
.
Spo11 and the formation of DNA double-strand breaks in meiosis
.
Genome Dyn Stab
 
2008
;
2
:
81
123
.

3.

Myers
 
S
,
Bottolo
 
L
,
Freeman
 
C
, et al.  
A fine-scale map of recombination rates and hotspots across the human genome
.
Science
 
2005
;
310
:
321
4
.

4.

Baudat
 
F
,
Nicolas
 
A
.
Clustering of meiotic double-strand breaks on yeast chromosome III
.
Proc Natl Acad Sci U S A
 
1997
;
94
:
5213
8
.

5.

Lercher
 
MJ
,
Hurst
 
LD
.
Human SNP variability and mutation rate are higher in regions of high recombination
.
Trends Genet
 
2002
;
18
:
337
40
.

6.

Galtier
 
N
,
Piganeau
 
G
,
Mouchiroud
 
D
, et al.  
GC-content evolution in mammalian genomes: the biased gene conversion hypothesis
.
Genetics
 
2001
;
159
:
907
11
.

7.

Webster
 
MT
,
Hurst
 
LD
.
Direct and indirect consequences of meiotic recombination: implications for genome evolution
.
Trends Genet
 
2012
;
28
:
101
9
.

8.

Lynn
 
A
,
Ashley
 
T
,
Hassold
 
T
.
Variation in human meiotic recombination
.
Annu Rev Genomics Hum Genet
 
2004
;
5
:
317
49
.

9.

Mancera
 
E
,
Bourgon
 
R
,
Brozzi
 
A
, et al.  
High-resolution mapping of meiotic crossovers and non-crossovers in yeast
.
Nature
 
2008
;
454
:
479
85
.

10.

Shen
 
Z
,
Lin
 
Y
,
Zou
 
Q
.
Transcription factors-DNA interactions in rice: identification and verification
.
Brief Bioinform
 
2019
;
10.1093/bib/bbz045
.

11.

Pan
 
J
,
Sasaki
 
M
,
Kniewel
 
R
, et al.  
A hierarchical combination of factors shapes the genome-wide topography of yeast meiotic recombination initiation
.
Cell
 
2011
;
144
:
719
31
.

12.

Zhou
 
T
,
Weng
 
J
,
Sun
 
X
, et al.  
Support vector machine for classification of meiotic recombination hotspots and coldspots in Saccharomyces cerevisiae based on codon composition
.
BMC Bioinformatics
 
2006
;
7
:
223
.

13.

Jiang
 
P
,
Wu
 
H
,
Wei
 
J
, et al.  
RF-DYMHC: detecting the yeast meiotic recombination hotspots and coldspots by random forest model using gapped dinucleotide composition features
.
Nucleic Acids Res
 
2007
;
35
:
W47
51
.

14.

Liu
 
B
,
Liu
 
Y
,
Jin
 
X
, et al.  
iRSpot-DACC: a computational predictor for recombination hot/cold spots identification based on dinucleotide-based auto-cross covariance
.
Sci Rep
 
2016
;
6
:
33483
.

15.

Chen
 
W
,
Lei
 
TY
,
Jin
 
DC
, et al.  
PseKNC: a flexible web server for generating pseudo K-tuple nucleotide composition
.
Anal Biochem
 
2014
;
456
:
53
60
.

16.

Chen
 
W
,
Feng
 
PM
,
Lin
 
H
, et al.  
iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition
.
Nucleic Acids Res
 
2013
;
41
:
e68
.

17.

Li
 
L
,
Yu
 
S
,
Xiao
 
W
, et al.  
Sequence-based identification of recombination spots using pseudo nucleic acid representation and recursive feature extraction by linear kernel SVM
.
BMC Bioinformatics
 
2014
;
15
:
340
.

18.

Qiu
 
WR
,
Xiao
 
X
,
Chou
 
KC
.
iRSpot-TNCPseAAC: identify recombination spots with trinucleotide composition and pseudo amino acid components
.
Int J Mol Sci
 
2014
;
15
:
1746
66
.

19.

Zhang
 
BJ
,
Liu
 
GQ
.
Predicting recombination hotspots in yeast based on DNA sequence and chromatin structure
.
Curr Bioinforma
 
2014
;
9
:
28
33
.

20.

Liu
 
G
,
Xing
 
Y
,
Cai
 
L
.
Using weighted features to predict recombination hotspots in Saccharomyces cerevisiae
.
J Theor Biol
 
2015
;
382
:
15
22
.

21.

Dong
 
C
,
Yuan
 
YZ
,
Zhang
 
FZ
, et al.  
Combining pseudo dinucleotide composition with the Z curve method to improve the accuracy of predicting DNA elements: a case study in recombination spots
.
Mol BioSyst
 
2016
;
12
:
2893
900
.

22.

Liu
 
B
,
Wang
 
S
,
Long
 
R
, et al.  
iRSpot-EL: identify recombination spots with an ensemble learning approach
.
Bioinformatics
 
2017
;
33
:
35
41
.

23.

Kabir
 
M
,
Hayat
 
M
.
iRSpot-GAEnsC: identifing recombination spots via ensemble classifier and extending the concept of Chou’s PseAAC to formulate DNA samples
.
Mol Gen Genomics
 
2016
;
291
:
285
96
.

24.

Zhang
 
L
,
Kong
 
L
.
iRSpot-ADPM: identify recombination spots by incorporating the associated dinucleotide product model into Chou’s pseudo components
.
J Theor Biol
 
2018
;
441
:
1
8
.

25.

Zhang
 
L
,
iRSpot-PDI
 
KL
.
Identification of recombination spots by incorporating dinucleotide property diversity information into Chou’s pseudo components
.
Genomics
 
2018
.

26.

Al Maruf
 
MA
,
Shatabda
 
S
.
iRSpot-SF prediction of recombination hotspots by incorporating sequence based features into Chou’s Pseudo components
.
Genomics
 
2018
.

27.

Yang
 
H
,
Qiu
 
WR
,
Liu
 
G
, et al.  
iRSpot-Pse6NC: identifying recombination spots in Saccharomyces cerevisiae by incorporating hexamer composition into general PseKNC
.
Int J Biol Sci
 
2018
;
14
:
883
91
.

28.

Fu
 
L
,
Niu
 
B
,
Zhu
 
Z
, et al.  
CD-HIT: accelerated for clustering the next-generation sequencing data
.
Bioinformatics
 
2012
;
28
:
3150
2
.

29.

Zou
 
Q
,
Lin
 
G
,
Jiang
 
X
, et al.  
Sequence clustering in bioinformatics: an empirical study
.
Brief Bioinform
 
2019
;
10.1093/bib/bby090
.

30.

Chen
 
Z
,
Zhao
 
P
,
Li
 
FY
, et al.  
iFeature: a Python package and web server for features extraction and selection from protein and peptide sequences
.
Bioinformatics
 
2018
;
34
:
2499
502
.

31. Xu ZC, Feng PM, Yang H, et al. A computational tool for identifying D modification sites in RNA sequence. Bioinformatics 2019.

32. Liu D, Li G, Zuo Y. Function determinants of TET proteins: the arrangements of sequence motifs with specific codes. Brief Bioinform 2018. DOI: 10.1093/bib/bby053.

33. Ding H, Deng EZ, Yuan LF, et al. iCTX-type: a sequence-based predictor for identifying the types of conotoxins in targeting ion channels. Biomed Res Int 2014;2014:286419.

34. Zuo Y, Li Y, Chen Y, et al. PseKRAAC: a flexible web server for generating pseudo K-tuple reduced amino acids composition. Bioinformatics 2017;33:122–4.

35. Zhu XJ, Feng CQ, Lai HY, et al. Predicting protein structural classes for low-similarity sequences by evaluating different features. Knowl-Based Syst 2019;163:787–93.

36. Tang H, Chen W, Lin H. Identification of immunoglobulins using Chou’s pseudo amino acid composition with feature selection technique. Mol BioSyst 2016;12:1269–75.

37. Lopez P, Philippe H, Myllykallio H, et al. Identification of putative chromosomal origins of replication in Archaea. Mol Microbiol 1999;32:883–6.

38. Saeys Y, Inza I, Larranaga P. A review of feature selection techniques in bioinformatics. Bioinformatics 2007;23:2507–17.

39. Zou Q, Zeng J, Cao L, et al. A novel features ranking metric with application to scalable visual and bioinformatics data classification. Neurocomputing 2016;173:346–54.

40. Yang W, Zhu XJ, Huang J, et al. A brief survey of machine learning methods in protein sub-Golgi localization. Curr Bioinform 2019;14:234–40.

41. Long CS, Li W, Liang PF, et al. Transcriptome comparisons of multi-species identify differential genome activation of mammals embryogenesis. IEEE Access 2019;7:7794–802.

42. Akay MF. Support vector machines combined with feature selection for breast cancer diagnosis. Expert Syst Appl 2009;36:3240–7.

43. Guyon I, Weston J, Barnhill S, et al. Gene selection for cancer classification using support vector machines. Mach Learn 2002;46:389–422.

44. Chen W, Feng PM, Deng EZ, et al. iTIS-PseTNC: a sequence-based predictor for identifying translation initiation site in human genes using pseudo trinucleotide composition. Anal Biochem 2014;462:76–83.

45. Chen W, Feng PM, Lin H, et al. iSS-PseDNC: identifying splicing sites using pseudo dinucleotide composition. Biomed Res Int 2014;2014:623149.

46. Feng PM, Chen W, Lin H, et al. iHSP-PseRAAAC: identifying the heat shock protein families using pseudo reduced amino acid alphabet composition. Anal Biochem 2013;442:118–25.

47. Lai HY, Chen XX, Chen W, et al. Sequence-based predictive modeling to identify cancerlectins. Oncotarget 2017;8:28169–75.

48. Yang H, Tang H, Chen XX, et al. Identification of secretory proteins in Mycobacterium tuberculosis using pseudo amino acid composition. Biomed Res Int 2016;2016:5413903.

49. Zhu PP, Li WC, Zhong ZJ, et al. Predicting the subcellular localization of mycobacterial proteins by incorporating the optimal tripeptides into the general form of pseudo amino acid composition. Mol BioSyst 2015;11:558–63.

50. Manavalan B, Shin TH, Lee G. PVP-SVM: sequence-based prediction of phage virion proteins using a support vector machine. Front Microbiol 2018;9:476.

51. Chang CC, Hsu CW, Lin CJ. The analysis of decomposition methods for support vector machines. IEEE Trans Neural Netw 2000;11:1003–8.

52. Schölkopf B, Burges CJC. Advances in Kernel Methods: Support Vector Learning. Cambridge, MA: MIT Press, 1999.

53. Breiman L. Random forests. Mach Learn 2001;45:5–32.

54. Breiman L, Last M, Rice J. Random forests: finding quasars. Statistical Challenges in Astronomy 2003;243–54.

55. Ru XQ, Li LH, Zou Q. Incorporating distance-based top-n-gram and random forest to identify electron transport proteins. J Proteome Res 2019;18:2931–9.

56. Svetnik V, Liaw A, Tong C, et al. Random forest: a classification and regression tool for compound classification and QSAR modeling. J Chem Inf Comput Sci 2003;43:1947–58.

57. Ke GL, Meng Q, Finley T, et al. LightGBM: a highly efficient gradient boosting decision tree. Adv Neural Inf Process Syst 2017;30(NIPS 2017):30.

58. Lin H, Ding H, Guo FB, et al. Predicting subcellular localization of mycobacterial proteins by using Chou’s pseudo amino acid composition. Protein Pept Lett 2008;15:739–44.

59. Lin H. The modified Mahalanobis Discriminant for predicting outer membrane proteins by using Chou’s pseudo amino acid composition. J Theor Biol 2008;252:350–6.

60. Liu G, Liu J, Cui X, et al. Sequence-dependent prediction of recombination hotspots in Saccharomyces cerevisiae. J Theor Biol 2012;293:49–54.

61. Yeung DS, Wang DF, Ng WWY, et al. Structured large margin machines: sensitive to data distributions. Mach Learn 2007;68:171–200.

62. Cao R, Freitas C, Chan L, et al. ProLanGO: protein function prediction using neural machine translation based on a recurrent neural network. Molecules 2017;22.

63. Cao RZ, Bhattacharya D, Hou J, et al. DeepQA: improving the estimation of single protein model quality with deep belief networks. BMC Bioinformatics 2016;17.

64. Lin H, Liang ZY, Tang H, et al. Identifying sigma70 promoters with novel pseudo nucleotide composition. IEEE/ACM Trans Comput Biol Bioinform 2019;16:1316–21.

65. Yang H, Lv H, Ding H, et al. iRNA-2OM: a sequence-based predictor for identifying 2’-O-methylation sites in Homo sapiens. J Comput Biol 2018;25:1266–77.

66. Song J, Wang Y, Li F, et al. iProt-sub: a comprehensive package for accurately mapping and predicting protease-specific substrates and cleavage sites. Brief Bioinform 2018;20:638–58.

67. Song J, Li F, Leier A, et al. PROSPERous: high-throughput prediction of substrate cleavage sites for 90 proteases with improved accuracy. Bioinformatics 2018;34:684–7.

68. Stephenson N, Shane E, Chase J, et al. Survey of machine learning techniques in drug discovery. Curr Drug Metab 2018.

69. Manavalan B, Subramaniyam S, Shin TH, et al. Machine-learning-based prediction of cell-penetrating peptides and their uptake efficiency with improved accuracy. J Proteome Res 2018;17:2715–26.

70. Tan JX, Li SH, Zhang ZM, et al. Identification of hormone binding proteins based on machine learning methods. Math Biosci Eng 2019;16:2466–80.

71. Lv H, Zhang ZM, Li SH, et al. Evaluation of different computational methods on 5-methylcytosine sites identification. Brief Bioinform 2019. DOI: 10.1093/bib/bbz048.

72. Tang H, Zhao YW, Zou P, et al. HBPred: a tool to identify growth hormone-binding proteins. Int J Biol Sci 2018;14:957–64.

73. Cheng L, Jiang Y, Ju H, et al. InfAcrOnt: calculating cross-ontology term similarities using information flow by a random walk. BMC Genomics 2018;19:919.

74. Cheng L, Hu Y, Sun J, et al. DincRNA: a comprehensive web-based bioinformatics toolkit for exploring disease associations and ncRNA function. Bioinformatics 2018;34:1953–6.

75. Cheng L, Wang P, Tian R, et al. LncRNA2Target v2.0: a comprehensive database for target genes of lncRNAs in human and mouse. Nucleic Acids Res 2019;47:D140–4.

76. Hu Y, Zhao T, Zhang N, et al. Identifying diseases-related metabolites using random walk. BMC Bioinformatics 2018;19:116.

77. Myers S, Bowden R, Tumian A, et al. Drive against hotspot motifs in primates implicates the PRDM9 gene in meiotic recombination. Science 2010;327:876–9.

78. Borde V, Robine N, Lin W, et al. Histone H3 lysine 4 trimethylation marks meiotic recombination initiation sites. EMBO J 2009;28:99–111.

79. Liu YC, Li JR, Sun CH, et al. CircNet: a database of circular RNAs derived from transcriptome sequencing data. Nucleic Acids Res 2016;44:D209–15.

80. Lai HY, Zhang ZY, Su ZD, et al. A computational predictor for predicting promoter. Mol Ther Nucleic Acids 2019;17:337–46.

81. Chen XX, Tang H, Li WC, et al. Identification of bacterial cell wall lyases via pseudo amino acid composition. Biomed Res Int 2016;2016:1654623.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)