An in silico approach to identification, categorization and prediction of nucleic acid binding proteins

As shown in Figure 1, the dataset was built including DNA-binding and RNA-binding proteins. Each sample was represented by a 473-dimensional vector describing the sequence information and secondary structural information. The parameters of a random forest model were trained using the training samples. The testing samples were then classified by the random forest model with the previously learned parameters.

Several studies have focused on distinguishing nucleic acid binding proteins from non-nucleic acid binding proteins, and the accuracy of these methods has improved over time [1]. However, these methods cannot distinguish the type of binding protein, such as DNA-binding or RNA-binding, so a method that predicts the specific function of these proteins is valuable. Building on existing research, our work focused on identifying and distinguishing DNA-binding proteins and RNA-binding proteins. Because the secondary structure of RNA is diverse, it is difficult to identify common characteristics of RNA-binding proteins [2, 3]. As a result, the problem of identifying RNA-binding proteins has rarely been considered in previous studies, and with current methods RNA-binding proteins are likely to be recognized as DNA-binding proteins [4]. Since the functions of DNA-binding proteins are quite different from those of RNA-binding proteins, distinguishing the two deserves attention. In our current study, RNA-binding proteins were divided into different classes according to their characteristics, and the identification of DNA-binding proteins was also addressed. Furthermore, to help reveal the biological functions of binding proteins in cellular activities, a protein–nucleic acid binding database called PNIDB was created and can be accessed online. Information such as the functional classification of each protein chain and binding events at the sequence level is described in PNIDB. Moreover, the information provided by PNIDB can be used to predict protein–nucleic acid binding events.

In contrast to previous work, the proposed method further divides nucleic acid binding proteins into RNA-binding proteins and DNA-binding proteins, both of which are identified by gene ontology in PNIDB. The rationale is that protein functions can be studied more thoroughly when the proteins are identified precisely. The contributions of our work include:

(1) A database describing protein–nucleic acid interactions, known as PNIDB, which can be accessed by researchers at the website http://server.malab.cn/PNIDB/index.html. The information provided by PNIDB can help reveal the biological functions of binding proteins in cellular activities. The proteins in PNIDB are labeled by gene ontology identifiers.
(2) An efficient classifier for predicting DNA-binding proteins and RNA-binding proteins, based on sequence information and secondary structural information. The proposed method achieved an accuracy of 77.43%, which outperforms the other methods compared. The experimental results demonstrated that combining the two kinds of information improved the prediction accuracy.
(3) A web server for the prediction of protein–nucleic acid binding. The web server predicts the function of proteins, which can help researchers studying protein–nucleic acid interactions.

In the rest of the paper, the usage manual of PNIDB is introduced first, and then the proposed methodology for the classification of functional proteins is illustrated in detail. The performance of our proposed method is demonstrated by experimental results. Finally, a conclusion is made at the end of the manuscript.

Usage of PNIDB

Other databases also provide information on protein–DNA/RNA interactions. For example, the Protein–DNA Interface database (PDIdb) [5] is a repository of structural information for 922 protein–DNA complexes with a resolution of 2.5 Å or better (although there are in fact 2396 complexes of this kind in the database). The Nucleic Acid–Protein Interaction database (NPIDB) [6, 7] contains structural classifications and detailed information on both DNA–protein and RNA–protein complexes extracted from PDB. The current version of NPIDB contains 5046 structures, while PNIDB contains 6228 PDB structures overall. In contrast to the above databases, proteins in PNIDB are annotated with gene ontology terms.

PNIDB provides detailed atom-based interaction information. The significance of PNIDB is that it specifically focuses on sequence-level annotations and provides functional clustering, which should be of benefit for sequence-based research and functional prediction of protein–nucleic acid interactions.

PNIDB is a repository of protein–nucleic acid interaction information derived from 6798 nucleic acid-containing structures collected from the Protein Data Bank [8]. The Bio3D [9] package was used to read, analyze and manipulate PDB structures in R (http://www.R-project.org/). For each PDB file, the protein chains and nucleic acid chains were identified and extracted. The possible binding residues of each protein chain, defined as residues with at least one atom within 5 Å of any nucleic acid atom, and the corresponding binding nucleotides of the nucleic acid chain were also calculated. PNIDB can be accessed at http://server.malab.cn:8080/PNIDB/Statistics.html. In total, PNIDB contains 36 425 interactions, of which 40% are protein–RNA interactions and 60% are protein–DNA interactions. The composition of PNIDB, including the class information of the DNA-binding and RNA-binding interactions, is summarized in Table 1.
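The 5 Å contact criterion described above can be illustrated with a short Python sketch. This is not the Bio3D/R code used to build PNIDB; the function name, the input layout and the toy coordinates are hypothetical, and treating "within 5 Å" as "less than or equal to 5 Å" is an assumption.

```python
from math import dist  # Euclidean distance, Python 3.8+

CUTOFF = 5.0  # distance threshold in angstroms, as stated in the text


def binding_residues(protein_atoms, nucleic_atoms, cutoff=CUTOFF):
    """Return the sorted positions of residues with at least one atom
    within `cutoff` angstroms of any nucleic acid atom.

    protein_atoms: list of (residue_position, (x, y, z)) tuples (hypothetical layout)
    nucleic_atoms: list of (x, y, z) tuples
    """
    hits = set()
    for res_pos, xyz in protein_atoms:
        if res_pos in hits:
            continue  # residue already known to be in contact
        if any(dist(xyz, nxyz) <= cutoff for nxyz in nucleic_atoms):
            hits.add(res_pos)
    return sorted(hits)


# Toy coordinates, for illustration only.
protein = [
    (1, (0.0, 0.0, 0.0)),
    (1, (1.5, 0.0, 0.0)),
    (2, (10.0, 0.0, 0.0)),
    (3, (20.0, 0.0, 0.0)),
]
nucleic = [(4.0, 0.0, 0.0), (13.0, 0.0, 0.0)]

contacts = binding_residues(protein, nucleic)
print(contacts)
```

In this toy example, residues 1 and 2 each have an atom within 5 Å of a nucleic acid atom, while residue 3 does not. A real structure would need a spatial index rather than this all-pairs loop.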

Table 1

The component information of PNIDB

Type of interaction	Class	No. of samples
DNA-binding interactions	Cell cycle	101
	Gene regulation	490
	Hydrolase	3717
	Immune system	234
	Isomerase	635
	Ligase	209
	Lyase	552
	Metal-binding protein	108
	Nuclear protein	51
	Recombination	312
	Oxidoreductase	112
	Replication	418
	Structural protein	2161
	Transcription	5013
	Transferase	4727
	Transport protein	37
	Viral protein	185
RNA-binding interactions	Splicing	470
	Ligase	211
	Isomerase	70
	Immune system	223
	Hydrolase	818
	Ribosome	8409
	Transcription	645
	Transferase	957
	Translation	468
	Viral protein	411

Protein chains in interaction pairs were classified according to their mmCIF keywords, interaction type and gene ontology [10] terms. A total of 84 753 chains were extracted from those structures, of which 20 927 contained nucleic acid binding residues. All the protein chains were clustered into 27 functional groups, comprising 17 kinds of DNA-binding proteins and 10 kinds of RNA-binding proteins. Moreover, each protein chain in these interactions was linked to its UniProt accession number as well as the corresponding InterPro identifiers [11] and GO identifiers [10, 12] mapped from the SIFTS project [13].

For convenience, residues and nucleotides are cited by their relative position in the sequence of their respective chains. In addition, 2D and 3D visualization interfaces are provided online. Figure 2A and B show the 3D and 2D visualization interfaces of PNIDB, respectively. In Figure 2B, the visualization interface focuses on the nucleic acid sequence, and the binding protein residues and their positions are highlighted. In this view, residues within 3.9 Å are considered binding residues.

Figure 2

(A) 3D visualization interface of the interaction between a protein and a nucleic acid residue. (B) 2D visualization interface of an interaction (solid lines denote hydrogen bonds). (C) The web server for binding protein prediction. (D) Results provided by the web-based predictor. (E) Search page on the web server. (F) Comparison with other methods by ACC. Nucleic acids are labeled in red and protein residues in gray; in the 2D visualization, dashed lines denote residues within 3.9 Å.

A search page is also provided online, and users can search by keywords. Both a quick search and an advanced search were implemented. In quick search mode, users can start a search by specifying a keyword, such as a PDB ID [8], organism, interaction type, classification or UniProt accession number [14]. In advanced search mode, several search criteria can be combined. In addition, for the convenience of other researchers, a web server for binding protein prediction was developed. The web server can handle up to 10 FASTA sequences at a time, and the results are returned by email.

The web server is shown in Figure 2C, and the results predicted by the learned classifier are shown in Figure 2D. The search page of PNIDB is shown in Figure 2E. There are three options: ‘search by interaction type’, ‘search by organism’ and ‘search by classification’. Users can search for proteins by specifying their requirements, and several parameters can be combined in a single search. The matching results are retrieved once the requirements are submitted. Further information can be retrieved from PNIDB for each entry, such as the molecule name of the protein chain, the sequence of the protein chain, the binding residues of the protein chain, the sequence of the nucleic acid chain, the binding nucleotides of the nucleic acid chain, the corresponding InterPro IDs [11] of the protein chain, the GO identifiers of the protein chain and the 3D visualization with labels of contacting residues and nucleotides based on 3Dmol.js [15]. Schematic diagrams of protein–nucleic acid interactions based on NUCPLOT [16] can be obtained by clicking the tab on the webpage.

For the convenience of related study, all binding residues/nucleotides were renumbered according to their corresponding chain sequence. In addition, users can also browse the interactions in the browse page by selecting specific classifications of the protein chains in the menu on the left side.

Methodology

The method for predicting the function of proteins is described in this section. First, the benchmark dataset used is introduced. Then, the method of feature extraction is described. Lastly, the classification of the binding proteins is explained.

Benchmark dataset

The data used in this work were selected from the SwissProt dataset (https://web.expasy.org/docs/swiss-prot_guideline.html). SwissProt contains protein sequences with GO annotations, selected with high confidence from the UniProt dataset (https://www.uniprot.org/). The dataset used here was composed of DNA-binding proteins and RNA-binding proteins. The DNA-binding proteins had gene ontology annotations from non-IEA sources (that is, not inferred from electronic annotation). Redundant sequences were removed using CD-HIT [17], so that the similarity between any two sequences used in our experiments was less than 30%. The benchmark dataset contained five classes and is summarized in Table 2. The preprocessing of the benchmark dataset is shown in Figure 3.

Table 2

Sample detail of benchmark dataset

Class	No. of samples
DNA-binding transcription factor	200
Helicase activity	256
Nuclease	200
RBP spliceosomal complex	199
RBP structural constituent of ribosome	213

Figure 3

Flowchart of the preprocessing of the benchmark dataset.

Figure 4

Comparison by Sp.

Figure 5

Process of random forest model generation (Breiman L. Random forests. Machine Learning 2001;45(1):5–32).

Feature extraction

In the literature, a residue sequence is usually represented by a feature vector v before prediction. An effective feature set is expected to distinguish positive samples from negative samples with high accuracy, so the quality of the feature set is critical to the performance of any predictor. In our method, a sequence S is represented by evolutionary information and secondary structural information. In total, 473 features are extracted from sequence S: 440 describe evolutionary information and 33 describe secondary structural information. The features include PSI-BLAST features [18], which describe the evolutionary information, and PSI-PRED features [19], which describe the secondary structure. The combination of these features has previously been used for protein fold prediction [20].

A protein sequence S_L is denoted as S_L = {R₁, R₂, … , R_L}, where L is the length of the sequence. PSI-BLAST is run against the protein database nrdb90 [21] and produces a position-specific score matrix (PSSM). A PSSM is an L × 20 matrix, written as Equation (1):

$$\begin{equation} {\left[\begin{array}{ccc}{M}_{1,1}& \cdots & {M}_{1,20}\\{}\vdots & \ddots & \vdots \\{}{M}_{L,1}& \cdots & {M}_{L,20}\end{array}\right]}_{L\times 20} \end{equation}$$

(1)

M_i,j denotes the score of the residue at the ith position of S_L being mutated to residue type j during evolution. Twenty features are computed as the average value of each column (Equation (2)):

$$\begin{equation} F_v=\overline{M}_j=\frac{1}{L}\sum_{i=1}^{L} M_{i,j},\quad j=1,\dots,20 \end{equation}$$

(2)
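Equation (2) can be illustrated with a few lines of Python. This is a minimal sketch rather than the authors' code: the function name and the toy PSSM values are hypothetical, and the PSSM is assumed to be given as a list of L rows with 20 scores each.

```python
def pssm_column_means(pssm):
    """Average each of the 20 columns of an L x 20 PSSM, as in Equation (2)."""
    length = len(pssm)
    if length == 0:
        raise ValueError("PSSM must contain at least one row")
    return [sum(row[j] for row in pssm) / length for j in range(20)]


# Toy PSSM with L = 3 rows; the scores are made up for illustration.
toy_pssm = [[1] * 20, [2] * 20, [3] * 20]
toy_pssm[0][0] = 4  # make the first column differ from the others

column_features = pssm_column_means(toy_pssm)
print(len(column_features), column_features[:2])
```

Whatever the sequence length L, the result is always a fixed-length vector of 20 values, which is what allows sequences of different lengths to be compared by the same classifier.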

Evolutionary information is also used to describe a sequence. During evolution, the amino acid located at the ith position of the sequence may mutate to type j, with score M_i,j; the transformed score M′_i,j is computed from M_i,j as in Equation (3):

$$\begin{equation} {M}_{i,j}^{\prime }={2}^{M_{i,j}\times{P}_j} \end{equation}$$

(3)

P_j represents the background frequency of residue type j. The background frequencies are derived from the average occurrence frequencies of the 20 amino acids in the sequences of the protein database PDB25 [22].

Let $\delta_i$ denote the index, in the amino acid alphabet, of the amino acid that replaces the residue at the ith position of the original sequence S_L. Replacing every residue by its $\delta_i$th amino acid transforms S_L into a consensus sequence S_con. The frequency of each amino acid type in S_con is used as a feature; since there are 20 amino acid types, 20 features are extracted from each sequence.
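The transformation in Equation (3) and the composition features of the consensus sequence can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the alphabet ordering, the uniform background frequencies and the toy PSSM are hypothetical, and choosing $\delta_i$ as the column with the largest transformed score is an assumption, since the text does not state how $\delta_i$ is selected.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 amino acids


def transform(pssm, background):
    """Apply Equation (3) as written in the text: M'[i][j] = 2 ** (M[i][j] * P[j])."""
    return [[2 ** (m * background[j]) for j, m in enumerate(row)] for row in pssm]


def consensus(pssm, background):
    """Build S_con. Assumption: delta_i is the column with the largest M'[i][j]."""
    letters = []
    for row in transform(pssm, background):
        delta = max(range(20), key=row.__getitem__)
        letters.append(ALPHABET[delta])
    return "".join(letters)


def composition(seq):
    """20 features: the frequency of each amino acid type in the sequence."""
    return [seq.count(a) / len(seq) for a in ALPHABET]


def one_hot_row(j, score=5):
    """Toy PSSM row whose only non-zero score is in column j (hypothetical data)."""
    row = [0] * 20
    row[j] = score
    return row


toy_pssm = [one_hot_row(0), one_hot_row(1), one_hot_row(0), one_hot_row(1)]
background = [0.05] * 20  # hypothetical uniform background frequencies

s_con = consensus(toy_pssm, background)
composition_features = composition(s_con)
print(s_con, composition_features[:2])
```

With this toy input, the largest transformed score alternates between the first and second columns, so the consensus sequence alternates between the first two letters of the assumed alphabet.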

R_i represents the ith residue of a peptide. There are 20 types of native amino acids, so R_i has 20 possible values, and there are 400 (20 × 20) possible pairs of consecutive amino acids R_iR_j. The occurrence frequencies of these pairs therefore form a 400-dimensional feature vector. Such 400-dimensional features have been widely used in bioinformatics, for example in Alzheimer’s disease gene identification [23] and the detection of anticancer peptides [24], and they capture sequence information. In contrast to the usual 400-dimensional features, the features used in our work are computed on the consensus sequence S_con: the occurrence frequency of each pair $\delta_i\delta_j$ is used as a feature in v. Therefore, another 400 features are extracted from each sequence.

PSI-PRED-based features have been widely used in secondary structure prediction. They include six structure–sequence-based features and a × 3ⁿ + 3ⁿ structure probability matrix-based features. With a set to 8 and n set to 1, there are 6 + 8 × 3 + 3 = 33 PSI-PRED features. In total, therefore, 473 (20 + 20 + 400 + 33) features are used to represent a sequence S_L.

Classification

Support vector machines, naive Bayes and ensemble methods have all been widely used in bioinformatics, for example in tumor origin detection [25], protein function prediction and disease detection [26]. The performance of a predictor also depends on the classifier used, so an efficient classifier is critical. In our work, a random forest model was used to predict the function of proteins.

A random forest is a type of ensemble classifier. Its key idea is that a number of decision trees are used together for prediction. Each decision tree is trained on a dataset built by bagging, that is, by sampling the training set with replacement. Each tree makes its own decision, and the final decision is made by voting: a sample is assigned to the class with the most votes. The process of building a random forest model is shown in Figure 5.
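The bagging and voting steps can be made concrete with a small standard-library sketch. It is deliberately simplified and is not the model used in this work: each "tree" is a one-split decision stump trained on a bootstrap sample with a randomly chosen feature, and all function names, parameters and data are hypothetical.

```python
import random
from collections import Counter


def train_stump(X, y, feature_ids):
    """Fit a one-split tree: keep the (feature, threshold) with the fewest training errors."""
    best = None
    for f in feature_ids:
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0] if right else l_lab
            errors = sum(lab != l_lab for lab in left) + sum(lab != r_lab for lab in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    _, f, t, l_lab, r_lab = best
    return lambda row: l_lab if row[f] <= t else r_lab


def train_forest(X, y, n_trees=25, n_features=1, seed=0):
    """Train n_trees stumps, each on its own bootstrap sample and feature subset."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]          # bagging: sample with replacement
        feats = rng.sample(range(len(X[0])), n_features)  # random feature subset
        trees.append(train_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return trees


def predict(trees, row):
    """Majority vote over the individual tree decisions."""
    votes = Counter(tree(row) for tree in trees)
    return votes.most_common(1)[0][0]


# Toy two-feature data with two class labels (illustrative only).
X = [[0.1, 1.0], [0.2, 1.1], [0.3, 0.9], [0.8, 2.0], [0.9, 2.1], [1.0, 1.9]]
y = ["DNA", "DNA", "DNA", "RNA", "RNA", "RNA"]

trees = train_forest(X, y)
print(predict(trees, [0.05, 0.5]), predict(trees, [0.95, 2.0]))
```

A full random forest grows deeper trees and draws a new random feature subset at every split; in practice a library implementation would normally be used instead of a hand-written one.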

Results

To demonstrate the efficiency of our proposed model, our proposed method is compared with other methods, which have been used widely in the literature.

The 188D method extracts features from a sequence [27]; it describes the amino acid composition, distribution and physicochemical properties. This method has been used in bioinformatics, for example in the identification of antioxidant proteins [28].

The Kmer method extracts features representing the occurrence frequency of k consecutive amino acids in a sequence. In our experiments, k was set to 2. A sequence of length n contains (n – k) + 1 overlapping k-mers, and with 20 amino acid types there are 20ᵏ possible k-mers, so the Kmer (k = 2) feature vector has 400 dimensions.
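A short sketch of Kmer feature extraction follows. It is a generic illustration rather than the exact implementation compared in the experiments; the alphabet ordering, the function name and the example peptide are hypothetical.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 amino acids


def kmer_features(seq, k=2):
    """Return the 20**k occurrence frequencies of the overlapping k-mers in seq."""
    if len(seq) < k:
        raise ValueError("sequence is shorter than k")
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]  # (n - k) + 1 windows
    counts = Counter(windows)
    total = len(windows)
    return [counts["".join(p)] / total for p in product(ALPHABET, repeat=k)]


peptide = "MKKLA"  # made-up peptide of length n = 5
kmer_vector = kmer_features(peptide, k=2)
print(len(kmer_vector), sum(kmer_vector))
```

The peptide of length 5 yields four overlapping 2-mers (MK, KK, KL, LA), while the feature vector always has 400 entries regardless of the sequence length.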

PC-PseAAC is based on the pseudo amino acid composition (PseAAC) and has been used in protein identification [29]. It combines local and global residue information within the PseAAC framework.

Autocorrelation (AC) measures the correlation of the same property between any two residues separated by a given distance (lag) along the sequence [30].

The experiments were based on 10-fold cross validation, in which the dataset is divided into 10 parts. In each round, nine parts (90% of the samples) are used for training and the remaining part (10%) is used for testing.
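The splitting step of 10-fold cross validation can be sketched as below. The shuffling, the seed and the function name are assumptions made for illustration; the text does not say whether the folds were shuffled or stratified by class. The sample count of 1068 is the sum of the class sizes in Table 2.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k nearly equal parts
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


n_samples = 200 + 256 + 200 + 199 + 213  # class sizes from Table 2
splits = list(k_fold_indices(n_samples))
for train_idx, test_idx in splits[:2]:
    print(len(train_idx), len(test_idx))
```

The accuracy reported for a method is then typically the average over the 10 test folds.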

The evaluation metric was measured by accuracy (ACC), which was denoted as the rate of correctly classified samples using method G. ACC has been widely used in the literature [31]. The dataset was composed of both positive and negative samples. The result set was divided into four parts, which were true positive (TP), false positive (FP), true negative (TN) and false negative (FN). TP is the number of positive samples classified correctly. FP is the number of negative samples labeled as positive samples. TN denotes the number of negative samples that were labeled correctly. FN is the number of positive samples that were recognized as negative samples. The accuracy (ACC) of method G was computed using Equation (4):

$$\begin{equation} {\mathrm{ACC}}_G=\frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}} \end{equation}$$

(4)

The comparison of the feature sets by ACC is shown in Figure 2F. To further support our results, specificity (Sp) was also measured and compared. Specificity is defined in Equation (5):

$$\begin{equation} \mathrm{Sp}=\frac{\mathrm{TN}}{\mathrm{TN}+\mathrm{FP}} \end{equation}$$

(5)
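Equations (4) and (5) translate directly into code. The sketch below uses made-up confusion counts; for a benchmark with five classes, the counts would presumably be obtained per class in a one-vs-rest fashion, which is an assumption not stated in the text.

```python
def accuracy(tp, tn, fp, fn):
    """Equation (4): fraction of all samples that are classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def specificity(tn, fp):
    """Equation (5): fraction of negative samples that are correctly rejected."""
    return tn / (tn + fp)


# Hypothetical confusion counts, for illustration only.
tp, tn, fp, fn = 40, 45, 5, 10
acc = accuracy(tp, tn, fp, fn)
sp = specificity(tn, fp)
print(acc, sp)
```

Reporting Sp alongside ACC shows how often negative samples are wrongly labeled as positive, which ACC alone can hide when the classes are imbalanced.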

The experimental results (Figures 2F and 4) show that our proposed method performed better than the other feature sets. Combining sequence information features improved the prediction accuracy. In fact, the features used in Kmer (k = 2) form part of the features in 473D. The accuracy improved by nearly 19% in relative terms ((77.43 − 65.07)/65.07 ≈ 19%) when the sequences were represented by the PSI-BLAST and PSI-PRED features rather than by the Kmer (k = 2) features alone. Sequence information is also extracted in 188D, which reached an accuracy of 0.69 and outperformed all other methods except 473D. The accuracies of AC and CC were 0.41 and 0.47, respectively, and the accuracy of the combination of AC and CC was 0.4625, which was not as good as other methods such as 188D. In these experiments, the AC and CC features were not suitable for predicting the function of proteins. The sequence information and secondary structural information helped to improve the accuracy and are therefore used in our proposed method.

Conclusions

Due to the rapid growth in the number of protein sequences whose functions have not been identified, a database describing protein–nucleic acid interactions (PNIDB) was provided in our work. The functions of sequences are labeled using GO identifiers in PNIDB. PNIDB provides a convenient and user-friendly interface to query and browse detailed information on protein–nucleic acid interactions. Unlike existing databases, PNIDB covers both protein–DNA and protein–RNA interactions, and the functional classifications are considered at the sequence level. Moreover, a benchmark database is available for the prediction of protein–nucleic acid binding events at either the protein residue level or the nucleotide level. PNIDB will also aid in the functional prediction of nucleic acid binding proteins based on protein sequence and may help to provide putative drug targets and novel therapy options. The problem of classifying DNA-binding proteins and RNA-binding proteins was also considered in this work. Sequences were represented by PSI-BLAST features and PSI-PRED features, and a random forest model was used to predict the type of protein, such as DNA-binding or RNA-binding. The accuracy of our proposed method was 0.774, which is better than that of the other methods compared. A web server for protein prediction is provided online for the convenience of other researchers. In summary, PNIDB, labeled with gene ontology identifiers, was built to describe the function of proteins, and a computational predictor was developed for classifying DNA-binding proteins and RNA-binding proteins.

Key Points

A database describing protein–nucleic acid interactions, known as PNIDB, is built and can be accessed.
An efficient classifier for predicting DNA-binding proteins and RNA-binding proteins was proposed based on sequence information and secondary structural information; the proposed method achieved an accuracy of 77.43%, which outperforms the other methods compared.
A web server for the prediction of protein–nucleic acid binding was also developed. The web server predicts the function of proteins, which can help researchers studying protein–nucleic acid interactions.

Availability and Implementation

PNIDB is now fully working and can be freely accessed at http://server.malab.cn/PNIDB/index.html. All the data are publicly available for noncommercial use, distribution and reproduction in any medium.

Funding

This work was supported by the Natural Science Foundation of China (Nos. 61902259, 61771331), the Natural Science Foundation of Guangdong Province (grant no. 2018A0303130084) and the Science and Technology Innovation Commission of Shenzhen (grant no. JCYJ20170818100431895).

Conflicts of Interest: The authors declare no conflicts of interest.

Lei Xu is an associate professor at the School of Electronic and Communication Engineering, Shenzhen Polytechnic. She received her BSc and MSc from the School of Computer Science and Technology in Harbin Institute of Technology in 2006 and 2008, respectively. She got her PhD degree from the Department of Computing, The Hong Kong Polytechnic University, in October 2013. Her research interests are focused on bioinformatics and pattern recognition.

Shanshan Jiang is a research student in Peking University. She majors in bioinformatics, machine learning and algorithms.

Quan Zou is a professor in Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China. He is a senior member of IEEE and ACM. He majors in bioinformatics, machine learning and algorithms.

Jin Wu is an assistant professor at the School of Management, Shenzhen Polytechnic. Her research interests are focused on bioinformatics and biology.

References

1. Qu K, Zou Q. A review of DNA-binding proteins prediction methods. Curr Bioinformatics 2018;14(3).
2. Gao X, et al. iRBP-motif-PSSM: identification of RNA-binding proteins based on collaborative learning. IEEE Access 2019;7:168956–62.
3. Bin L, Xin G, Hanyu Z. BioSeq-Analysis2.0: an updated platform for analyzing DNA, RNA and protein sequences at sequence level and residue level based on machine learning approaches. Nucleic Acids Res 2019;(20):20.
4. Zhang J, Chen Q, Liu B. DeepDRBP-2L: a new genome annotation predictor for identifying DNA binding proteins and RNA binding proteins using convolutional neural network and long short-term memory. IEEE/ACM Trans Comput Biol Bioinform 2019;11.
5. Norambuena T, Melo F. The protein-DNA Interface database. BMC Bioinformatics 2010;11:262.
6. Kirsanov DD, Zanegina O, Aksianov EA, et al. NPIDB: nucleic acid-protein interaction database. Nucleic Acids Res 2013;41:D517–23.
7. Olga Z, Dmitriy K, Eugene B, et al. An updated version of NPIDB includes new classifications of DNA–protein complexes and their families. Nucleic Acids Res 2016;(D1):D144–53.
8. Burley SK, et al. RCSB Protein Data Bank: sustaining a living digital data resource that enables breakthroughs in scientific research and biomedical education. Protein Science A Publication of the Protein Society 2015.
9. Skjaerven L, Yao XQ, Scarabelli G, et al. Integrating protein structural dynamics and evolutionary analysis with Bio3D. BMC Bioinformatics 2014;15(1):399.
10. Ashburner M, Ball CA, Blake JA, et al. Gene ontology: tool for the unification of biology. Nat Genet 2000;25(1):25–9.
11. Finn RDAA. InterPro in 2017—beyond protein family and domain annotations. Nucleic Acids Res 2017;45(D1):D190–9.
12. Carbon S. Expansion of the gene ontology knowledgebase and resources: the gene ontology consortium. Nucleic Acids Res 2017;45:D331–8.
13. Velankar S. SIFTS: structure integration with function, taxonomy and sequences resource. Nucleic Acids Res 2013;41:483–9.
14. Rolf A, et al. UniProt: the universal protein knowledgebase. Nucleic Acids Res 2004;32:D115.
15. Rego N, Koes D. 3Dmol.js: molecular visualization with WebGL. Bioinformatics 2015;31(8):1322–4.
16. Luscombe N, et al. NUCPLOT: a program to generate schematic diagrams of protein-nucleic acid interactions. Nucleic Acids Res 1997;25(24):4940–5.
17. Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006;22(13):1658.
18. Altschul SF, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997;25(17):3389–402.
19. Jones D. Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol 1999;292(2):195–202.
20. Wei L, Liao M, Gao X, et al. Enhanced protein fold prediction method through a novel feature extraction technique. IEEE Trans Nanobioscience 2015;14(6):649–59.
21. Holm L, Sander C. Removing near-neighbour redundancy from large protein sequence collections. Bioinformatics 1998;14:423–9.
22. Sussman JL, Lin D, Jiang J, et al. Protein Data Bank (PDB): database of three-dimensional structural information of biological macromolecules. Acta Crystallogr 2010;54(6–1):1078–84.
23. Xu L, Liang G, Liao C, et al. An efficient classifier for Alzheimer’s disease genes identification. Molecules 2018;23(12):3140.
24. Xu L, Liang G, Liao C, et al. A novel hybrid sequence-based model for identifying anticancer peptides. Genes 2018;9(3):158.
25. Tang W, Wan S, Yang Z, et al. Tumor origin detection with tissue-specific miRNA and DNA methylation markers. Bioinformatics 2018;34(3):398–406.
26. Xu L, Liang G, Lian C, et al. K-skip-n-gram-RF: a random forest based method for Alzheimer's disease protein identification. Front Genet 2019;10:33. DOI: 10.3389/fgene.2019.00033
27. Cai CZ, Han LY, Ji ZL, et al. SVM-Prot: web-based support vector machine software for functional classification of a protein from its primary sequence. Nucleic Acids Res 2003;31(13):3692–7.
28. Xu L, Liang G, Shi S, et al. SeqSVM: a sequence-based support vector machine method for identifying antioxidant proteins. Int J Mol Sci 2018;19(6):1773.
29. Chou KC. Some remarks on protein attribute prediction and pseudo amino acid composition. J Theor Biol 2011;273(1):236–47.
30. Liu B, Liu F, Fang L, et al. repDNA: a python package to generate various modes of feature vectors for DNA sequences by incorporating user-defined physicochemical properties and sequence-order effects. Bioinformatics 2015;31(8):1307.
31. Chen J, Peng H, Han G, et al. HOGMMNC: a higher order graph matching with multiple network constraints model for gene–drug regulatory modules identification. Bioinformatics 2018;35(4):602–10.