Abstract

The interaction between proteins and nucleic acid plays an important role in many processes, such as transcription, translation and DNA repair. The mechanisms of related biological events can be understood by exploring the function of proteins in these interactions. The number of known protein sequences has increased rapidly in recent years, but the databases for describing the structure and function of protein have unfortunately grown quite slowly. Thus, improving such databases is meaningful for predicting protein–nucleic acid interactions. Furthermore, the mechanism of related biological events, such as viral infection or designing novel drug targets, can be further understood by understanding the function of proteins in these interactions. The information for each sequence, including its function and interaction sites, were collected and identified, and a database called PNIDB was built. The proteins in PNIDB were grouped into 27 classes, such as transcription, immune system, and structural protein, etc. The function of each protein was then predicted using a machine learning method. Using our method, the predictor was trained on labeled sequences, and then the function of a protein was predicted based on the trained classifier. The prediction accuracy achieved a score of 77.43% by 10-fold cross validation.

Introduction

As the chief actors within the cells, proteins are involved in many essential activities in the cell, and the interactions between proteins with nucleic acid are extremely important for many cellular processes, such as transcription, translation and DNA repair. Therefore, the study of protein–nucleic acid binding activities can help with understanding protein interaction networks or even the mechanism of related cellular processes. With the development of sequencing technology, the amount of protein sequence information has increased rapidly over recent years, but the growth rate of databases describing the structure and function of proteins has been very slow, and cannot match the growth rate of protein sequence databases. Thus, it is essential to narrow the gap between sequence database size and functional database size by characterizing nucleic acid binding proteins and their functional groups. Therefore, a database called PNIDB (Protein–Nucleic Acid Interactions Database, PNIDB for short), in which DNA-binding proteins and RNA-binding proteins are denoted by gene ontology, was built. Sequence information was extracted, and an efficient classification method for predicting DNA-binding and RNA-binding proteins is proposed in this article. The process of our work is summarized in Figure 1.

 The process of protein prediction.
Figure 1

The process of protein prediction.

As shown in Figure 1, the dataset was built including DNA-binding and RNA-binding proteins. Each sample was represented by a 473-dimensional vector describing the sequence information and secondary structural information. The parameters of a random forest model were trained using the training samples. The testing samples were then classified by the random forest model with the previously learned parameters.

There are studies focusing on identifying nucleic acid binding proteins and non-nucleic acid binding proteins, and the accuracy of existing methods has been improved over time [1]. However, these methods cannot distinguish the type of binding proteins, such as DNA-binding proteins or RNA-binding proteins, so it is meaningful to be able to predict the functions of these proteins. Based on existing research, our work focused on identifying and distinguishing DNA-binding proteins and RNA-binding proteins. Since the secondary structure of RNA is diverse, it is difficult to identify common characteristics of RNA-binding proteins [2, 3]. As a result, the problem of identifying RNA-binding proteins has been rarely considered in previous studies. Thus, using current methods, RNA-binding proteins are likely to be recognized as DNA-binding proteins [4]. In fact, the function of DNA-binding proteins is quite different from the function of RNA-binding proteins; thus, the study of distinguishing DNA-binding proteins and RNA-binding proteins should be considered. In our current study, RNA-binding proteins were divided into different classes depending on their characteristics, and the identification of DNA-binding proteins was discussed. Furthermore, for the purpose of revealing the biological functions of binding proteins in cellular activities, a protein–nucleic acid binding database called PNIDB was created and can be accessed online. The information, such as functional classification of protein chain and binding events at the sequence level, are all described in PNIDB. Moreover, the database can predict protein–nucleic acid binding events with the information given by PNIDB.

In contrast to previous work, the nucleic–acid binding proteins were further divided into RNA-binding proteins and DNA-binding proteins in the proposed method, and both of them were identified by gene ontology in PNIDB. The rationale was that the functions of proteins can be learned thoroughly when the proteins are identified precisely. The contributions of our work include:

  • (1) A database describing protein–nucleic acid interactions, known as PNIDB, which can be accessed by researchers at the website http://server.malab.cn/PNIDB/index.html. The information provided by PNIDB can help reveal the biological functions of binding proteins in cellular activities. The proteins in PNIDB are labeled by gene ontology identifiers.

  • (2) An efficient classifier for predicting DNA-binding proteins and RNA-binding proteins was proposed based on sequence information and secondary structural information, and the accuracy of the proposed method achieved a correlation of 77.43%, which outperforms other methods. The experimental results demonstrated that the combined information could improve the prediction accuracy of our method.

  • (3) A web server for the prediction of protein–nucleic acid was also developed. The web server was used to predict the function of proteins, which can help researchers in studying in protein–nucleic acid interactions.

In the rest of the paper, the usage manual of PNIDB is introduced first, and then the proposed methodology for the classification of functional proteins is illustrated in detail. The performance of our proposed method is demonstrated by experimental results. Finally, a conclusion is made at the end of the manuscript.

Usage of PNIDB

There are other databases that provide information regarding protein–DNA/RNA interactions. For example, the protein–DNA Interface database (PDIdb) [5] is a repository containing structural information for 922 protein–DNA complexes with a resolution of 2.5 Å or more (while in fact there are 2396 this kind of complexes in the database). The Nucleic Acid–Protein Interaction database (NPIDB) [6] [7] contains structural classifications and detailed information on both DNA–protein and RNA–protein complexes extracted from PDB. The current version of NPIDB contains 5046 structures, while PNIDB contains 6228 PDB structures overall. In contrast to the above databases, proteins are denoted by gene ontology in PNIDB.

PNIDB provides detailed atom-based interaction information. The significance of PNIDB is that it specifically focuses on sequence-level annotations and provides functional clustering, which should be of benefit for sequence-based research and functional prediction of protein–nucleic acid interactions.

PNIDB is a repository of protein–nucleic acid interaction information derived from 6798 nucleic acid-containing structures collected from the Protein Data Bank [8]. The Bio3D [9] package was used to read, analyze and manipulate PDB structures in R (http://www.R-project.org/). For each PDB file, we identified the proteins, and the nucleic acid chains were then extracted. Moreover, possible binding residues of the protein chain, which were defined as the residues with at least one atom within 5 Å from any nucleic acid atom, and corresponding binding nucleotides of the nucleic acid chain were also calculated in PNIDB. PNIDB can be accessed at the website http://server.malab.cn:8080/PNIDB/Statistics.html. There are totally 36 425 interactions in PNIDB, in which 40% of them are protein–RNA interactions and 60% of them are protein–DNA interactions. The information of PNIDB are summarized in Table 1. The class information of DNA-binding interactions and RNA-binding interactions are shown in Table 1.

Table 1

The component information of PNIDB

Type of interactionClassNo. of samples
DNA-binding interactionsCell cycle101
Gene regulation490
Hydrolase3717
Immune system234
Isomerase635
Ligase209
Lyase552
Metal-binding protein108
Nuclear protein51
Recombination312
Oxidoreductase112
Replication418
Structural protein2161
Transcription5013
Transferase4727
Transport protein37
Viral protein185
RNA-binding interactionSplicing470
Ligase211
Isomerase70
Immune system223
Hydrolase818
Ribosome8409
Transcription645
Transferase957
Translation468
Viral protein411
Type of interactionClassNo. of samples
DNA-binding interactionsCell cycle101
Gene regulation490
Hydrolase3717
Immune system234
Isomerase635
Ligase209
Lyase552
Metal-binding protein108
Nuclear protein51
Recombination312
Oxidoreductase112
Replication418
Structural protein2161
Transcription5013
Transferase4727
Transport protein37
Viral protein185
RNA-binding interactionSplicing470
Ligase211
Isomerase70
Immune system223
Hydrolase818
Ribosome8409
Transcription645
Transferase957
Translation468
Viral protein411
Table 1

The component information of PNIDB

Type of interactionClassNo. of samples
DNA-binding interactionsCell cycle101
Gene regulation490
Hydrolase3717
Immune system234
Isomerase635
Ligase209
Lyase552
Metal-binding protein108
Nuclear protein51
Recombination312
Oxidoreductase112
Replication418
Structural protein2161
Transcription5013
Transferase4727
Transport protein37
Viral protein185
RNA-binding interactionSplicing470
Ligase211
Isomerase70
Immune system223
Hydrolase818
Ribosome8409
Transcription645
Transferase957
Translation468
Viral protein411
Type of interactionClassNo. of samples
DNA-binding interactionsCell cycle101
Gene regulation490
Hydrolase3717
Immune system234
Isomerase635
Ligase209
Lyase552
Metal-binding protein108
Nuclear protein51
Recombination312
Oxidoreductase112
Replication418
Structural protein2161
Transcription5013
Transferase4727
Transport protein37
Viral protein185
RNA-binding interactionSplicing470
Ligase211
Isomerase70
Immune system223
Hydrolase818
Ribosome8409
Transcription645
Transferase957
Translation468
Viral protein411

Protein chains in interaction pairs were classified according to their mmCIF keywords, interaction type and gene ontology [10] terms. There were a total of 84 753 chains extracted from those structures, in which 20 927 chains contained nucleic acid binding residues. All the protein chains were clustered into 27 functional groups, with 17 kinds of DNA-binding proteins and 10 kinds of RNA-binding proteins. Moreover, each protein chain in these interactions was linked to their respective accession numbers from UniProt as well as the corresponding InterPro identifiers [11] and GO identifiers [10, 12] mapped from the SIFTS project [13].

For convenience, the residues and nucleotides are cited by their relative position in the sequence of their separate chains. In addition, the 2D and 3D visualization interfaces are provided online. Figure 2A and B shows the interfaces of the 2D and 3D visualization, respectively, in PNIDB. In Figure 2B, the visualization interface focuses on nucleic acid sequences. The binding protein residue and the position are highlighted in the figure. The residues exceeding 3.9 Å are considered binding residues.

(A) 3D visualization interface of the interaction between a protein and a nucleic acid residue. (B) 2D visualization interface of an interaction (solid line for hydrogen bond). (C) The web server for binding protein prediction. (D) Results provided by the web-based predictor. (E) Search page on the web server. (F) Comparison with other methods by ACC. (nucleic acids are labeled in red, while protein residues are in gray, and in the 2D visualization, the dash line denotes residues within 3.9 Å).
Figure 2

(A) 3D visualization interface of the interaction between a protein and a nucleic acid residue. (B) 2D visualization interface of an interaction (solid line for hydrogen bond). (C) The web server for binding protein prediction. (D) Results provided by the web-based predictor. (E) Search page on the web server. (F) Comparison with other methods by ACC. (nucleic acids are labeled in red, while protein residues are in gray, and in the 2D visualization, the dash line denotes residues within 3.9 Å).

A search page is also provided online, and users can search by keywords. A quick search and advanced search were implemented. In quick search mode, users can start a search by specifying a keyword, such as a PDB ID [8] organism, interaction type, classification or Uniprot accession number [14]. In the advanced search mode, for the convenience of other researchers, a web server for binding protein prediction was developed. The web server can handle up to 10 fasta sequences at the same time. Then, the results are returned by email.

The web server is shown in Figure 2C. The results predicted by a learned classifier are shown in Figure 2D. The search page of PNIDB is provided, which is shown in Figure 2E. There are three options, ‘search by interaction type,’ ‘search by organism’ and ‘search by classification.’ Users can search for proteins by describing the requirements. Moreover, proteins can be searched by combining several parameters simultaneously. The matching results will be retrieved when the requirements are submitted. Furthermore, more information can be referred using PNIDB, such as the molecule name of the protein chain, the sequence of the protein chain, the binding residues of the protein chain, the sequence of the nucleic acid chain, the binding nucleotides of the nucleic acid chain, the corresponding InterPro IDs [11] of the protein chain, the GO identifiers of the protein chain and the 3D visualization with labels of contacting residues and nucleotides based on 3Dmol.js [15]. The schematic diagrams of protein–nucleic acid interactions based on NUCPLOT [16] can be obtained by clicking the tab on the webpage.

For the convenience of related study, all binding residues/nucleotides were renumbered according to their corresponding chain sequence. In addition, users can also browse the interactions in the browse page by selecting specific classifications of the protein chains in the menu on the left side.

Methodology

The method for predicting the function of proteins is described in this section. First, the benchmark dataset used is introduced. Then, the method of feature extraction is described. Lastly, the classification of the binding proteins is explained.

Benchmark dataset

The data used in this work was selected from the SwissProt dataset (https://web.expasy.org/docs/swiss-prot_guideline.html). The data in SwissProt contained GO protein sequences, which were selected from the UniProt dataset (https://www.uniprot.org/) with high confidence. The SwissProt dataset was composed of DNA-binding proteins and RNA-binding proteins. The DNA-binding proteins had non-IEA source gene ontologies. The sequences that were more similar were removed using CD-HIT [17]. The similarity degree between the sequences used in our experiments was less than 30%. The benchmark dataset contained five classes. The benchmark dataset used in these experiments is summarized in Table 2. The preprocess of benchmark dataset is shown in Figure 3.

Table 2

Sample detail of benchmark dataset

ClassNo. of samples
DNA-binding transcription factor200
Helicase activity256
Nuclease200
RBP spliceosomal complex199
RBP structural constituent of ribosome213
ClassNo. of samples
DNA-binding transcription factor200
Helicase activity256
Nuclease200
RBP spliceosomal complex199
RBP structural constituent of ribosome213
Table 2

Sample detail of benchmark dataset

ClassNo. of samples
DNA-binding transcription factor200
Helicase activity256
Nuclease200
RBP spliceosomal complex199
RBP structural constituent of ribosome213
ClassNo. of samples
DNA-binding transcription factor200
Helicase activity256
Nuclease200
RBP spliceosomal complex199
RBP structural constituent of ribosome213
The flowchart of preprocess on benchmark dataset.
Figure 3

The flowchart of preprocess on benchmark dataset.

Comparison by Sp.
Figure 4

Comparison by Sp.

Process of random forest model generation Breiman L. Random Forests[J]. Machine Learning, 2001, 45(1):5–32.
Figure 5

Process of random forest model generation Breiman L. Random Forests[J]. Machine Learning, 2001, 45(1):5–32.

Feature extraction

In the literature, residue sequences are usually represented by a vector v before the process of prediction. An efficient feature set is expected to distinguish positive samples and negative samples with high accuracy. The quality of a feature set is critical to the performance of any predictor. In our method, a sequence S is represented by the sequential evolution information and secondary structural information. There are totally 473 features extracted. 400 of them are evolution information, and 73 features are used to represent secondary structural information. The features used in our method were extracted from sequence S. The features include PSI-BLAST features [18] and PSI-PRED features [19]. PSI-BLAST describes the evolutional information, and the secondary structural information is shown by PSI-PRED. The combination of these features has been previously used for protein fold prediction [20].

A protein sequence SL is denoted as SL = {R1,R2, … ,RL} and L is the length of the residue. PSI-BLAST is based on the protein database nrdb90 [21] and a position-specific score matrix (PSSM). A PSSM is a matrix with L × 20, written as Equation (1):
(1)
Mi,j is denoted as the score of the residue at the ith position of SL being mutated to the residue type j during an evolutionary process. The 20 features are computed by the average value of each column (Equation (2)):
(2)
The evolution information is also used for describing a sequence. During the evolutionary process, an amino acid located at the ith position in the residue may be mutated to type j, and the score of it is denoted as Mij (Equation (3)):
(3)

Pj represents the background frequency of residue type j, and the background depends on the average occurrence frequency of all 20 amino acids in each sequence of the protein database PDB25 [22].

|${\delta}_{\mathrm{i}}$| is represented as the residue located at the ith position in an original sequence, SL, replaced by the |${\delta}_{\mathrm{i}}$|th amino acid in the amino acid alphabet. The sequence SL is transferred into a consensus sequence Scon by using |${\delta}_i$|⁠. The frequency of |$\mathrm{each}\ {\delta}_i$| in the sequence is denoted as a feature. Thus, there are 20 amino acids, so 20 features are extracted from each sequence.

Ri represents the ith residue of a peptide. There are 20 types of native amino acids, which means that there are 20 possible values for Ri. Thus, the number of two consecutive amino acids RiRj is 400 (20 × 20), meaning that the number of dimensions describing the occurrence frequency of two constitute amino acids is 400. The 400-dimensional features have been widely used in the literature of bioinformatics, such as Alzheimer’s disease identification [23] and detection of anticancer peptides [24]. Thus, sequence information can be revealed using these 400-dimensional features. In contrast to using 400-dimensional features, the features used in our work were based on the consensus sequence Scon. The frequency of |${\delta}_i{\delta}_j$| is denoted as a feature. The frequency of the occurrence of |${\delta}_i{\delta}_j$| is denoted as a feature in v. Therefore, another 400 features are extracted from each sequence.

PSI-PRED-based features have been widely used in secondary structure prediction. The features include six structure–sequence-based features and a × 3n + 3n structure probability matrix-based features. The value of a is set to be 8, and n is 1. Thus, there are 33 PSI-PRED features. Therefore, there are 473 (20 + 20 + 400 + 33) features used to represent a sequence SL in total.

Classification

Support vector machine, naive Bayes and ensemble methods have all been widely used in bioinformatics, such as prediction of tumor detection [25], function prediction of proteins and disease detection [26]. The performance of a predictor is also related to the classifier used. Thus, an efficient classifier is critical for the performance of a computational predictor. In our work, a random forest model was used to predict the function of proteins.

Random forest models are a type of ensemble classifier. The key idea of random forest model is that a number of decision trees are used together for prediction (Figure 5). The decision trees are trained by the datasets, which are built based on bagging. Each decision tree makes a decision, and the final decision is made by a voting process. A sample is then classified into the class with the most votes. The process of a random forest model is shown in Figure 3.

Results

To demonstrate the efficiency of our proposed model, our proposed method is compared with other methods, which have been used widely in the literature.

188D is proposed to extract features from a sequence [27]. The amino acid composition, distribution and physicochemical property are described in 188D. This method has been used in bioinformatics, such as for the identification of antioxidant proteins [28].

The Kmer (k = 2) method extracts features representing the occurrence frequency of k consecutive amino acid in a residue. In our experiments, k was set to 2. The number of dimensions of Kmer features is (nk) + 1, where n is the length of residue.

PC-PseAAC was proposed based on pseudo amino acid components (PseAAC) and has been used in protein identification [29]. The information on location residue and global residue are mixed into PseAAC in PC-PseAAC.

An autocorrelation (AC) is the correlation between any two residues with distance lag on the same properties [30].

The experiments were based on a 10-fold cross validation. In a 10-fold cross validation, a dataset is divided into 10 parts. Ninety percent of samples are then used for training parameters, and the remaining 10% are considered as testing samples.

The evaluation metric was measured by accuracy (ACC), which was denoted as the rate of correctly classified samples using method G. ACC has been widely used in the literature [31]. The dataset was composed of both positive and negative samples. The result set was divided into four parts, which were true positive (TP), false positive (FP), true negative (TN) and false negative (FN). TP is the number of positive samples classified correctly. FP is the number of negative samples labeled as positive samples. TN denotes the number of negative samples that were labeled correctly. FN is the number of positive samples that were recognized as negative samples. The accuracy (ACC) of method G was computed using Equation (4):
(4)
The comparison of the feature sets is shown in Figure 2F. For the purpose of convincing our work, specificity (Sp) is measured and compared. Specificity is denoted as Equation (5):
(5)

The experimental results (Figures 2F and 4) show that our proposed method performed better than other feature sets. The combination of sequence information features improved the prediction accuracy. In fact, the features used in Kmer (k = 2) were a part of the features in 473D. In the experimental results, the accuracy was improved by nearly 19% (((77.43–65.07)/65.07)%) when the sequences were represented by the PSI-BLAST and PSI-PRED features compared with only using the Kmer (k = 2) features. The sequence information was also extracted in 188D, so the accuracy was 0.69, which outperforms other existing methods except 473D. The accuracy of AC and CC were 0.41 and 0.47, respectively. The accuracy of the combination of AC and CC was 0.4625, which was not as good as other methods, such as 188D. In the experiments, the features of AC and CC were not suitable for predicting the function of proteins. The sequence information and secondary structural information were helpful for improving the accuracy and have been used in our proposed method.

Conclusions

Due to the rapid growth in the number of protein sequences without identification of their functions, a database describing the protein–nucleic acid interactions (PNIDB) was provided in our work. The functions of sequences were labeled using GO identifiers in PNIDB. PNIDB provides a convenient and user-friendly interface to query and browse detailed information on protein–nucleic acid interactions. Different from existing databases, PNIDB focuses on both protein–DNA and protein–RNA interactions, and the functional classifications are considered at the sequence level. Moreover, a benchmark database is available for the prediction of protein–nucleic acid binding events at either the protein residue level or nucleotide level. PNIDB will also aid in the functional prediction of nucleic-binding proteins based on protein sequence and may help for providing putative drug targets and novel therapy options. The problem of classification of DNA-binding proteins and RNA-binding proteins was also considered in this work. The sequences are represented by PSI-BLAST features and PSI-PRED features, and a random forest model was used to predict the type of protein, such as DNA-binding proteins and RNA-binding proteins. The accuracy of our proposed method was 0.774, which performs better than other methods. A web server for protein prediction was provided online for the convenience of other researchers. Above all, PNIDB labeled by gene ontology identifiers was built for describing the function of proteins, and a computational predictor was developed for classifying DNA-binding proteins and RNA-binding proteins.

Key Points
  • A database describing protein–nucleic acid interactions, known as PNIDB, is built and can be accessed.

  • An efficient classifier for predicting DNA-binding proteins and RNA-binding proteins was proposed based on sequence information and secondary structural information, and the accuracy of the proposed method achieved a correlation of 77.43%, which outperforms other methods.

  • A web server for the prediction of protein–nucleic acid was also developed. The web server was used to predict the function of proteins, which can help researchers in studying protein–nucleic acid interactions.

Availability and Implementation

PNIDB is now fully working and can be freely accessed at http://server.malab.cn/PNIDB/index.html. All the data are publicly available for noncommercial use, distribution and reproduction in any medium.

Funding

This work was supported by the Natural Science Foundation of China (Nos. 61902259, 61771331), the Natural Science Foundation of Guangdong Province (grant no. 2018A0303130084) and the Science and Technology Innovation Commission of Shenzhen (grant no. JCYJ20170818100431895).Conflicts of Interest: The authors declare no conflicts of interest.

Lei Xu is an associate professor at the School of Electronic and Communication Engineering, Shenzhen Polytechnic. She received her BSc and MSc from the School of Computer Science and Technology in Harbin Institute of Technology in 2006 and 2008, respectively. She got her PhD degree from the Department of Computing, The Hong Kong Polytechnic University, in October 2013. Her research interests are focused on bioinformatics and pattern recognition.

Shanshan Jiang is a research student in Peking University. She majors in bioinformatics, machine learning and algorithms.

Quan Zou is a professor in Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China. He is a senior member of IEEE and ACM. He majors in bioinformatics, machine learning and algorithms.

Jin Wu is an assistant professor at the School of Management, Shenzhen Polytechnic. Her research interests are focused on bioinformatics and biology.

References

1.

Qu
K
,
Zou
Q
.
A review of DNA-binding proteins prediction methods
.
Curr Bioinformatics
2018
;
14
(
3
).

2.

Gao
X
, et al.
iRBP-motif-PSSM: identification of RNA-binding proteins based on collaborative learning
.
IEEE Access
2019
;
7
:
168956
62
.

3.

Bin
L
,
Xin
G
,
Hanyu
Z
.
BioSeq-Analysis2.0: an updated platform for analyzing DNA, RNA and protein sequences at sequence level and residue level based on machine learning approaches
.
Nucleic Acids Res
2019
(
20
):
20
.

4.

Zhang
J
,
Chen
Q
,
Liu
B
.
DeepDRBP-2L: a new genome annotation predictor for identifying DNA binding proteins and RNA binding proteins using convolutional neural network and long short-term memory
.
IEEE/ACM Trans Comput Biol Bioinform
2019
;
11
.

5.

Norambuena, T.a.M., F
.
The protein-DNA Interface database
.
BMC Bioinformatics
2010
;
11
:
262
.

6.

Kirsanov
DD
,
Zanegina
O
,
Aksianov
EA
, et al.
NPIDB: nucleic acid - protein interaction database
.
Nucleic Acids Res
2013
;(
41
):
D517
23
.

7.

Olga
Z
,
Dmitriy
K
,
Eugene
B
, et al.
An updated version of NPIDB includes new classifications of DNA–protein complexes and their families
.
Nucleic Acids Res
2016
;(
D1
):
D144
53
.

8.

Burley
SK
, et al.
RCSB Protein Data Bank: Sustaining a Living Digital Data Resource that Enables Breakthroughs in Scientific Research and Biomedical Education
.
Protein Science A Publication of the Protein Society
,
2015
.

9.

Skjaerven
L
,
Yao
XQ
,
Scarabelli
G
, et al.
Integrating protein structural dynamics and evolutionary analysis with Bio3D
.
Bmc Bioinformatics
2014
;
15
(
1
):
399
.

10.

Ashburner
M
,
Ball
CA
,
Blake
JA
, et al.
Gene ontology: tool for the unification of biology
.
Nat Genet
2000
;
25
(
1
):
25
9
.

11.

Finn
RDAA
.
InterPro in 2017—beyond protein family and domain annotations
.
Nucleic Acids Res
2017
;
45
(
D1
):
D190
9
.

12.

Carbon
S
.
Expansion of the gene ontology knowledgebase and resources: the gene ontology consortium
.
Nucleic Acids Res
2017
;
45
:
D331
8
.

13.

Velankar
S
.
SIFTS: structure integration with function, taxonomy and sequences resource
.
Nucleic Acids Res
2013
;
41
:
483
9
.

14.

Rolf
A
, et al.
UniProt: the universal protein knowledgebase
.
Nucleic Acids Res
2004
;(
32
):
D115
.

15.

Rego
N
,
Koes
D
.
3Dmol.Js: molecular visualization with WebGL
.
Bioinformatics
2015
;
31
(
8
):
1322
4
.

16.

Luscombe and N
.
NUCPLOT: a program to generate schematic diagrams of protein-nucleic acid interactions
.
Nucleic Acids Res
1997
;
25
(
24
):
4940
5
.

17.

Li
W
,
Godzik
AJB
.
Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences
.
2006
;
22
(
13
):
1658
.

18.

Altschul
SF
, et al.
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs
.
Nucleic Acids Res
1997
;
25
(
17
):
3389
402
.

19.

Jones
D
.
Protein secondary structure prediction based on position-specific scoring matrices
.
J Mol Biol
1999
;
292
(
2
):
195
202
.

20.

Wei
L
,
Liao
M
,
Gao
X
, et al.
Enhanced protein fold prediction method through a novel feature extraction technique
.
IEEE Trans Nanobioscience
2015
;
14
(
6
):
649
59
.

21.

Holm
LCS
.
Removing near-neighbour redundancy from large protein sequence collections
.
Bioinformatics
1998
;
14
:
423
9
.

22.

Sussman
JL
,
Lin
D
,
Jiang
J
, et al.
Protein data Bank (PDB): database of three-dimensional structural information of biological macromolecules
.
Acta Crystallogr
2010
;
54
(
6–1
):
1078
84
.

23.

Xu
L
,
Liang
G
,
Liao
C
, et al.
An efficient classifier for Alzheimer’s disease genes identification
.
Molecules
2018
;
23
(
12
):
3140
.

24.

Xu
L
,
Liang
G
,
Liao
C
, et al.
A novel hybrid sequence-based model for identifying anticancer peptides
.
Genes
2018
;
9
(
3
):
158
.

25.

Tang
W
,
Wan
S
,
Yang
Z
, et al.
Tumor origin detection with tissue-specific miRNA and DNA methylation markers
.
Bioinformatics
2018
;
34
(
3
):
398
406
.

26.

Xu
L
,
Liang
G
,
Lian
C
, et al.
K-skip-n-gram-RF: a random forest based method for Alzheimer's disease protein identification
.
Front Genet
2019
;
10
:
33
. DOI:

27.

Cai
CZ
,
Han
LY
,
Ji
ZL
, et al.
SVM-Prot: web-based support vector machine software for functional classification of a protein from its primary sequence
.
Nucleic Acids Res
2003
;
31
(
13
):
3692
7
.

28.

Xu
L
,
Liang
G
,
Shi
S
, et al.
SeqSVM: a sequence-based support vector machine method for identifying antioxidant proteins
.
Int J Mol Sci
2018
;
19
(
6
):
1773
.

29.

Chou
KC
.
Some remarks on protein attribute prediction and pseudo amino acid composition
.
J Theor Biol
2011
;
273
(
1
):
236
47
.

30.

Liu
B
,
Liu
F
,
Fang
L
, et al.
repDNA: a python package to generate various modes of feature vectors for DNA sequences by incorporating user-defined physicochemical properties and sequence-order effects
.
Bioinformatics
2015
;
31
(
8
):
1307
.

31.

Chen
J
,
Peng
H
,
Han
G
, et al.
HOGMMNC: a higher order graph matching with multiple network constraints model for gene–drug regulatory modules identification
.
Bioinformatics
2018
;
35
(
4
):
602
10
.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)