-
PDF
- Split View
-
Views
-
Cite
Cite
Lei Xu, Shanshan Jiang, Jin Wu, Quan Zou, An in silico approach to identification, categorization and prediction of nucleic acid binding proteins, Briefings in Bioinformatics, Volume 22, Issue 3, May 2021, bbaa171, https://doi.org/10.1093/bib/bbaa171
- Share Icon Share
Abstract
The interaction between proteins and nucleic acid plays an important role in many processes, such as transcription, translation and DNA repair. The mechanisms of related biological events can be understood by exploring the function of proteins in these interactions. The number of known protein sequences has increased rapidly in recent years, but the databases for describing the structure and function of protein have unfortunately grown quite slowly. Thus, improving such databases is meaningful for predicting protein–nucleic acid interactions. Furthermore, the mechanism of related biological events, such as viral infection or designing novel drug targets, can be further understood by understanding the function of proteins in these interactions. The information for each sequence, including its function and interaction sites, were collected and identified, and a database called PNIDB was built. The proteins in PNIDB were grouped into 27 classes, such as transcription, immune system, and structural protein, etc. The function of each protein was then predicted using a machine learning method. Using our method, the predictor was trained on labeled sequences, and then the function of a protein was predicted based on the trained classifier. The prediction accuracy achieved a score of 77.43% by 10-fold cross validation.
Introduction
As the chief actors within the cells, proteins are involved in many essential activities in the cell, and the interactions between proteins with nucleic acid are extremely important for many cellular processes, such as transcription, translation and DNA repair. Therefore, the study of protein–nucleic acid binding activities can help with understanding protein interaction networks or even the mechanism of related cellular processes. With the development of sequencing technology, the amount of protein sequence information has increased rapidly over recent years, but the growth rate of databases describing the structure and function of proteins has been very slow, and cannot match the growth rate of protein sequence databases. Thus, it is essential to narrow the gap between sequence database size and functional database size by characterizing nucleic acid binding proteins and their functional groups. Therefore, a database called PNIDB (Protein–Nucleic Acid Interactions Database, PNIDB for short), in which DNA-binding proteins and RNA-binding proteins are denoted by gene ontology, was built. Sequence information was extracted, and an efficient classification method for predicting DNA-binding and RNA-binding proteins is proposed in this article. The process of our work is summarized in Figure 1.

As shown in Figure 1, the dataset was built including DNA-binding and RNA-binding proteins. Each sample was represented by a 473-dimensional vector describing the sequence information and secondary structural information. The parameters of a random forest model were trained using the training samples. The testing samples were then classified by the random forest model with the previously learned parameters.
There are studies focusing on identifying nucleic acid binding proteins and non-nucleic acid binding proteins, and the accuracy of existing methods has been improved over time [1]. However, these methods cannot distinguish the type of binding proteins, such as DNA-binding proteins or RNA-binding proteins, so it is meaningful to be able to predict the functions of these proteins. Based on existing research, our work focused on identifying and distinguishing DNA-binding proteins and RNA-binding proteins. Since the secondary structure of RNA is diverse, it is difficult to identify common characteristics of RNA-binding proteins [2, 3]. As a result, the problem of identifying RNA-binding proteins has been rarely considered in previous studies. Thus, using current methods, RNA-binding proteins are likely to be recognized as DNA-binding proteins [4]. In fact, the function of DNA-binding proteins is quite different from the function of RNA-binding proteins; thus, the study of distinguishing DNA-binding proteins and RNA-binding proteins should be considered. In our current study, RNA-binding proteins were divided into different classes depending on their characteristics, and the identification of DNA-binding proteins was discussed. Furthermore, for the purpose of revealing the biological functions of binding proteins in cellular activities, a protein–nucleic acid binding database called PNIDB was created and can be accessed online. The information, such as functional classification of protein chain and binding events at the sequence level, are all described in PNIDB. Moreover, the database can predict protein–nucleic acid binding events with the information given by PNIDB.
In contrast to previous work, the nucleic–acid binding proteins were further divided into RNA-binding proteins and DNA-binding proteins in the proposed method, and both of them were identified by gene ontology in PNIDB. The rationale was that the functions of proteins can be learned thoroughly when the proteins are identified precisely. The contributions of our work include:
(1) A database describing protein–nucleic acid interactions, known as PNIDB, which can be accessed by researchers at the website http://server.malab.cn/PNIDB/index.html. The information provided by PNIDB can help reveal the biological functions of binding proteins in cellular activities. The proteins in PNIDB are labeled by gene ontology identifiers.
(2) An efficient classifier for predicting DNA-binding proteins and RNA-binding proteins was proposed based on sequence information and secondary structural information, and the accuracy of the proposed method achieved a correlation of 77.43%, which outperforms other methods. The experimental results demonstrated that the combined information could improve the prediction accuracy of our method.
(3) A web server for the prediction of protein–nucleic acid was also developed. The web server was used to predict the function of proteins, which can help researchers in studying in protein–nucleic acid interactions.
In the rest of the paper, the usage manual of PNIDB is introduced first, and then the proposed methodology for the classification of functional proteins is illustrated in detail. The performance of our proposed method is demonstrated by experimental results. Finally, a conclusion is made at the end of the manuscript.
Usage of PNIDB
There are other databases that provide information regarding protein–DNA/RNA interactions. For example, the protein–DNA Interface database (PDIdb) [5] is a repository containing structural information for 922 protein–DNA complexes with a resolution of 2.5 Å or more (while in fact there are 2396 this kind of complexes in the database). The Nucleic Acid–Protein Interaction database (NPIDB) [6] [7] contains structural classifications and detailed information on both DNA–protein and RNA–protein complexes extracted from PDB. The current version of NPIDB contains 5046 structures, while PNIDB contains 6228 PDB structures overall. In contrast to the above databases, proteins are denoted by gene ontology in PNIDB.
PNIDB provides detailed atom-based interaction information. The significance of PNIDB is that it specifically focuses on sequence-level annotations and provides functional clustering, which should be of benefit for sequence-based research and functional prediction of protein–nucleic acid interactions.
PNIDB is a repository of protein–nucleic acid interaction information derived from 6798 nucleic acid-containing structures collected from the Protein Data Bank [8]. The Bio3D [9] package was used to read, analyze and manipulate PDB structures in R (http://www.R-project.org/). For each PDB file, we identified the proteins, and the nucleic acid chains were then extracted. Moreover, possible binding residues of the protein chain, which were defined as the residues with at least one atom within 5 Å from any nucleic acid atom, and corresponding binding nucleotides of the nucleic acid chain were also calculated in PNIDB. PNIDB can be accessed at the website http://server.malab.cn:8080/PNIDB/Statistics.html. There are totally 36 425 interactions in PNIDB, in which 40% of them are protein–RNA interactions and 60% of them are protein–DNA interactions. The information of PNIDB are summarized in Table 1. The class information of DNA-binding interactions and RNA-binding interactions are shown in Table 1.
Type of interaction . | Class . | No. of samples . |
---|---|---|
DNA-binding interactions | Cell cycle | 101 |
Gene regulation | 490 | |
Hydrolase | 3717 | |
Immune system | 234 | |
Isomerase | 635 | |
Ligase | 209 | |
Lyase | 552 | |
Metal-binding protein | 108 | |
Nuclear protein | 51 | |
Recombination | 312 | |
Oxidoreductase | 112 | |
Replication | 418 | |
Structural protein | 2161 | |
Transcription | 5013 | |
Transferase | 4727 | |
Transport protein | 37 | |
Viral protein | 185 | |
RNA-binding interaction | Splicing | 470 |
Ligase | 211 | |
Isomerase | 70 | |
Immune system | 223 | |
Hydrolase | 818 | |
Ribosome | 8409 | |
Transcription | 645 | |
Transferase | 957 | |
Translation | 468 | |
Viral protein | 411 |
Type of interaction . | Class . | No. of samples . |
---|---|---|
DNA-binding interactions | Cell cycle | 101 |
Gene regulation | 490 | |
Hydrolase | 3717 | |
Immune system | 234 | |
Isomerase | 635 | |
Ligase | 209 | |
Lyase | 552 | |
Metal-binding protein | 108 | |
Nuclear protein | 51 | |
Recombination | 312 | |
Oxidoreductase | 112 | |
Replication | 418 | |
Structural protein | 2161 | |
Transcription | 5013 | |
Transferase | 4727 | |
Transport protein | 37 | |
Viral protein | 185 | |
RNA-binding interaction | Splicing | 470 |
Ligase | 211 | |
Isomerase | 70 | |
Immune system | 223 | |
Hydrolase | 818 | |
Ribosome | 8409 | |
Transcription | 645 | |
Transferase | 957 | |
Translation | 468 | |
Viral protein | 411 |
Type of interaction . | Class . | No. of samples . |
---|---|---|
DNA-binding interactions | Cell cycle | 101 |
Gene regulation | 490 | |
Hydrolase | 3717 | |
Immune system | 234 | |
Isomerase | 635 | |
Ligase | 209 | |
Lyase | 552 | |
Metal-binding protein | 108 | |
Nuclear protein | 51 | |
Recombination | 312 | |
Oxidoreductase | 112 | |
Replication | 418 | |
Structural protein | 2161 | |
Transcription | 5013 | |
Transferase | 4727 | |
Transport protein | 37 | |
Viral protein | 185 | |
RNA-binding interaction | Splicing | 470 |
Ligase | 211 | |
Isomerase | 70 | |
Immune system | 223 | |
Hydrolase | 818 | |
Ribosome | 8409 | |
Transcription | 645 | |
Transferase | 957 | |
Translation | 468 | |
Viral protein | 411 |
Type of interaction . | Class . | No. of samples . |
---|---|---|
DNA-binding interactions | Cell cycle | 101 |
Gene regulation | 490 | |
Hydrolase | 3717 | |
Immune system | 234 | |
Isomerase | 635 | |
Ligase | 209 | |
Lyase | 552 | |
Metal-binding protein | 108 | |
Nuclear protein | 51 | |
Recombination | 312 | |
Oxidoreductase | 112 | |
Replication | 418 | |
Structural protein | 2161 | |
Transcription | 5013 | |
Transferase | 4727 | |
Transport protein | 37 | |
Viral protein | 185 | |
RNA-binding interaction | Splicing | 470 |
Ligase | 211 | |
Isomerase | 70 | |
Immune system | 223 | |
Hydrolase | 818 | |
Ribosome | 8409 | |
Transcription | 645 | |
Transferase | 957 | |
Translation | 468 | |
Viral protein | 411 |
Protein chains in interaction pairs were classified according to their mmCIF keywords, interaction type and gene ontology [10] terms. There were a total of 84 753 chains extracted from those structures, in which 20 927 chains contained nucleic acid binding residues. All the protein chains were clustered into 27 functional groups, with 17 kinds of DNA-binding proteins and 10 kinds of RNA-binding proteins. Moreover, each protein chain in these interactions was linked to their respective accession numbers from UniProt as well as the corresponding InterPro identifiers [11] and GO identifiers [10, 12] mapped from the SIFTS project [13].
For convenience, the residues and nucleotides are cited by their relative position in the sequence of their separate chains. In addition, the 2D and 3D visualization interfaces are provided online. Figure 2A and B shows the interfaces of the 2D and 3D visualization, respectively, in PNIDB. In Figure 2B, the visualization interface focuses on nucleic acid sequences. The binding protein residue and the position are highlighted in the figure. The residues exceeding 3.9 Å are considered binding residues.

(A) 3D visualization interface of the interaction between a protein and a nucleic acid residue. (B) 2D visualization interface of an interaction (solid line for hydrogen bond). (C) The web server for binding protein prediction. (D) Results provided by the web-based predictor. (E) Search page on the web server. (F) Comparison with other methods by ACC. (nucleic acids are labeled in red, while protein residues are in gray, and in the 2D visualization, the dash line denotes residues within 3.9 Å).
A search page is also provided online, and users can search by keywords. A quick search and advanced search were implemented. In quick search mode, users can start a search by specifying a keyword, such as a PDB ID [8] organism, interaction type, classification or Uniprot accession number [14]. In the advanced search mode, for the convenience of other researchers, a web server for binding protein prediction was developed. The web server can handle up to 10 fasta sequences at the same time. Then, the results are returned by email.
The web server is shown in Figure 2C. The results predicted by a learned classifier are shown in Figure 2D. The search page of PNIDB is provided, which is shown in Figure 2E. There are three options, ‘search by interaction type,’ ‘search by organism’ and ‘search by classification.’ Users can search for proteins by describing the requirements. Moreover, proteins can be searched by combining several parameters simultaneously. The matching results will be retrieved when the requirements are submitted. Furthermore, more information can be referred using PNIDB, such as the molecule name of the protein chain, the sequence of the protein chain, the binding residues of the protein chain, the sequence of the nucleic acid chain, the binding nucleotides of the nucleic acid chain, the corresponding InterPro IDs [11] of the protein chain, the GO identifiers of the protein chain and the 3D visualization with labels of contacting residues and nucleotides based on 3Dmol.js [15]. The schematic diagrams of protein–nucleic acid interactions based on NUCPLOT [16] can be obtained by clicking the tab on the webpage.
For the convenience of related study, all binding residues/nucleotides were renumbered according to their corresponding chain sequence. In addition, users can also browse the interactions in the browse page by selecting specific classifications of the protein chains in the menu on the left side.
Methodology
The method for predicting the function of proteins is described in this section. First, the benchmark dataset used is introduced. Then, the method of feature extraction is described. Lastly, the classification of the binding proteins is explained.
Benchmark dataset
The data used in this work was selected from the SwissProt dataset (https://web.expasy.org/docs/swiss-prot_guideline.html). The data in SwissProt contained GO protein sequences, which were selected from the UniProt dataset (https://www.uniprot.org/) with high confidence. The SwissProt dataset was composed of DNA-binding proteins and RNA-binding proteins. The DNA-binding proteins had non-IEA source gene ontologies. The sequences that were more similar were removed using CD-HIT [17]. The similarity degree between the sequences used in our experiments was less than 30%. The benchmark dataset contained five classes. The benchmark dataset used in these experiments is summarized in Table 2. The preprocess of benchmark dataset is shown in Figure 3.
Class . | No. of samples . |
---|---|
DNA-binding transcription factor | 200 |
Helicase activity | 256 |
Nuclease | 200 |
RBP spliceosomal complex | 199 |
RBP structural constituent of ribosome | 213 |
Class . | No. of samples . |
---|---|
DNA-binding transcription factor | 200 |
Helicase activity | 256 |
Nuclease | 200 |
RBP spliceosomal complex | 199 |
RBP structural constituent of ribosome | 213 |
Class . | No. of samples . |
---|---|
DNA-binding transcription factor | 200 |
Helicase activity | 256 |
Nuclease | 200 |
RBP spliceosomal complex | 199 |
RBP structural constituent of ribosome | 213 |
Class . | No. of samples . |
---|---|
DNA-binding transcription factor | 200 |
Helicase activity | 256 |
Nuclease | 200 |
RBP spliceosomal complex | 199 |
RBP structural constituent of ribosome | 213 |


![Process of random forest model generation Breiman L. Random Forests[J]. Machine Learning, 2001, 45(1):5–32.](https://oup.silverchair-cdn.com/oup/backfile/Content_public/Journal/bib/22/3/10.1093_bib_bbaa171/1/m_bbaa171f5.jpeg?Expires=1749236659&Signature=0f8QPNw6tQ9QexiU5d3bY8uv5UQIz7g9mkiEo61eRSq54Je5NZ2mvZTdA87rOuhjgZ3oGWO7ZPN0xtFZddPzIIIVj-vzhtoK7g9XlEtOxqNXb95rN6xzi7Ek9ofwbDehqvUZPbFs9sGfQM9E9DjO0L-qpOxSLvver39jun5oWeKle021mHUi9bHGnfP6Hb1~RNwTpK~cJDkxokX6RGEJ28ML-HpnD1Lh1hC1bzcSiQCrZAeag3ANjr67sMxYRucg0eniOJAJNwTd3Sm7XJ0xFm1tw789De7kUgKMS7OLLTZHJbAF494yEqrby11urRAKZEIIfAxyImmj0PuW11TKvQ__&Key-Pair-Id=APKAIE5G5CRDK6RD3PGA)
Process of random forest model generation Breiman L. Random Forests[J]. Machine Learning, 2001, 45(1):5–32.
Feature extraction
In the literature, residue sequences are usually represented by a vector v before the process of prediction. An efficient feature set is expected to distinguish positive samples and negative samples with high accuracy. The quality of a feature set is critical to the performance of any predictor. In our method, a sequence S is represented by the sequential evolution information and secondary structural information. There are totally 473 features extracted. 400 of them are evolution information, and 73 features are used to represent secondary structural information. The features used in our method were extracted from sequence S. The features include PSI-BLAST features [18] and PSI-PRED features [19]. PSI-BLAST describes the evolutional information, and the secondary structural information is shown by PSI-PRED. The combination of these features has been previously used for protein fold prediction [20].
Pj represents the background frequency of residue type j, and the background depends on the average occurrence frequency of all 20 amino acids in each sequence of the protein database PDB25 [22].
|${\delta}_{\mathrm{i}}$| is represented as the residue located at the ith position in an original sequence, SL, replaced by the |${\delta}_{\mathrm{i}}$|th amino acid in the amino acid alphabet. The sequence SL is transferred into a consensus sequence Scon by using |${\delta}_i$|. The frequency of |$\mathrm{each}\ {\delta}_i$| in the sequence is denoted as a feature. Thus, there are 20 amino acids, so 20 features are extracted from each sequence.
Ri represents the ith residue of a peptide. There are 20 types of native amino acids, which means that there are 20 possible values for Ri. Thus, the number of two consecutive amino acids RiRj is 400 (20 × 20), meaning that the number of dimensions describing the occurrence frequency of two constitute amino acids is 400. The 400-dimensional features have been widely used in the literature of bioinformatics, such as Alzheimer’s disease identification [23] and detection of anticancer peptides [24]. Thus, sequence information can be revealed using these 400-dimensional features. In contrast to using 400-dimensional features, the features used in our work were based on the consensus sequence Scon. The frequency of |${\delta}_i{\delta}_j$| is denoted as a feature. The frequency of the occurrence of |${\delta}_i{\delta}_j$| is denoted as a feature in v. Therefore, another 400 features are extracted from each sequence.
PSI-PRED-based features have been widely used in secondary structure prediction. The features include six structure–sequence-based features and a × 3n + 3n structure probability matrix-based features. The value of a is set to be 8, and n is 1. Thus, there are 33 PSI-PRED features. Therefore, there are 473 (20 + 20 + 400 + 33) features used to represent a sequence SL in total.
Classification
Support vector machine, naive Bayes and ensemble methods have all been widely used in bioinformatics, such as prediction of tumor detection [25], function prediction of proteins and disease detection [26]. The performance of a predictor is also related to the classifier used. Thus, an efficient classifier is critical for the performance of a computational predictor. In our work, a random forest model was used to predict the function of proteins.
Random forest models are a type of ensemble classifier. The key idea of random forest model is that a number of decision trees are used together for prediction (Figure 5). The decision trees are trained by the datasets, which are built based on bagging. Each decision tree makes a decision, and the final decision is made by a voting process. A sample is then classified into the class with the most votes. The process of a random forest model is shown in Figure 3.
Results
To demonstrate the efficiency of our proposed model, our proposed method is compared with other methods, which have been used widely in the literature.
188D is proposed to extract features from a sequence [27]. The amino acid composition, distribution and physicochemical property are described in 188D. This method has been used in bioinformatics, such as for the identification of antioxidant proteins [28].
The Kmer (k = 2) method extracts features representing the occurrence frequency of k consecutive amino acid in a residue. In our experiments, k was set to 2. The number of dimensions of Kmer features is (n – k) + 1, where n is the length of residue.
PC-PseAAC was proposed based on pseudo amino acid components (PseAAC) and has been used in protein identification [29]. The information on location residue and global residue are mixed into PseAAC in PC-PseAAC.
An autocorrelation (AC) is the correlation between any two residues with distance lag on the same properties [30].
The experiments were based on a 10-fold cross validation. In a 10-fold cross validation, a dataset is divided into 10 parts. Ninety percent of samples are then used for training parameters, and the remaining 10% are considered as testing samples.
The experimental results (Figures 2F and 4) show that our proposed method performed better than other feature sets. The combination of sequence information features improved the prediction accuracy. In fact, the features used in Kmer (k = 2) were a part of the features in 473D. In the experimental results, the accuracy was improved by nearly 19% (((77.43–65.07)/65.07)%) when the sequences were represented by the PSI-BLAST and PSI-PRED features compared with only using the Kmer (k = 2) features. The sequence information was also extracted in 188D, so the accuracy was 0.69, which outperforms other existing methods except 473D. The accuracy of AC and CC were 0.41 and 0.47, respectively. The accuracy of the combination of AC and CC was 0.4625, which was not as good as other methods, such as 188D. In the experiments, the features of AC and CC were not suitable for predicting the function of proteins. The sequence information and secondary structural information were helpful for improving the accuracy and have been used in our proposed method.
Conclusions
Due to the rapid growth in the number of protein sequences without identification of their functions, a database describing the protein–nucleic acid interactions (PNIDB) was provided in our work. The functions of sequences were labeled using GO identifiers in PNIDB. PNIDB provides a convenient and user-friendly interface to query and browse detailed information on protein–nucleic acid interactions. Different from existing databases, PNIDB focuses on both protein–DNA and protein–RNA interactions, and the functional classifications are considered at the sequence level. Moreover, a benchmark database is available for the prediction of protein–nucleic acid binding events at either the protein residue level or nucleotide level. PNIDB will also aid in the functional prediction of nucleic-binding proteins based on protein sequence and may help for providing putative drug targets and novel therapy options. The problem of classification of DNA-binding proteins and RNA-binding proteins was also considered in this work. The sequences are represented by PSI-BLAST features and PSI-PRED features, and a random forest model was used to predict the type of protein, such as DNA-binding proteins and RNA-binding proteins. The accuracy of our proposed method was 0.774, which performs better than other methods. A web server for protein prediction was provided online for the convenience of other researchers. Above all, PNIDB labeled by gene ontology identifiers was built for describing the function of proteins, and a computational predictor was developed for classifying DNA-binding proteins and RNA-binding proteins.
A database describing protein–nucleic acid interactions, known as PNIDB, is built and can be accessed.
An efficient classifier for predicting DNA-binding proteins and RNA-binding proteins was proposed based on sequence information and secondary structural information, and the accuracy of the proposed method achieved a correlation of 77.43%, which outperforms other methods.
A web server for the prediction of protein–nucleic acid was also developed. The web server was used to predict the function of proteins, which can help researchers in studying protein–nucleic acid interactions.
Availability and Implementation
PNIDB is now fully working and can be freely accessed at http://server.malab.cn/PNIDB/index.html. All the data are publicly available for noncommercial use, distribution and reproduction in any medium.
Funding
This work was supported by the Natural Science Foundation of China (Nos. 61902259, 61771331), the Natural Science Foundation of Guangdong Province (grant no. 2018A0303130084) and the Science and Technology Innovation Commission of Shenzhen (grant no. JCYJ20170818100431895).Conflicts of Interest: The authors declare no conflicts of interest.
Lei Xu is an associate professor at the School of Electronic and Communication Engineering, Shenzhen Polytechnic. She received her BSc and MSc from the School of Computer Science and Technology in Harbin Institute of Technology in 2006 and 2008, respectively. She got her PhD degree from the Department of Computing, The Hong Kong Polytechnic University, in October 2013. Her research interests are focused on bioinformatics and pattern recognition.
Shanshan Jiang is a research student in Peking University. She majors in bioinformatics, machine learning and algorithms.
Quan Zou is a professor in Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China. He is a senior member of IEEE and ACM. He majors in bioinformatics, machine learning and algorithms.
Jin Wu is an assistant professor at the School of Management, Shenzhen Polytechnic. Her research interests are focused on bioinformatics and biology.