Kai-Yao Huang, Hui-Ju Kao, Tzu-Hsiang Weng, Chia-Hung Chen, Shun-Long Weng, iDVIP: identification and characterization of viral integrase inhibitory peptides, Briefings in Bioinformatics, Volume 23, Issue 6, November 2022, bbac406, https://doi.org/10.1093/bib/bbac406
Abstract
Antiretroviral peptides are bioactive peptides that inhibit retroviruses through various mechanisms. Among them, viral integrase inhibitory peptides (VINIPs) are a class of antiretroviral peptides that block the action of integrase proteins, which is essential for retroviral replication. Although the number of experimentally verified bioactive peptides has increased significantly, in silico machine learning approaches that can effectively predict peptides with integrase inhibitory activity are still lacking. Here, we have developed the first prediction model for identifying novel VINIPs from sequence characteristics, and a hybrid feature set was used to improve the predictive ability. The performance was evaluated by 5-fold cross-validation on the training dataset, and the results indicate that the proposed model is capable of predicting VINIPs, with a sensitivity of 85.82%, a specificity of 88.81%, an accuracy of 88.37%, a balanced accuracy of 87.32% and a Matthews correlation coefficient value of 0.64. Most importantly, the model also performs consistently in independent testing. To sum up, we propose the first computational approach for identifying and characterizing VINIPs, which can be considered novel antiretroviral therapy agents. Finally, to facilitate further research and development, iDVIP, an automatic computational tool that predicts VINIPs, has been developed and is freely available at http://mer.hc.mmh.org.tw/iDVIP/.
Introduction
Acquired immune deficiency syndrome (AIDS) is a life-threatening disease caused by the human immunodeficiency virus (HIV), a type of retrovirus. In 2020, HIV/AIDS was still a major burden of disease in low-income countries worldwide, where it is the most frequent cause of death in many countries besides ischemic heart disease [1]. Some chemotherapy drugs can temporarily control HIV infection. Some of them act as direct-acting antiviral agents (DAAs) by targeting viral components in the viral life cycle [2, 3], and others, the host-targeting agents, exert their antiviral activities mainly by inhibiting host factors that are essential for viral replication and immunomodulation [4, 5]. For retroviruses, the integration of the retroviral genome into the chromatin of the infected host cell is an imperative step in the replication cycle, and the process is catalyzed by the virus-encoded integrase protein (IN) [6]. The retroviral IN is a key component of the pre-integration complex and is involved in several steps of retrovirus replication, such as reverse transcription, nuclear import, chromatin targeting and integration; it is thus a potential therapeutic target of DAAs in retrovirus infection [7].
Over the past 15 years, some drugs have been designed and used to specifically inhibit the IN protein; these small molecule inhibitors belong to a class of chemical compounds called integrase strand transfer inhibitors (INSTIs), including raltegravir, elvitegravir, dolutegravir, bictegravir and cabotegravir [8]. Because of their high potency, low toxicity and good tolerability, INSTIs are among the most popular drugs for the treatment of people with retroviral infection [9]. However, most of them carry the risk of complications and side effects, including neuropsychiatric symptoms, weight gain, glucose and lipid metabolic disorders, type II diabetes mellitus, liver abnormalities, renal adverse effects, gastrointestinal symptoms, pancreatic injury, rhabdomyolysis and even Stevens–Johnson syndrome [10–15]. Therefore, routine monitoring of kidney function via blood and urine tests is recommended for anyone on an INSTI-based regimen. In addition, although INSTIs are highly efficacious in patients with HIV infection, some studies have reported that taking them may lead to drug-resistant strains of HIV [16].
Owing to advances in structural biology and analytic technologies, the process of peptide drug development has accelerated significantly in recent decades. The key advantage of peptides over small molecules is their larger surface area and greater chiral and structural complexity; thus, peptide drugs have become the fastest growing class of new pharmaceutics used for treating chronic diseases, cancers and infectious diseases caused by microorganisms [17]. More than 80 FDA-approved peptide drugs have been developed for a wide range of diseases, including diabetes, cancer, osteoporosis, multiple sclerosis and chronic pain [18]. Among them, enfuvirtide is a synthetic peptide that acts as an antiretroviral agent by interfering with the fusion of viral and host cellular membranes [19]. Unfortunately, there is still no available peptide drug that treats retroviral infection by inhibiting the viral IN protein.
It is clear that the field of peptide-based therapeutics has grown significantly over the past few decades, yet there remains much scope for improvement before such drugs can be widely used in clinical applications [20]. Peptide-based drugs have many intrinsic limitations, relating to metabolic stability, pH and temperature instability, toxic profiles, functional promiscuity, target specificity, binding affinity and so forth. A notable limitation is low membrane permeability: compared with small molecule drugs, peptides lack the advantage of small size, which makes it more difficult for them to penetrate the cell membrane and reach intracellular components [21]. In addition, in the absence of well-defined secondary and tertiary structures, the amide bonds of short linear peptides can be easily hydrolyzed by enzymes, which strongly reduces peptide stability and results in a short half-life in vivo [22]. However, the clinical use of small molecules is also limited by their low specificity and high toxicity compared with peptide drugs [23]. By contrast, the physical and chemical properties of peptides, including high bioactivity and specificity, strong solubility and low toxicity, render them a novel category of drugs [21, 24].
Bioinformatics approaches are now widely applied to various aspects of the drug development process, and it has even been proposed that using machine learning tools to identify drug candidates could reduce costs by up to 50% [25]. In recent years, some machine learning approaches have been developed to discover peptides with functional activity against viruses. Lissabet et al. developed a bioinformatic tool called Anti-VPP [26] for the assessment of antiviral peptide (AVP) candidates, which uses several physicochemical properties for training and validation of a Random Forest model. Meta-iACP [27] was claimed as the first meta-based approach for the prediction of AVPs; it derives a set of prediction scores from multiple algorithms and various types of features to provide an accurate prediction from given sequences. To improve the predictive performance, Timmons and Hewage designed ENNAVIA [28], a deep neural network classifier for the prediction of antiviral activity, which leverages advances in deep learning and cheminformatics of AVPs and yields effective results in the external test. Additionally, AVP-IC50 Pred [29], a regression-based method, has been developed to predict the antiviral activity of peptides based on their IC50 values; it employs multiple machine learning algorithms and achieves good correlation coefficient values on experimentally proven datasets. Subsequently, Pang et al. proposed a double-stage AVP classification web tool named AVPIden [30]: the first stage distinguishes AVPs from non-AVPs, and the second stage determines their potential functions against six virus families and eight kinds of viruses. Although some methods have been developed for identifying peptides with antiviral or antiretroviral activities, no study has so far specifically focused on the discovery of peptides with integrase inhibitory activity.
Viral integrase inhibitory peptides (VINIPs) are a class of antiretroviral peptides (ARVPs) that have the ability to block the action of integrase; they can potentially be used in antiretroviral therapy (ART). However, to the best of our knowledge, no previous work has focused on constructing a model to predict integrase inhibitory peptides, and their sequence and structural characteristics are still poorly investigated. In addition, driven by advances in mass spectrometry instrumentation and analytic methodologies, the number of peptide sequences collected in public databases has been growing exponentially for decades. Therefore, in this study, we performed a comprehensive analysis of these characteristics and developed the first model for identifying novel peptides with inhibitory activity against IN proteins.
Results and discussion
The workflow is shown in Figure 1 and comprises the following steps: data collection and preprocessing, investigation of the features, construction of the VINIP prediction model, performance evaluation of the proposed model, independent testing, construction of the INI-type classification model and implementation of a web-based tool. The details of each step are described below.

Flowchart of identification and characterization of the VINIPs.
The statistics of experimental data
Several databases have been established to provide the information and sequences of AVPs, including APD3 [31], AVPdb [32], CAMPR3 [33], DBAASP v3 [34], dbAMP [35], DRAMP 2.0 [36] and SATPdb [37], but HIPdb [38] is the only database that contains peptide sequences annotated with integrase inhibitory activity. As shown in Table 1, a total of 203 experimentally verified VINIP sequences were collected from HIPdb as the positive dataset, and 1010 verified non-VINIP sequences in the AVPdb and HIPdb databases were collected as the negative dataset in this study. In addition, the organism source of each peptide was collected; the detailed information is presented in Supplementary Table S1 available online at http://bib.oxfordjournals.org/. Furthermore, to investigate the differences in properties between the VINIPs and the AVPs and other ARVPs that block retroviral infection by other mechanisms, a total of 603 reviewed AVPs and ARVPs were obtained from HIPdb as a second negative dataset. According to the functional annotation from the databases, the ARVP dataset was composed of peptides that, apart from integrase inhibitory activity, have broad activities against various viruses through multiple mechanisms. For example, some of them are known as fusion inhibitors that can prevent measles morbillivirus (MeV), human parainfluenza virus (HPIV), BK virus (BKV) and Epstein–Barr virus (EBV) from fusing with the host cell, and some also act as replication inhibitors that restrict the replication of vaccinia virus (VACV), hepatitis C virus (HCV), Japanese encephalitis virus (JEV), influenza A virus (IAV), influenza B virus (IBV), influenza C virus (ICV) and Sendai virus (SeV).
Additionally, some of these peptides can block viral entry and thereby prevent infection by some of the deadliest viruses, including SARS coronavirus (SARS-CoV), dengue virus (DENV), Ebola virus (EBOV), cowpox virus (CPXV), monkeypox virus (MPXV), human papillomavirus (HPV) and HIV. Briefly, the peptides in the ARVP dataset have not only antiretroviral activity but also inhibitory activity against various other viruses. After removing the redundancy (see Table 1), a total of 110 VINIPs and 640 non-VINIPs were combined into the training set for developing the VINIP prediction model, and 12 VINIPs and 71 non-VINIPs were assigned to the positive and negative groups of the testing dataset, respectively. Moreover, the remaining ARVP sequence records were also divided into two datasets for the construction of the INI-type prediction model: 459 sequences were assigned to the training set and 51 to the testing set.
Dataset | Number of VINIPs | Number of non-VINIPs | Number of AVPs/ARVPs |
---|---|---|---|
Raw data | 203 | 1010 | 603 |
VINIP Training dataset | 110 | 640 | - |
VINIP Testing dataset | 12 | 71 | - |
INI-type Training dataset | 110 | - | 459 |
INI-type Testing dataset | 12 | - | 51 |
Investigation of the composition of amino acids
Several studies have shown that sequence-based features are very effective for the prediction of protein and peptide functions; several of these features were used for discriminating the VINIPs from the non-VINIPs and the ARVPs in this work, including the composition of amino acids (AAC) [39], dipeptides (DPC) [40] and k-spaced amino acid pairs (CKSAAP) [41]. After the data preprocessing, the frequency of occurrence of each of the 20 amino acids was calculated to investigate the components of the sequences. Figure 2 shows the comparison of the amino acid composition among the VINIP, non-VINIP and ARVP sequences, and P-values from the Student’s t-test are shown in Supplementary Table S2 available online at http://bib.oxfordjournals.org/. The comparison indicates that the hydrophobic residues A (Alanine) and I (Isoleucine) are more enriched in the VINIPs than in the others. The aromatic amino acids F (Phenylalanine) and W (Tryptophan) also occur more frequently in the VINIPs, and a low level of the aliphatic and non-polar amino acid P (Proline) was identified. The amino acid Y (Tyrosine), which exhibits both polar and hydrophilic characteristics, has a higher frequency of occurrence in the VINIPs; in contrast, the amino acids S (Serine) and T (Threonine) are scarce in the VINIPs. In addition, the charge-neutral and polar amino acid Q (Glutamine) shows a higher occurrence rate in all ARVPs (both in VINIP and ARVP sequences) compared to the non-VINIPs; conversely, the amino acid C (Cysteine) shows a lower occurrence rate. It is worth mentioning that the result also shows a lower proportion of the negatively charged amino acid E (Glutamic acid) in the VINIPs compared with the other ARVPs, which is an important feature for discriminating the ARVPs with integrase inhibitory activity from the others.

Comparison of the amino acid composition among the VINIPs, non-VINIPs and ARVPs.
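The amino acid composition described above can be illustrated with a short sketch. This is not the authors' code; the function name `aac` and the choice to ignore non-standard residues are assumptions made for illustration, and the example peptide string is used only as sample input.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(sequence: str) -> dict:
    """Return the fraction of each standard amino acid in a peptide.

    Assumption: residues outside the 20 standard letters are ignored.
    """
    seq = sequence.upper()
    counts = Counter(res for res in seq if res in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


# Example: a 12-residue peptide containing three tryptophans.
composition = aac("ACWWWAGIKQEF")
```

Each peptide is thereby mapped to a 20-dimensional vector whose entries sum to one, which is the form in which group-wise frequencies such as those in Figure 2 can be compared.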
Investigation of the composition of k-spaced amino acid pairs
By analyzing the frequencies of occurrence of amino acid pairs, we can estimate their relative significance and capture the characteristics of these peptides. For each peptide sequence, we measured the composition of amino acid pairs separated by k other residues, with k = 0, 1, 2 and 3. As shown in Figure 3 and Supplementary Figure S1 available online at http://bib.oxfordjournals.org/, the pairwise comparisons of the frequencies of occurrence of the 400 k-spaced amino acid pairs between the VINIPs and non-VINIPs, and between the VINIPs and AVPs/ARVPs, are displayed in 20 × 20 matrices; P-values from the Student’s t-test are shown in Supplementary Tables S3 and S4 available online at http://bib.oxfordjournals.org/, and the enriched and suppressed pairs are marked in red and green, respectively. When the gap size is zero (k = 0), the analysis corresponds to the composition of dipeptides. Dipeptides composed of an aliphatic amino acid and another residue are over-represented in the VINIPs compared to the non-VINIPs, such as AC, AE, AG, AY, IR, IH, II, IL, IK, IY, LA, LD, LI, LL, LK and LF. As the gap size increases, significant frequency differences between the VINIPs and non-VINIPs are found when k = 1 for the pairs AxR, AxI, AxF, AxW, IxN, IxE, IxM, LxG, LxI, LxM, WxA, WxG and YxL; when k = 2, the pairs AxxL, AxxK, AxxW, IxxQ, IxxE, IxxI, IxxL, LxxA, LxxR, LxxL, LxxM and YxxL, and when k = 3, the pairs AxxxA, AxxxE, CxxxG, GxxxE, IxxxD, IxxxF, IxxxY, LxxxA, LxxxG, LxxxW, FxxxL, WxxxI, WxxxK, YxxxI and YxxxK are enriched in the VINIPs. This result suggests that a large number of amino acid pairs containing aliphatic residues are significantly enriched in the VINIP sequences, which is one of the important features for distinguishing the VINIPs from the non-VINIPs.

Comparison of the frequencies of occurrence of 20 × 20 amino acid pairs separated by k residues between the VINIPs and non-VINIPs.
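As a minimal sketch of the k-spaced pair composition discussed above, the function below counts ordered residue pairs separated by k intervening residues. It is not the authors' implementation; the name `cksaap` and the normalization by the number of pair windows (sequence length minus k minus 1) are assumptions that follow a common convention.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs of standard amino acids.
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def cksaap(sequence: str, k: int) -> dict:
    """Composition of amino acid pairs separated by k other residues.

    k = 0 gives the dipeptide composition (DPC).
    Assumption: frequencies are normalized by the number of windows,
    and pairs containing non-standard residues are skipped.
    """
    seq = sequence.upper()
    n_windows = len(seq) - k - 1
    if n_windows <= 0:
        raise ValueError("sequence too short for this gap size")
    counts = dict.fromkeys(PAIRS, 0)
    for i in range(n_windows):
        pair = seq[i] + seq[i + k + 1]
        if pair in counts:
            counts[pair] += 1
    return {pair: c / n_windows for pair, c in counts.items()}


# Example on a short illustrative peptide.
dpc = cksaap("LLKLL", k=0)   # adjacent pairs: LL, LK, KL, LL
c1 = cksaap("LLKLL", k=1)    # one-residue gap: LxK, LxL, KxL
```

With this encoding, a pair written as LxxL in the text corresponds to the key "LL" in the k = 2 vector.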
Investigation of the position-specific amino acid composition of the N- and C-terminal regions
To better understand the mechanisms of molecular recognition through disordered regions, the frequency of occurrence of the amino acids at each position of the N- and C-termini was compared pairwise, between the VINIPs and the non-VINIPs and between the VINIPs and the other ARVPs, based on the two training datasets. The graphical representation was created using TwoSampleLogo software [42], which shows the position-specific AAC of the first and last five amino acids in the peptide sequences. Figure 4 indicates that the polar and charged amino acids R (Arginine), E (Glutamate), K (Lysine) and D (Aspartate) are enriched at position −2 of the C-terminal end of the VINIPs, and that the frequencies of occurrence of Q and G (Glycine) at position −3 are higher than in the non-VINIPs and the ARVPs. On the other hand, a remarkable enrichment of non-polar residues such as A, F, L and W was observed at the N-terminus of the VINIPs. Additionally, Q, A and C residues are also more commonly found at the N-terminus of the VINIPs. Furthermore, to discover the potential integrase-binding interfaces of the VINIPs, a sequence motif discovery tool, STREME [43], was applied to identify the motifs in the VINIP sequences that are most likely to play a major role in inhibiting viral IN proteins. Five short motifs were identified: TAYFLLKLAGRW, SLKIDNLD, ESMNKELKKI, DQAEHLKT and ACWWWAGIKQEF. These motifs provide insight into the interplay between the VINIPs and IN proteins, although the possible mechanisms need to be further investigated through experimental work; the detailed information is presented in Supplementary Table S5 available online at http://bib.oxfordjournals.org/.

Pairwise comparisons of the composition of amino acids at the N- and C-terminal regions between the VINIPs and the non-VINIPs and between the other ARVPs.
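The terminal-region analysis above considers the first and last five residues of each peptide. The sketch below shows one way such a terminal composition could be encoded; it assumes that the feature is simply the amino acid composition of the five-residue window, which the text does not spell out, and the function name `terminal_aac` is hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def terminal_aac(sequence: str, n: int = 5, terminus: str = "N") -> list:
    """Amino acid composition of the first (N) or last (C) n residues.

    Assumption: peptides shorter than n contribute their whole sequence.
    Returns a 20-element list ordered as AMINO_ACIDS.
    """
    seq = sequence.upper()
    if terminus == "N":
        window = seq[:n]
    elif terminus == "C":
        window = seq[-n:]
    else:
        raise ValueError("terminus must be 'N' or 'C'")
    if not window:
        raise ValueError("empty sequence")
    return [window.count(aa) / len(window) for aa in AMINO_ACIDS]


# Example: the windows are "TAYFL" (N-terminal) and "LAGRW" (C-terminal).
n5 = terminal_aac("TAYFLLKLAGRW", terminus="N")
c5 = terminal_aac("TAYFLLKLAGRW", terminus="C")
```

A position-specific comparison such as the one drawn by TwoSampleLogo would instead keep each of the five positions separate; the composition above discards that positional information.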
Cross-validation performance of the VINIP models trained with individual feature
To determine the capability of the investigated features to discriminate between VINIPs and non-VINIPs, we trained models on each feature subset and validated them using five repetitions of 5-fold cross-validation. With reference to recent works on antimicrobial peptide prediction [44, 45], each peptide sequence was encoded using the various feature encoding methods, and multiple algorithms such as support vector machine (SVM) [46], random forest (RF) [47], k-nearest neighbors (KNN) [48], decision tree (DT) [49], adaptive boosting (AdaBoost) [50], classification and regression trees (CART) [51], naive Bayes classifier (NB) [52] and extreme gradient boosting (XGBoost) [53] were applied to build the prediction models. The complete comparison among the various methods is presented in detail in Supplementary Table S6 available online at http://bib.oxfordjournals.org/. In addition, SVM models trained with amino acid composition using different kernel functions were compared, and the radial basis function (RBF) kernel provided the best performance; thus, the SVM with the RBF kernel was employed in this study for comparison with the other machine learning algorithms (see Supplementary Table S7 available online at http://bib.oxfordjournals.org/). The models trained with the SVM and RF algorithms provided better performances than the other models; thus, the VINIP classification using these two classifiers is discussed further. As presented in Table 2, the SVM model trained with AAC provides an acceptable result with a sensitivity of 74.91%, a specificity of 76.53%, an accuracy of 76.29%, a balanced accuracy of 75.72% and a Matthews correlation coefficient (MCC) of 0.39 in classifying between the VINIPs and non-VINIPs.
The models trained with the CKSAAP features achieve the highest performances, of which the SVM model trained with C0SAAP (DPC) provides the best predictive performance with a sensitivity of 84.55%, a specificity of 87.84%, an accuracy of 87.36%, a balanced accuracy of 86.19% and an MCC of 0.61. The SVM models trained with C1SAAP, C2SAAP and C3SAAP also provide good performances, with balanced accuracy values of 84.60%, 82.30% and 85.39%, respectively. Moreover, the performance of the SVM models trained with the compositions of amino acids at the N- or C-terminus of peptides is also given in Table 2; the balanced accuracy values were 74.67% and 75.00% for the N5AAC and C5AAC models, respectively, which is not satisfactory. On the other hand, the RF model trained with AAC gives a strong performance with a sensitivity of 87.27%, a specificity of 84.91%, an accuracy of 85.25%, a balanced accuracy of 86.09% and an MCC of 0.58, but the performances of the RF models trained with the other features were slightly lower than those of the SVM models. The results suggest that the sequence-based features are applicable for characterizing peptides with integrase inhibitory activity, but that the sequence-based features at the N- and C-terminus provide only limited characterization power.
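To make the validation scheme concrete, the following sketch generates the index splits for five repetitions of 5-fold cross-validation on a training set of 110 positives and 640 negatives. It assumes stratified folds (each fold keeps the class ratio), which the text does not state explicitly, and the function name and seed are illustrative choices.

```python
import random


def repeated_stratified_kfold(labels, n_splits=5, n_repeats=5, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) tuples.

    Assumption: folds are stratified, i.e. the indices of each class are
    shuffled and dealt round-robin into the folds in every repetition.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    for rep in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        for indices in by_class.values():
            shuffled = indices[:]
            rng.shuffle(shuffled)
            for pos, idx in enumerate(shuffled):
                folds[pos % n_splits].append(idx)
        for f in range(n_splits):
            test_idx = sorted(folds[f])
            train_idx = sorted(
                i for g in range(n_splits) if g != f for i in folds[g]
            )
            yield rep, f, train_idx, test_idx


# Training set sizes taken from Table 1: 110 VINIPs and 640 non-VINIPs.
labels = [1] * 110 + [0] * 640
splits = list(repeated_stratified_kfold(labels))
```

A classifier would be fitted on `train_idx` and scored on `test_idx` for each of the 25 splits; the mean and standard deviation of each metric over the repetitions then give values of the form reported in Table 2.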
The results of 5-fold cross-validation of the VINIP models trained by the individual feature
Classifier | Feature | Sensitivity (%) | Specificity (%) | Accuracy (%) | B.Accuracy^a (%) | MCC |
---|---|---|---|---|---|---|
SVM | AAC | 74.91 ± 0.0152 | 76.53 ± 0.0137 | 76.29 ± 0.0123 | 75.72 ± 0.0111 | 0.39 ± 0.0197 |
SVM | N5AAC | 73.27 ± 0.0246 | 76.06 ± 0.0153 | 75.65 ± 0.0137 | 74.67 ± 0.0149 | 0.38 ± 0.0243 |
SVM | C5AAC | 74.18 ± 0.0189 | 75.81 ± 0.0123 | 75.57 ± 0.0098 | 75.00 ± 0.0092 | 0.38 ± 0.0147 |
SVM | DPC | 84.55 ± 0.0241 | 87.84 ± 0.0034 | 87.36 ± 0.0047 | 86.19 ± 0.0122 | 0.61 ± 0.0180 |
SVM | C1SAAP | 82.73 ± 0.0249 | 86.47 ± 0.0070 | 85.92 ± 0.0057 | 84.60 ± 0.0116 | 0.58 ± 0.0168 |
SVM | C2SAAP | 81.09 ± 0.0217 | 83.50 ± 0.0092 | 83.15 ± 0.0103 | 82.30 ± 0.0145 | 0.52 ± 0.0257 |
SVM | C3SAAP | 84.00 ± 0.0228 | 86.78 ± 0.0132 | 86.37 ± 0.0100 | 85.39 ± 0.0098 | 0.59 ± 0.0186 |
RF | AAC | 87.27 ± 0.0144 | 84.91 ± 0.0085 | 85.25 ± 0.0086 | 86.09 ± 0.0102 | 0.58 ± 0.0199 |
RF | N5AAC | 73.64 ± 0.0232 | 74.53 ± 0.0126 | 74.40 ± 0.0119 | 74.08 ± 0.0142 | 0.36 ± 0.0228 |
RF | C5AAC | 73.45 ± 0.0100 | 74.38 ± 0.0093 | 74.24 ± 0.0079 | 73.91 ± 0.0063 | 0.36 ± 0.0108 |
RF | DPC | 87.27 ± 0.0129 | 84.28 ± 0.0070 | 84.72 ± 0.0061 | 85.78 ± 0.0070 | 0.58 ± 0.0127 |
RF | C1SAAP | 85.27 ± 0.0076 | 82.81 ± 0.0055 | 83.17 ± 0.0043 | 84.04 ± 0.0034 | 0.54 ± 0.0065 |
RF | C2SAAP | 82.18 ± 0.0138 | 82.38 ± 0.0013 | 82.35 ± 0.0029 | 82.28 ± 0.0074 | 0.51 ± 0.0111 |
RF | C3SAAP | 85.64 ± 0.0149 | 84.88 ± 0.0064 | 84.99 ± 0.0066 | 85.26 ± 0.0092 | 0.57 ± 0.0167 |
^a B.Accuracy, balanced accuracy. The values represent the mean and standard deviation of all measurements.
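The five measures reported in the table can all be derived from a confusion matrix. The sketch below uses the standard definitions; the counts in the example are hypothetical values for a 110-positive, 640-negative set, chosen only for illustration and not taken from the study.

```python
import math


def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Standard binary classification measures from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)            # true positive rate
    specificity = tn / (tn + fp)            # true negative rate
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    balanced_accuracy = (sensitivity + specificity) / 2
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "balanced_accuracy": balanced_accuracy,
        "mcc": mcc,
    }


# Hypothetical counts for 110 positives and 640 negatives (illustration only).
m = binary_metrics(tp=93, fn=17, tn=562, fp=78)
```

Because the negative class is about six times larger than the positive class, accuracy is dominated by specificity, which is why balanced accuracy and MCC are reported alongside it.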
Performance evaluation of the VINIP models trained with the hybrid feature set
According to the above results, the models trained with the sequence-based features perform well in classification on the training dataset. However, based on previous experience [54, 55], models trained with a hybrid feature set can give a better performance in terms of average accuracy than models using individual features. Thus, to enhance the predictive capability, these features were combined additively or in a more integrative manner and were applied to the classifier. As shown in Table 3, improved performance was obtained from the models trained with combinations of the sequence-based properties of the peptides, and the complete comparison among the various methods is presented in detail in Supplementary Table S8 available online at http://bib.oxfordjournals.org/. The SVM model trained with the combination of AAC and DPC features provides a sensitivity of 85.09%, a specificity of 87.53%, an accuracy of 87.17% and an MCC of 0.61. The SVM model trained with the compositions of N- and C-terminal residues gives an unsatisfactory performance with a sensitivity of 76.73%, a specificity of 83.09%, an accuracy of 82.16% and an MCC of 0.48, and there is no indication that adding these two features to AAC and DPC improves the prediction of the VINIPs. Notably, the SVM model trained with the combination of AAC, DPC and CKSAAP exhibits the best overall performance with a sensitivity of 85.82%, a specificity of 88.81%, an accuracy of 88.37% and an MCC of 0.64. The result suggests that a model trained with the hybrid feature set can exhibit higher performance than one trained with a single feature set in the prediction of the VINIPs. Moreover, Figure 5 shows the comparison of receiver operating characteristic (ROC) curves among the models trained with the different combinations of feature sets, and the area under the ROC curve (AUC) for each model was measured.
In summary, the model trained with AAC, DPC and CKSAAP enhances the predictive performance for distinguishing between the VINIPs and non-VINIPs. In addition, to estimate the domain of applicability of the proposed model, a multivariate outlier detection technique, the local outlier factor (LOF) algorithm [56], was used to evaluate the degree of deviation. The LOF values ranged from 0.989 to 2.597, with a mean value of 1.208 and a standard deviation of 0.179, and only a very few peptides (6.53%) were considered outliers with a value higher than the mean.
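For readers unfamiliar with the local outlier factor, the sketch below implements a simplified version of it on small numeric vectors. It is a generic illustration, not the authors' pipeline: the neighborhood size `k`, the Euclidean distance, the toy points and the simplification that ties at the k-th distance are not expanded are all assumptions.

```python
import math


def local_outlier_factor(points, k=3):
    """Simplified LOF: scores near 1 are inliers, larger scores are outliers.

    Assumptions: Euclidean distance, exactly k neighbors per point (ties at
    the k-th distance are not expanded), and no duplicate points.
    """
    n = len(points)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of points")
    dist = [[math.dist(p, q) for q in points] for p in points]
    neighbors, k_dist = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbors.append(order[:k])
        k_dist.append(dist[i][order[k - 1]])  # distance to the k-th neighbor
    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], dist[i][j]) for j in neighbors[i]]
        lrd.append(k / sum(reach))
    # LOF: mean density of the neighbors relative to the point's own density.
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]


# Toy data: a tight cluster plus one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (5, 5)]
scores = local_outlier_factor(points, k=3)
```

Applied to peptide feature vectors, a query whose score lies far above the bulk of the training scores would fall outside the region where the model's predictions can be expected to be reliable.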
The results of 5-fold cross-validation of the VINIP models trained with the hybrid feature set
Classifier | Feature | Sensitivity (%) | Specificity (%) | Accuracy (%) | B.Accuracy^a (%) | MCC |
---|---|---|---|---|---|---|
SVM | AAC + DPC | 85.09 ± 0.0285 | 87.53 ± 0.0045 | 87.17 ± 0.0077 | 86.31 ± 0.0162 | 0.61 ± 0.0266 |
SVM | N5AAC + C5AAC | 76.73 ± 0.0344 | 83.09 ± 0.0047 | 82.16 ± 0.0053 | 79.91 ± 0.0165 | 0.48 ± 0.0232 |
SVM | AAC + DPC + N5AAC + C5AAC | 82.00 ± 0.0135 | 82.09 ± 0.0090 | 82.08 ± 0.0096 | 82.05 ± 0.0112 | 0.51 ± 0.0210 |
SVM | AAC + DPC + CKSAAP | 85.82 ± 0.0177 | 88.81 ± 0.0110 | 88.37 ± 0.0097 | 87.32 ± 0.0104 | 0.64 ± 0.0227 |
RF | AAC + DPC | 87.82 ± 0.0104 | 84.81 ± 0.0078 | 85.25 ± 0.0065 | 86.32 ± 0.0058 | 0.59 ± 0.0121 |
RF | N5AAC + C5AAC | 80.55 ± 0.0104 | 75.34 ± 0.0139 | 76.11 ± 0.0108 | 77.94 ± 0.0048 | 0.42 ± 0.0104 |
RF | AAC + DPC + N5AAC + C5AAC | 89.45 ± 0.0209 | 84.56 ± 0.0111 | 85.28 ± 0.0109 | 87.01 ± 0.0135 | 0.60 ± 0.0253 |
RF | AAC + DPC + CKSAAP | 87.64 ± 0.0081 | 83.97 ± 0.0078 | 84.51 ± 0.0066 | 85.80 ± 0.0053 | 0.57 ± 0.0118 |
^a B.Accuracy, balanced accuracy. The values represent the mean and standard deviation of all measurements.

Independent testing of the VINIP prediction model
To evaluate the predictive performance of the proposed model, independent testing was performed on an additional dataset that was unseen during training, which consisted of 12 VINIPs and 71 non-VINIPs. As shown in Table 4, the proposed method provides reliable performance, with a sensitivity of 83.33%, a specificity of 84.51%, an accuracy of 84.34% and an MCC value of 0.55. Unfortunately, no other method has been reported for the discovery of novel VINIPs, but a few tools have been developed specifically for the identification of the ARVPs. Therefore, two tools, AVP-IC50Pred [29] and AVPIden [30], were chosen for comparing the predictive power of different models on the independent testing dataset. The AVP-IC50Pred web server provides various models that attempt to predict antiviral activity in terms of the half maximal inhibitory concentration; its SVM and RF models, which give the better performance, were employed for testing in this study. However, whichever AVP-IC50Pred model is used for comparison, the sensitivity is 66.67%, and the specificity is considerably lower, at 0% and 42.25% for the SVM and RF models, respectively. Additionally, AVPIden gives a balanced accuracy of only 50.71%, which indicates an unbalanced performance in detecting the VINIPs and the non-AVPs (see Table 4). In summary, the comparison indicates that the proposed model outperforms the other tools overall and can handle the problem of imbalanced classification between the VINIPs and non-VINIPs.
Table 4. Comparison of the independent testing results among our method and the available prediction tools
Method | Sensitivity (%) | Specificity (%) | Accuracy (%) | B.Accuracy^a (%) | MCC |
---|---|---|---|---|---|
iDVIP | 83.33 | 84.51 | 84.34 | 83.92 | 0.55 |
AVP-IC50Pred (SVM) | 66.67 | 0.00 | 9.64 | 33.34 | -0.55 |
AVP-IC50Pred (RF) | 66.67 | 42.25 | 45.78 | 54.46 | 0.06 |
AVPIden | 100.00 | 1.41 | 15.66 | 50.71 | 0.05 |
^a B.Accuracy, balanced accuracy.
Distinguishing the VINIPs from the ARVPs with the other inhibitory activities
The antiretroviral activity of a few peptides has been well confirmed experimentally, and some studies have reported that multiple mechanisms are involved in inhibiting retroviral infection and proliferation, but without delving deeper into the characteristics of the peptides associated with each mechanism of action. Here, to distinguish the ARVPs with integrase inhibitory (INI) activity from the others, an INI-type classification model was constructed based on another set of data, in which 110 VINIPs and 459 other ARVPs were considered as the positive and negative classes, respectively. The 5-fold cross-validation was also employed to evaluate the performance of the models. As shown in Table 5, the DPC model performs well, with a sensitivity of 84.55%, a specificity of 84.23%, an accuracy of 84.29% and an MCC value of 0.60. Moreover, the model trained with the combination of AAC and DPC features shows a slightly improved performance, with a sensitivity of 85.09%, a specificity of 87.06%, an accuracy of 86.68% and an MCC value of 0.64. The cross-validation result of the INI-type classification model is similar to that of the VINIP model: the models trained with the hybrid feature set give a better performance than those trained with individual features. Furthermore, the result suggests that the sequence-based features are also applicable for distinguishing the VINIPs from the ARVPs with other inhibitory activities.
Table 5. The results of 5-fold cross-validation of the INI-type classification models
Feature set | Sensitivity (%) | Specificity (%) | Accuracy (%) | B.Accuracy^a (%) | MCC |
---|---|---|---|---|---|
AAC | 75.45 ± 0.0257 | 80.00 ± 0.0123 | 79.12 ± 0.0112 | 77.73 ± 0.0144 | 0.47 ± 0.0249 |
N5AAC | 72.73 ± 0.0265 | 77.08 ± 0.0106 | 76.24 ± 0.0111 | 74.90 ± 0.0156 | 0.42 ± 0.0265 |
C5AAC | 76.00 ± 0.0246 | 75.82 ± 0.0159 | 75.85 ± 0.0101 | 75.91 ± 0.0087 | 0.43 ± 0.0139 |
DPC | 84.55 ± 0.0213 | 84.23 ± 0.0147 | 84.29 ± 0.0143 | 84.39 ± 0.0156 | 0.60 ± 0.0307 |
C1SAAP | 84.18 ± 0.0189 | 83.22 ± 0.0193 | 83.41 ± 0.0161 | 83.70 ± 0.0137 | 0.58 ± 0.0291 |
C2SAAP | 83.82 ± 0.0235 | 85.58 ± 0.0050 | 85.24 ± 0.0054 | 84.70 ± 0.0115 | 0.61 ± 0.0177 |
C3SAAP | 84.91 ± 0.0356 | 86.41 ± 0.0063 | 86.12 ± 0.0066 | 85.66 ± 0.0167 | 0.63 ± 0.0242 |
AAC + DPC | 85.09 ± 0.0254 | 87.06 ± 0.0078 | 86.68 ± 0.0073 | 86.07 ± 0.0126 | 0.64 ± 0.0205 |
N5AAC + C5AAC | 79.09 ± 0.0170 | 79.39 ± 0.0107 | 79.33 ± 0.0093 | 79.24 ± 0.0102 | 0.50 ± 0.0184 |
AAC + DPC + N5AAC + C5AAC | 80.18 ± 0.0268 | 80.26 ± 0.0129 | 80.25 ± 0.0088 | 80.22 ± 0.0114 | 0.51 ± 0.0178 |
AAC + DPC + CKSAAP | 84.36 ± 0.0304 | 87.19 ± 0.0060 | 86.64 ± 0.0058 | 85.78 ± 0.0142 | 0.64 ± 0.0206 |
^a B.Accuracy, balanced accuracy. The values represent the mean and standard deviation of all measurements.
Implementation of a web-based tool for identifying the VINIPs
Developing novel VINIP drugs still faces challenges related to the high cost and the time-consuming, labor-intensive process. Therefore, we proposed a computational prediction method for quickly and precisely determining the potential peptides with integrase inhibitory activity. Most importantly, a web-based online tool for the automatic prediction of VINIPs was developed based on the model trained with the hybrid features. When users input peptide sequences in FASTA format, the system automatically reports the result, including not only the prediction probability but also a bar plot of the amino acid composition of the whole peptide. The iDVIP web server is anticipated to promote the development of ART by computationally screening potential VINIPs; researchers then only need to experimentally verify these peptide candidates, thus reducing the cost and execution time.
Materials and method
Data collection and preprocessing
To avoid overfitting, in which the model fits the observed training data too closely, redundant peptide sequences were removed from the positive and negative datasets separately, and only sequences with lengths ranging from 8 to 100 residues were kept. The CD-HIT software package [57] was used to reduce the redundancy of the datasets, with a sequence identity cut-off value of 1.0 to eliminate duplicates. Moreover, peptide sequences incorporating non-natural amino acids were removed from the training and testing datasets. Finally, all the sequences were divided into training and test sets in a ratio of 90% to 10%.
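The filtering steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `preprocess` and `split_train_test` are hypothetical, exact-duplicate removal stands in for CD-HIT at an identity cut-off of 1.0 (which may cluster sequences differently), and the random seed and shuffling strategy are not taken from the study.

```python
import random

# The 20 natural amino acids (one-letter codes).
NATURAL = set("ACDEFGHIKLMNPQRSTVWY")

def preprocess(seqs, min_len=8, max_len=100):
    """Keep unique sequences of 8-100 residues made only of natural residues."""
    seen = set()
    kept = []
    for s in seqs:
        s = s.strip().upper()
        if not (min_len <= len(s) <= max_len):
            continue  # outside the allowed length range
        if not set(s) <= NATURAL:
            continue  # contains a non-natural or ambiguous residue
        if s in seen:
            continue  # exact duplicate (simplified stand-in for CD-HIT at 1.0)
        seen.add(s)
        kept.append(s)
    return kept

def split_train_test(seqs, test_fraction=0.1, seed=0):
    """Shuffle and split into (train, test); the seed is an arbitrary assumption."""
    shuffled = list(seqs)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

In this sketch the positive and negative sets would each be passed through `preprocess` separately before splitting, mirroring the description that redundancy was removed from each dataset in turn.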
Feature investigation
For a given peptide, i denotes each type of amino acid, |${x}_i$| is the number of occurrences of amino acid i and L is the full length of the considered peptide.
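To make the symbols concrete, the following Python sketch computes composition-style features for one peptide. It assumes the usual definitions, which are not spelled out in this section: AAC as the ratio of |${x}_i$| to L, DPC as the frequency of adjacent residue pairs, and CKSAAP as the frequency of residue pairs separated by k positions, with k = 1, 2, 3 chosen to match the C1SAAP to C3SAAP feature names. The function names and the concatenation order in `hybrid` are hypothetical, and the input is assumed to contain only the 20 natural residues.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs

def aac(seq):
    """Amino acid composition: x_i / L for each of the 20 residue types."""
    length = len(seq)
    return [seq.count(a) / length for a in AMINO_ACIDS]

def cksaap(seq, k):
    """Frequencies of residue pairs separated by k positions (k = 0 gives DPC)."""
    n_pairs = len(seq) - k - 1
    counts = dict.fromkeys(PAIRS, 0)
    for i in range(n_pairs):
        counts[seq[i] + seq[i + k + 1]] += 1
    return [counts[p] / n_pairs for p in PAIRS]

def dpc(seq):
    """Dipeptide composition: adjacent residue pairs."""
    return cksaap(seq, 0)

def hybrid(seq, ks=(1, 2, 3)):
    """Concatenate AAC, DPC and CKSAAP into one feature vector (assumed layout)."""
    vec = aac(seq) + dpc(seq)
    for k in ks:
        vec += cksaap(seq, k)
    return vec
```

Under these assumptions the hybrid vector has 20 + 400 + 3 × 400 = 1620 dimensions per peptide.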
Construction of the prediction models
SVM is a supervised machine learning algorithm that has been applied to several biological classification problems. The SVM implementation in LIBSVM [46], a public SVM tool, was used as the classifier in this study, with the radial basis function (RBF) as the kernel function; the softness of the hyperplane was determined by two parameters, the gamma (γ) and the cost (C). In summary, LIBSVM was employed to build the classification models from the feature vectors of the training datasets.
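For readers unfamiliar with the RBF kernel, the short Python function below evaluates it for two feature vectors, using the standard form exp(-γ‖x − y‖²). It is a standalone illustration and does not call LIBSVM; the function name `rbf_kernel` and the example values of γ are assumptions, since the tuned parameter values are not given in this section.

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF kernel K(x, y) = exp(-gamma * ||x - y||^2) for equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# Identical vectors give a similarity of 1; the value decays with distance.
same = rbf_kernel([0.1, 0.2], [0.1, 0.2], gamma=0.5)
far = rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5)
```

A larger γ makes the kernel decay faster with distance, so each training peptide influences a smaller region of feature space, whereas the cost C controls how strongly misclassified training points are penalized.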
Performance evaluation
To address the imbalanced classification problem, balanced accuracy, the arithmetic mean of sensitivity and specificity, was used to evaluate the model performance in this work; it is widely regarded as an appropriate performance metric for imbalanced classification.
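The metrics reported in the tables can be computed from a confusion matrix as in the Python sketch below, which uses the standard definitions of sensitivity, specificity, accuracy, balanced accuracy and MCC. The function name `evaluate` is hypothetical, and the example counts (10 true positives, 2 false negatives, 60 true negatives, 11 false positives) are not reported in the text; they are inferred here from the iDVIP percentages on the independent test set of 12 VINIPs and 71 non-VINIPs.

```python
import math

def evaluate(tp, fn, tn, fp):
    """Return standard binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    balanced_accuracy = (sensitivity + specificity) / 2
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "balanced_accuracy": balanced_accuracy,
        "mcc": mcc,
    }

# Counts inferred (assumption) from the reported iDVIP independent-test results.
m = evaluate(tp=10, fn=2, tn=60, fp=11)
```

With these inferred counts the function reproduces the reported 83.33% sensitivity, 84.51% specificity, 84.34% accuracy, 83.92% balanced accuracy and an MCC of 0.55 after rounding. Because the negative class is roughly six times larger than the positive class, plain accuracy is dominated by specificity, which is why balanced accuracy is the more informative summary here.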
Multivariate outlier detection
The local outlier factor (LOF) algorithm was used to evaluate the degree of deviation in this study. LOF values identify outliers based on local neighborhood information, which has been reported to provide better results than some global approaches in the data classification process [58]. The value was calculated for each peptide in the training dataset of the model, with a larger value indicating a greater likelihood that the peptide is an outlier with respect to its neighborhood.
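As an illustration of how such values arise, here is a compact pure-Python version of the LOF computation using Euclidean distance. It is a simplified sketch rather than the implementation used in the study: the function name `lof_scores` is hypothetical, the neighborhood size k is an assumption (it is not stated in this section), ties at the k-th distance are ignored, and all points are assumed to be distinct.

```python
import math

def lof_scores(points, k):
    """Local outlier factor of each point, using its k nearest neighbours."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    neighbours, k_distance = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbours.append(order[:k])
        k_distance.append(dist[i][order[k - 1]])  # distance to the k-th neighbour

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], dist[i][j]) for j in neighbours[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: mean density of the neighbours divided by the point's own density.
    return [
        sum(lrd[j] for j in neighbours[i]) / (len(neighbours[i]) * lrd[i])
        for i in range(n)
    ]

# Four clustered points and one distant point (toy data, not peptide features).
scores = lof_scores([(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)], k=2)
```

Points inside a uniformly dense cluster score close to 1, while the isolated point scores far above 1, which matches the interpretation that larger values indicate likelier outliers.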
Conclusion
To our knowledge, we have developed the first machine learning model for the identification of peptides with integrase inhibitory activity using hybrid sequence features. It provides researchers with a strategy to identify and characterize novel antiretroviral peptide drugs. In this study, we collected the experimentally validated VINIPs as completely as possible, and the analysis of the features indicates that the sequence characteristics of these peptides can provide an overview of their putative functions, especially in the comparison of the composition of k-spaced amino acid pairs between the VINIPs and the non-VINIPs. Based on the results of the 5-fold cross-validation, the discrimination capability of the investigated features in distinguishing between the VINIPs and non-VINIPs was determined, and the proposed model provides a balanced accuracy of 87.32%. In addition, to clarify the difference in the sequence characteristics between the VINIPs and the other AVPs or ARVPs, feature analysis and model construction were also performed and yielded a good outcome. In conclusion, we present a comprehensible and reliable model for the identification and characterization of the VINIPs, which are excellent candidates for the development of novel ART agents. Ultimately, to facilitate further research and development, iDVIP, an automatic computational tool that predicts the VINIPs, has been developed and is now freely available at http://mer.hc.mmh.org.tw/iDVIP/.
Author’s contributions
S.-L.W. conceived and supervised the project. S.-L.W. and K.-Y.H. were responsible for the project design. K.-Y.H. and H.-J.K. were responsible for computational analysis. H.-J.K. was responsible for web tool development. K.-Y.H., H.-J.K., T.-H.W. and C.-H.C. drafted the manuscript with revisions by S.-L.W. All authors read and approved the final manuscript.
Key Points
We have developed the first computational prediction method for identifying viral integrase inhibitory peptides (VINIPs) based on the machine learning approach. Moreover, we also constructed a classification model able to distinguish the VINIPs from the antiretroviral peptides (ARVPs) with the other inhibitory activities.
To understand the molecular characteristics of VINIPs, we constructed a comprehensive dataset composed of 203 experimentally verified VINIPs and 1010 non-VINIPs from available databases to carry out the comparative sequence analysis. In addition, a total of 603 reviewed ARVPs were also collected and analyzed for distinguishing the VINIPs from the other ARVPs.
According to the results of cross-validation, the model trained with the combination of the sequence-based features can achieve a great predictive ability to identify the VINIPs, which can provide both sensitivity and specificity of greater than 85%. Most importantly, the model also consistently provides effective performance in independent testing.
A web-based online tool for automatic prediction of VINIPs was developed based on the model trained with the hybrid features, named iDVIP. The web server is anticipated to promote the development of ART by computationally screening the potential VINIPs; the researchers just need to make an effort to experimentally verify these peptide candidates, thus reducing the cost and execution time of work.
Availability of data and materials
The tool and the datasets used to establish the iDVIP framework in this study are available at http://mer.hc.mmh.org.tw/iDVIP.
Acknowledgements
The authors would like to thank the members of the Department of Medical Research at the Hsinchu MacKay Memorial Hospital of Taiwan for their advice and guidance.
Funding
Hsinchu MacKay Memorial Hospital of Taiwan (MMH-HB-11108).
Author Biographies
Kai-Yao Huang is a technical director and an assistant research fellow in the Department of Medical Research at Hsinchu MacKay Memorial Hospital, Hsinchu City. He is also an assistant professor in the Department of Medicine at MacKay Medical College, New Taipei City. His research interests include bioinformatics, genomics and proteomics, computational system biology, artificial intelligence, data mining and machine learning.
Hui-Ju Kao is a postdoctoral research fellow in the Department of Medical Research at Hsinchu MacKay Memorial Hospital, Hsinchu City. Her research interests include bioinformatics, computational proteomics, data mining and machine learning.
Tzu-Hsiang Weng is a resident physician at the Department of Obstetrics and Gynecology at MacKay Memorial Hospital, Taipei City. Her research interests include obstetrics/gynecology, infertility and reproductive endocrinology.
Chia-Hung Chen is an assistant research fellow in the Department of Medical Research at Hsinchu MacKay Memorial Hospital, Hsinchu City. His research interests include immune regulation, molecular mechanism research, monoclonal antibody development, liposome development and antibody humanization engineering.
Shun-Long Weng is a superintendent and an attending physician in the Department of Obstetrics and Gynecology at Hsinchu MacKay Memorial Hospital, Hsinchu City. He is also a professor in the Department of Medicine at MacKay Medical College, New Taipei City. His research interests include obstetrics/gynecology, infertility and reproductive endocrinology, computational biology, microbiome analysis and machine learning.