Abstract

Genes are unique in their functional roles and differ in their sensitivities to genetic defects, which makes pathogenicity prediction difficult. This study attempted to improve the performance of existing in silico algorithms and to find a common solution based on an individualization strategy. We initiated the individualization with epilepsy-related SCN1A variants by sub-regional stratification. SCN1A missense variants related to epilepsy were retrieved from mutation databases, and benign missense variants were collected from the ExAC database. Predictions were performed by using 10 traditional tools with stepwise optimizations. Model predictive ability was evaluated using five-fold cross-validation on variants of SCN1A, SCN2A, and KCNQ2. Additional validation was performed on SCN1A variants of damage-confirmed/familial epilepsy. The performance of commonly used predictors was less than satisfactory for SCN1A, with accuracies below 80%, and varied dramatically by functional domain of Nav1.1. Multistep individualized optimizations, including cutoff resetting, domain-based stratification, and combination of predicting algorithms, significantly increased predictive performance. Similar improvements were obtained for variants in SCN2A and KCNQ2. The predictive performance of recently developed ensemble tools, such as Mendelian clinically applicable pathogenicity, combined annotation-dependent depletion and Eigen, was also improved dramatically by applying the strategy with molecular sub-regional stratification. The prediction scores of SCN1A variants showed linear correlations with the degree of functional defects and the severity of clinical phenotypes. This study highlights the need for individualized optimization with molecular sub-regional stratification for each gene in practice.

Introduction

Advances in genetic sequencing technology have led to increased detection of novel sequence variants, but variant interpretation remains a major challenge. A variety of computational methods have been developed to assist the interpretation of variants. These algorithms are derived from different computational classification models based on amino acid characteristics, position-specific phylogenetic information, and statistical learning methods. However, the unique properties of individual gene products, such as the structural and biological complexities of biomolecules and their interactions, might hinder successful prediction. In fact, previous investigations have revealed that in silico prediction tools have variable accuracies in different genes [1–3]. Our previous study demonstrated that the same amino acid substitution at homologous residues of the sodium channels Nav1.1 and Nav1.3 produces distinct functional impairments [4]. This suggests that gene products differ in their sensitivities to mutant damage and that individualized prediction strategies are potentially required. Here, we explored the feasibility of individualizing prediction for each gene, starting with SCN1A, the gene encoding Nav1.1, which is a fundamental mediator of electrical signaling.

In humans, there are nine Nav channels (Nav1.1–Nav1.9), which are characterized by their α subunits, encoded by one of the most ancient and conserved gene families (SCN1A–SCN11A). The α subunit of Navs is composed of four homologous domains (DI–DIV), each of which contains six transmembrane segments (S1–S6). Despite their homologous structure, Navs differ in biophysical properties and expression patterns. In particular, Nav1.1 appears to be the most important molecule associated with epilepsy [5, 6]. In humans, more than 1200 SCN1A mutations have been identified, and nearly 50% of them are missense [7]. The SCN1A mutation-associated epilepsies comprise a spectrum of phenotypes ranging from the extremely severe form, severe myoclonic epilepsy of infancy (SMEI or Dravet syndrome, MIM 607208) [8, 9], to milder forms such as generalized epilepsy with febrile seizures plus (GEFS+, MIM 604233) or pure febrile seizures (FS) [10, 11]. SCN1A mutations have been detected in 60% to 80% of patients with Dravet syndrome and 10% of patients with GEFS+ [12, 13]. SCN2A, which encodes the α subunit of Nav1.2, is also associated with epilepsy, with about 50 mutations identified [14]. In contrast, the roles of SCN3A and SCN8A, which are also expressed in the brain, remain to be confirmed [14, 15]. These findings suggest that sodium channel genes differ in their pathogenic potential. Furthermore, sodium channel genes are unique in having intricate functional domains, which also vary in their sensitivities to genetic impairments [16, 17]. Currently, a vast number of missense variants have been identified in SCN1A, located in different functional domains of Nav1.1. In this study, we evaluated and optimized the performance of various prediction algorithms on SCN1A variants with consideration of their molecular sub-regional implications. Validation was performed on SCN1A variants with experimentally confirmed functional alterations and those from epilepsy families, as well as on variants of SCN2A and the potassium channel gene KCNQ2. We aimed to provide practical tools for predicting the damaging effects of SCN1A variants and strategies for individualizing prediction for genes.

Materials and methods

Variant data sets

Epilepsy driver mutations were retrieved from the SCN1A mutation database (http://www.caae.org.cn/gzneurosci/scn1adatabase/index.php) [7] and the Human Gene Mutation Database (HGMD) [18] to build a positive (pathogenic) test set, which included 596 missense mutations. The negative (benign) set of non-deleterious variants was built from the Exome Aggregation Consortium database (ExAC) [19]. Only variants for which all selected programs could provide valid outputs were used in the analysis. The amino acid sequence and annotation of human Nav1.1 were obtained from the UniProt database (UniProt ID: P35498) and Ensembl (protein ID: ENSP00000303540).

In silico analysis

The damaging effects of the missense variants were predicted by 10 in silico algorithms (SIFT, MASS, PP2-HDIV, PP2-HVAR, PROVEAN, SNAP2, MutPred2, I-Mutant 2.0, FATHMM-W and FATHMM-U), all of which provide online services and free public access. An overview of these algorithms and the criteria used for classification of damaging effect are presented in Supplementary Table S1. To evaluate the possibility of combined application of prediction tools, we selected the top three in silico tools according to their accuracies. A variant was considered to be damaging if at least two outputs were ‘damaging’; when only two outputs were available, a variant was classified as damaging if one output was ‘damaging’, as in a previous study [2].
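The combination rule described above can be sketched as a small voting function. This is an illustrative reading of the rule, not the authors' code: the function name `combine_calls` is hypothetical, and returning `None` when fewer than two tools give an output is an assumption, since the text does not cover that case.

```python
from typing import Optional, Sequence


def combine_calls(calls: Sequence[Optional[bool]]) -> Optional[bool]:
    """Combine per-tool calls from the three selected tools.

    Each call is True (damaging), False (benign) or None (no output).
    """
    available = [c for c in calls if c is not None]
    n_damaging = sum(available)
    if len(available) >= 3:
        # All three outputs present: at least two must be 'damaging'.
        return n_damaging >= 2
    if len(available) == 2:
        # Only two outputs present: one 'damaging' call is sufficient.
        return n_damaging >= 1
    # Assumption: the rule is undefined for fewer than two outputs.
    return None
```

With only two outputs the rule is deliberately more permissive, so it favors sensitivity over specificity in that case.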

Performance assessment and validation

Scoring data generated from each prediction were collected for in-depth analysis. The predictive results were indicated by true positives (TP, correct predictions for pathogenic mutations), true negatives (TN, correct predictions for benign mutations), false positives (FP, incorrect predictions for benign mutations), and false negatives (FN, incorrect predictions for pathogenic mutations). We calculated the overall accuracy (ratio of overall correct predictions to the total number of predictions), sensitivity (true positive rate), specificity (true negative rate), positive predictive value (PPV) and negative predictive value (NPV) for each in silico tool. The F score was defined as $\frac{2\mathrm{P}\mathrm{R}}{\mathrm{P}+\mathrm{R}}$, where precision $\mathrm{P} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}$ and recall $\mathrm{R} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}$. The Matthews correlation coefficient (MCC) was calculated using the equation $\frac{\mathrm{TP} \times \mathrm{TN} - \mathrm{FP} \times \mathrm{FN}}{\sqrt{(\mathrm{TP}+\mathrm{FP})(\mathrm{TP}+\mathrm{FN})(\mathrm{TN}+\mathrm{FP})(\mathrm{TN}+\mathrm{FN})}}$, with a scale of −1 to 1; −1 indicates a completely wrong binary classifier while 1 indicates a completely correct binary classifier.
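As a worked illustration of these definitions, the following minimal sketch computes all of the measures from the four confusion-matrix counts. The function name and the choice to return NaN when a denominator is zero are assumptions made for this example; the study itself does not describe an implementation.

```python
import math


def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def classification_metrics(tp, tn, fp, fn):
    """Compute the performance measures defined in the text from raw counts."""
    precision = _ratio(tp, tp + fp)  # identical to PPV
    recall = _ratio(tp, tp + fn)     # identical to sensitivity
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": recall,
        "specificity": _ratio(tn, tn + fp),
        "ppv": precision,
        "npv": _ratio(tn, tn + fn),
        "f_score": _ratio(2 * precision * recall, precision + recall),
        "mcc": _ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Counts reported for SIFT at its default cutoff in Table 1.
sift = classification_metrics(tp=520, tn=144, fp=161, fn=36)
```

Applied to the SIFT counts in Table 1, this yields an accuracy of 77.1%, an F score of 0.841 and an MCC of 0.479, matching the tabulated values. The MCC denominator becomes zero when a classifier assigns every variant to one class, in which case the sketch returns NaN.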

Model predictive ability was evaluated using k-fold cross-validation (k = 5), as in a previous study [20]. The data were randomly split into five subsets of equal size. Four subsets were combined as the training set and the remaining subset was used as the testing set. Variants in the training set were stratified by their functional domains and used to train the best cutoff of each individual algorithm. The domain-specific classifier was generated by further combination of algorithms, and its prediction performance over the testing set is reported in the results. This procedure was repeated until every subset had been used as the testing set. Therefore, five predicting models were built over five data sets. The models were evaluated by the following criteria: accuracy, sensitivity, specificity and MCC. The model predictive ability was evaluated using variants of SCN1A, SCN2A and KCNQ2, considering that SCN2A and KCNQ2 have homologous functional domains and are associated with epilepsy. Their pathogenic and benign missense variants were retrieved from the HGMD and ExAC databases, respectively.
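The cross-validation procedure can be sketched as follows for a single algorithm. This is a simplified illustration under stated assumptions: higher scores are taken to indicate pathogenicity, the cutoff is chosen per domain by the Youden index, the combination of algorithms is omitted, and the fallback to an unstratified cutoff for a domain absent from a training fold is not described in the study. All function names are hypothetical.

```python
import random
from collections import defaultdict


def youden_cutoff(scores, labels):
    """Cutoff t maximizing sensitivity + specificity - 1 for the rule 'score > t'."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]

    def youden(t):
        sens = sum(s > t for s in pos) / len(pos) if pos else 0.0
        spec = sum(s <= t for s in neg) / len(neg) if neg else 0.0
        return sens + spec - 1.0

    return max(sorted(set(scores)), key=youden)


def cross_validate(variants, k=5, seed=0):
    """variants: list of (domain, score, is_pathogenic). Returns per-fold accuracy."""
    shuffled = list(variants)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # k subsets of near-equal size
    accuracies = []
    for i, test in enumerate(folds):
        train = [v for j, fold in enumerate(folds) if j != i for v in fold]
        # Stratify the training variants by functional domain.
        by_domain = defaultdict(list)
        for domain, score, label in train:
            by_domain[domain].append((score, label))
        cutoffs = {d: youden_cutoff(*zip(*rows)) for d, rows in by_domain.items()}
        # Assumption: unstratified cutoff for a domain unseen in training.
        fallback = youden_cutoff([s for _, s, _ in train], [y for _, _, y in train])
        correct = sum(
            (score > cutoffs.get(domain, fallback)) == label
            for domain, score, label in test
        )
        accuracies.append(correct / len(test))
    return accuracies
```

Because the cutoffs are learned only from the four training subsets, the accuracy on the held-out subset estimates how the domain-specific cutoffs would perform on unseen variants.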

Additionally, 47 SCN1A missense variants with functional alterations confirmed by in vitro electrophysiological recordings and 58 mutations identified in epilepsy families (http://www.caae.org.cn/gzneurosci/scn1adatabase/index.php) were used to validate the optimized prediction strategy. The 25th percentile of SCN1A missense variants (n = 76) with higher minor allele frequency (MAF > 0.000008306) from the ExAC database was set as the benign group for validation. We also used three ensemble methods, Mendelian clinically applicable pathogenicity (M-CAP) [21], combined annotation-dependent depletion (CADD) [22] and Eigen [23], to validate the application of our proposed strategy (Supplementary Table S1).

Statistical analysis

Pearson and Spearman correlation analyses, one-way analysis of variance (ANOVA), post hoc least significant difference (LSD) tests, and receiver operating characteristic (ROC) curve analyses were performed using SPSS statistical software (Version 17.0, IBM Corp.). The ROC analysis was used for comparison of in silico algorithms and identification of the optimum cutoff values that maximized accuracy and the Youden index (sensitivity + specificity − 1) [24]. The optimum cutoff value with the highest Youden index was used to evaluate model predictive ability in the five-fold cross-validation. The area under the ROC curve (AUC) was calculated to evaluate the discriminatory ability of the disease-based prognostic scores.
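The two cutoff criteria and the AUC can be illustrated with a short sketch in plain Python. It is not the SPSS procedure used in the study: the function name is hypothetical, scores are assumed to be oriented so that higher values indicate pathogenicity (a tool such as SIFT, where lower scores are damaging, would need its scores negated first), and the AUC is computed as the probability that a pathogenic variant outscores a benign one, counting ties as one half.

```python
def roc_summary(scores, labels):
    """Return the accuracy-optimal cutoff, the Youden-optimal cutoff and the AUC.

    scores: numeric predictions, higher = more likely pathogenic (assumption).
    labels: True for pathogenic, False for benign.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    best_acc = best_youden = None
    for t in sorted(set(scores)):
        tp = sum(s > t for s in pos)   # pathogenic variants called pathogenic
        tn = sum(s <= t for s in neg)  # benign variants called benign
        accuracy = (tp + tn) / len(scores)
        youden = tp / len(pos) + tn / len(neg) - 1.0
        if best_acc is None or accuracy > best_acc[0]:
            best_acc = (accuracy, t)
        if best_youden is None or youden > best_youden[0]:
            best_youden = (youden, t)
    # AUC as the fraction of (pathogenic, benign) pairs ranked correctly.
    auc = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg) / (len(pos) * len(neg))
    return {
        "accuracy_cutoff": best_acc[1],
        "youden_cutoff": best_youden[1],
        "auc": auc,
    }
```

The two criteria can select different cutoffs when the classes are unbalanced, because accuracy weights the larger class more heavily whereas the Youden index weights sensitivity and specificity equally.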

Results

Overview of SCN1A missense variants

A total of 596 missense variants were retrieved from the SCN1A mutation databases, and 345 variants were retrieved from the ExAC database. Forty variants that overlapped between the two resources were excluded. The remaining 556 variants from the mutation databases did not have MAFs in ExAC, nor in the NHLBI ESP6500 or 1000 Genomes databases, and were therefore taken as the positive test set. We then compared the MAFs of the 40 overlapping variants, assumed to be pathogenic, with those of the other 305 benign missense variants from the ExAC database. No significant difference in MAFs was found between the two groups (3.30 × 10⁻⁴ ± 0.92 × 10⁻⁴, n = 40, versus 9.24 × 10⁻⁴ ± 9.01 × 10⁻⁴, n = 305, P = 0.487, Student's t-test). Pearson correlation analysis did not show a significant correlation between MAF values and the prediction scores of the in silico tools, either (Supplementary Table S2). Therefore, we did not set a MAF cutoff value to classify benign missense variants. The prediction scores of the overlapping variants were intermediate between those of the positive test set and the negative set (Supplementary Figure S1, shown by prediction scores of SNAP2). The numbers of variants in the positive set (n = 556) and negative set (n = 305) and their sub-regional distributions in Nav1.1 functional domains are summarized in Supplementary Table S3.

Performance of in silico tools

The pathogenic classification accuracies of the 10 in silico algorithms for SCN1A variants ranged from 64.1% (I-Mutant2.0) to 80.6% (MutPred2) (Table 1, upper part), using default cutoff values. MutPred2, PP2-HVAR and SNAP2 were the top three predictors in accuracy rate. All algorithms, except FATHMM-U, achieved high sensitivities (> 85%) but low specificity (< 60%). FATHMM-U had a sensitivity and specificity of 63.3% and 79.0%, respectively. In terms of MCC and F-score, which represent the quality of binary classifications, the highest degree of correlation and F-score were observed in MutPred2 (MCC = 0.569, F-score = 0.865), and the next ranked performers were PP2-HVAR (MCC = 0.538, F-score = 0.854) and SNAP2 (MCC = 0.525, F-score = 0.853). According to ROC curves and AUC scores, SNAP2, MutPred2, and PP2-HVAR also performed relatively better than the others (Supplementary Figure S2A and Table S4).

Table 1

Performance of default and optimized algorithms for prediction of SCN1A missense variants

Default cutoff values

| Prediction algorithm | Cutoff value | TP | TN | FP | FN | Accuracy (%) | Sensitivity (%) | Specificity (%) | PPV (%) | NPV (%) | F-score | MCC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SIFT | 0.05 | 520 | 144 | 161 | 36 | 77.1 | 93.5 | 47.2 | 76.4 | 80.0 | 0.841 | 0.479 |
| MASS | 1.935 | 504 | 130 | 175 | 52 | 73.6 | 90.6 | 42.6 | 74.2 | 71.4 | 0.816 | 0.390 |
| PP2-HDIV | 0.5 | 518 | 146 | 159 | 38 | 77.1 | 93.2 | 47.9 | 76.5 | 79.3 | 0.840 | 0.479 |
| PP2-HVAR | 0.5 | 514 | 171 | 134 | 42 | 79.6 | 92.4 | 56.1 | 79.3 | 80.3 | 0.854 | 0.538 |
| PROVEAN | −2.5 | 535 | 127 | 178 | 21 | 76.9 | 96.2 | 41.6 | 75.0 | 85.8 | 0.843 | 0.480 |
| SNAP2 | 0 | 530 | 149 | 156 | 26 | 78.9 | 95.3 | 48.9 | 77.3 | 85.1 | 0.853 | 0.525 |
| MutPred2 | 0.5 | 536 | 158 | 147 | 20 | 80.6 | 96.4 | 51.8 | 78.5 | 88.8 | 0.865 | 0.569 |
| I-Mutant 2.0 | 0 | 494 | 58 | 247 | 62 | 64.1 | 88.8 | 19.0 | 66.7 | 48.3 | 0.762 | 0.109 |
| FATHMM-U | −3.0 | 352 | 241 | 64 | 204 | 68.9 | 63.3 | 79.0 | 84.6 | 54.2 | 0.724 | 0.405 |
| FATHMM-W | −1.5 | 555 | 20 | 285 | 1 | 66.8 | 99.8 | 6.6 | 66.1 | 95.2 | 0.795 | 0.198 |

After optimization of cutoff value for classification accuracy

| Prediction algorithm | Cutoff value | TP | TN | FP | FN | Accuracy (%) | Sensitivity (%) | Specificity (%) | PPV (%) | NPV (%) | F-score | MCC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SIFT | 0.005 | 484 | 221 | 84 | 72 | 81.9 | 87.1 | 72.5 | 85.2 | 75.4 | 0.861 | 0.601 |
| MASS | 2.83 | 429 | 229 | 76 | 127 | 76.4 | 77.2 | 75.1 | 85.0 | 64.3 | 0.809 | 0.507 |
| PP2-HDIV | 0.975 | 477 | 218 | 87 | 79 | 80.7 | 85.8 | 71.5 | 84.6 | 73.4 | 0.852 | 0.576 |
| PP2-HVAR | 0.842 | 487 | 216 | 89 | 69 | 81.6 | 87.6 | 70.8 | 84.5 | 75.8 | 0.860 | 0.594 |
| PROVEAN | −3.423 | 494 | 192 | 113 | 62 | 79.7 | 88.8 | 63.0 | 81.4 | 75.6 | 0.850 | 0.543 |
| SNAP2 | 43 | 494 | 242 | 63 | 62 | 85.5 | 88.8 | 79.3 | 88.7 | 79.6 | 0.888 | 0.682 |
| MutPred2 | 0.746 | 489 | 228 | 77 | 67 | 83.3 | 87.9 | 74.8 | 86.4 | 77.3 | 0.872 | 0.632 |
| I-Mutant 2.0 | 0.37 | 526 | 38 | 267 | 30 | 65.5 | 94.6 | 12.5 | 66.3 | 55.9 | 0.780 | 0.125 |
| FATHMM-U | −1.39 | 486 | 138 | 167 | 70 | 72.5 | 87.4 | 45.2 | 74.4 | 66.3 | 0.804 | 0.365 |
| FATHMM-W | −4.05 | 503 | 186 | 119 | 53 | 80.0 | 90.5 | 61.0 | 80.9 | 77.8 | 0.854 | 0.550 |

The data in bold highlights the best tool for predicting SCN1A missense variants before and after optimization. PPV, positive prediction value; NPV, negative prediction value; MCC, Matthews correlation coefficient.

The 10 algorithms generate scores on a continuous scale, apply a binary cutoff point, and report results as ‘pathogenic’ or ‘benign’. When the data points generated by the algorithms were visualized in scatter plots, the default cutoff values of the algorithms appeared to be set at a relatively low level for this particular sodium channel gene (Figure 1). Therefore, the cutoff value of each algorithm was adjusted by ROC analysis to the level that gave the highest accuracy (Figure 1). The resetting of cutoffs increased the accuracy range of these algorithms from 64.1%–80.6% to 65.5%–85.5% (Table 1, lower part). With the optimization, the best performance was achieved by SNAP2 (MCC = 0.682, F-score = 0.888), followed by MutPred2 (MCC = 0.632, F-score = 0.872) and SIFT (MCC = 0.601, F-score = 0.861).

Figure 1

Box and whisker plots showing the distributions of prediction scores of the ten algorithms for SCN1A missense variants. The data were generated by different tools, including SIFT (A), MASS (B), PP2-HDIV (C), PP2-HVAR (D), PROVEAN (E), SNAP2 (F), MutPred2 (G), I-Mutant 2.0 (H), FATHMM-W (I), and FATHMM-U (J). All pathogenic groups are significantly different from the benign groups in prediction scores (n = 596 for pathogenic variants, n = 305 for benign variants, P < 0.001, Student’s t-test). The solid and dashed lines across the graphs indicate the default and optimized cutoffs (thresholds), respectively. Predictive scores above the cutoff lines indicate pathogenic predictions. Note that most default cutoffs were lower than those optimized by accuracy, indicating that the default predictions have higher false positive rates. Boxes represent the 25th–75th percentiles, and lines in the middle of the boxes are the medians. The colored points represent individual data.

Stratified prediction based on functional domains of Nav1.1

An accuracy of 90% has been proposed as the threshold for classifying variants as likely pathogenic or likely benign in clinical application [25]. The accuracy achieved by cutoff resetting still did not meet this criterion. Considering that Nav1.1 functional domains differ in their sensitivities to variant damage [16, 26], variant position could be a confounding variable and needs to be taken into account. The SCN1A variants were therefore subgrouped according to their positions in Nav1.1 functional domains, i.e. N, C-Terminus, domain linker, S1–S3 transmembrane domain, voltage sensor and pore region, as in our previous studies [16, 17]. The cutoff of each algorithm was then optimized within each subgroup to obtain the highest prediction accuracy.

As shown in Table 2, the algorithms generally performed much better after stratification by functional domain of Nav1.1, and further improvements were obtained by resetting the cutoffs. In terms of accuracy, the best algorithm for SCN1A variants was MASS in the N, C-Terminus domain (82.9%), MutPred2 in the domain linker (87.2%), SNAP2 in both the S1–S3 transmembrane and the voltage sensor domains (85.9% and 92.9%, respectively), and PROVEAN in the pore region (92.5%) (Table 2). MASS, PP2-HVAR, SNAP2 and MutPred2 had accuracies of over 90% in the voltage sensor, while PROVEAN, SNAP2 and MutPred2 had accuracies of over 90% in the pore region. All 10 algorithms were much better at predicting SCN1A variants in the S1–S3, voltage sensor and pore regions, which form the core architecture of Nav1.1, than variants in the other domains (Table 2, Supplementary Figure S2, and Table S4).

Table 2

Optimizing with stratification by functional domains for prediction of SCN1A missense variants

Default cutoff values

| Algorithm | N, C-Terminus: Cutoff | N, C-Terminus: Acc (%) | N, C-Terminus: MCC | Domain linker: Cutoff | Domain linker: Acc (%) | Domain linker: MCC | S1–S3: Cutoff | S1–S3: Acc (%) | S1–S3: MCC | Voltage sensor: Cutoff | Voltage sensor: Acc (%) | Voltage sensor: MCC | Pore region: Cutoff | Pore region: Acc (%) | Pore region: MCC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SIFT | ≤0.05 | 64.2 | 0.350 | ≤0.05 | 60.5 | 0.335 | ≤0.05 | 74.8 | 0.446 | ≤0.05 | 88.9 | 0.254 | ≤0.05 | 88.0 | 0.521 |
| MASS | >1.935 | 63.4 | 0.336 | >1.935 | 44.2 | 0.335 | >1.935 | 79.3 | 0.534 | >1.935 | 85.9 | 0.153 | >1.935 | 87.0 | 0.585 |
| PP2-HDIV | >0.5 | 65.9 | 0.400 | >0.5 | 63.4 | 0.444 | >0.5 | 74.8 | 0.426 | >0.5 | 87.9 | 0.125 | >0.5 | 86.1 | 0.440 |
| PP2-HVAR | >0.5 | 69.9 | 0.458 | >0.5 | 69.2 | 0.463 | >0.5 | 77.0 | 0.480 | >0.5 | 88.9 | 0.254 | >0.5 | 86.7 | 0.482 |
| PROVEAN | ≤−2.5 | 65.9 | 0.400 | ≤−2.5 | 52.9 | 0.414 | ≤−2.5 | 74.8 | 0.446 | ≤−2.5 | 85.9 | 0.065 | ≤−2.5 | 92.2 | 0.674 |
| SNAP2 | >0 | 64.2 | 0.350 | >0 | 62.8 | 0.440 | >0 | 74.8 | 0.430 | >0 | 90.9 | 0.406 | >0 | 90.7 | 0.603 |
| MutPred2 | >0.5 | 72.4 | 0.483 | >0.5 | 69.8 | 0.497 | >0.5 | 70.4 | 0.322 | >0.5 | 89.9 | 0.286 | >0.5 | 90.1 | 0.550 |
| I-Mutant 2.0 | <0 | 56.9 | 0.220 | <0 | 33.1 | 0.178 | <0 | 60.0 | −0.033 | <0 | 76.8 | 0.019 | <0 | 80.7 | 0.100 |
| FATHMM-U | <−3.0 | 69.1 | 0.382 | <−3.0 | 69.2 | 0.355 | <−3.0 | 79.3 | 0.596 | <−3.0 | 65.7 | 0.241 | <−3.0 | 65.4 | 0.351 |
| FATHMM-W | <−1.5 | 48.8 | NA | <−1.5 | 33.1 | 0.252 | <−1.5 | 64.4 | NA | <−1.5 | 88.9 | NA | <−1.5 | 85.2 | NA |

After optimization of cutoff value with stratification

| Algorithm | N, C-Terminus: Cutoff | N, C-Terminus: Acc (%) | N, C-Terminus: MCC | Domain linker: Cutoff | Domain linker: Acc (%) | Domain linker: MCC | S1–S3: Cutoff | S1–S3: Acc (%) | S1–S3: MCC | Voltage sensor: Cutoff | Voltage sensor: Acc (%) | Voltage sensor: MCC | Pore region: Cutoff | Pore region: Acc (%) | Pore region: MCC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SIFT | =0 | 72.4 | 0.455 | =0 | 77.3 | 0.384 | <0.02 | 78.5 | 0.418 | ≤0.02 | 89.9 | 0.404 | ≤0.08 | 89.8 | 0.573 |
| MASS | >3.565 | 82.9 | 0.677 | >3.865 | 84.9 | 0.506 | >2.91 | 83.7 | 0.648 | >0.49 | 90.9 | 0.406 | >1.6 | 89.2 | 0.606 |
| PP2-HDIV | >0.922 | 73.2 | 0.504 | >0.997 | 80.8 | 0.469 | >0.890 | 76.3 | 0.462 | >0.872 | 89.9 | 0.359 | >0.77 | 87.3 | 0.531 |
| PP2-HVAR | >0.963 | 78.0 | 0.562 | >0.947 | 82.0 | 0.533 | >0.474 | 78.5 | 0.516 | >0.67 | 90.9 | 0.449 | >0.823 | 87.3 | 0.574 |
| PROVEAN | <−2.035 | 66.7 | 0.426 | ≤−6.103 | 79.7 | 0.326 | <−3.414 | 81.5 | 0.586 | ≤−1.891 | 88.8 | NA | ≤−2.197 | 92.5 | 0.685 |
| SNAP2 | >53 | 79.7 | 0.601 | >57 | 86.6 | 0.589 | >62 | 85.9 | 0.702 | >32 | 92.9 | 0.580 | >−11 | 91.0 | 0.604 |
| MutPred2 | >0.720 | 75.6 | 0.512 | >0.805 | 87.2 | 0.600 | >0.859 | 85.2 | 0.677 | >0.700 | 90.9 | 0.417 | >0.448 | 90.7 | 0.574 |
| I-Mutant 2.0 | <−0.37 | 63.4 | 0.312 | <−3.39 | 77.9 | NA | <2.8 | 64.4 | NA | <2.30 | 88.9 | NA | <2.04 | 85.2 | NA |
| FATHMM-U | <−2.32 | 70.7 | 0.420 | <−5.73 | 79.1 | 0.215 | <−2.03 | 83.7 | 0.637 | <−3.93 | 88.9 | NA | <4.92 | 85.2 | NA |
| FATHMM-W | <−4.19 | 73.2 | 0.474 | <−3.84 | 84.9 | 0.561 | <−4.41 | 65.2 | 0.203 | <−4.13 | 88.9 | NA | <−3.99 | 87.7 | 0.440 |

Results in total after multiple optimizations

Multiple-step strategy: Acc = 90.5%, sensitivity = 92.6%, specificity = 86.6%, MCC = 0.792 (n = 861)

The data in bold highlights the best tool for prediction. Acc, accuracy; MCC, Matthews correlation coefficient; NA, not applicable.

Combination of in silico tools

To improve the prediction performance, we further applied a combination strategy. The cutoff shifting of the top three tools for each Nav1.1 domain is shown in Supplementary Figure S3. As shown in Supplementary Table S5, the combination of MASS, SNAP2 and PP2-HVAR achieved the best accuracy (86.2%) in the N, C-Terminus subgroup, which was better than that of the best single tool, MASS (82.9%). For the domain linkers, combined use of MutPred2, SNAP2 and FATHMM-W exhibited the best performance, improving the accuracy from 87.2% to 89.5%. For the S1–S3 domain, SNAP2 combined with MutPred2 and MASS raised the accuracy from 85.9% to 88.9%. Combination of the algorithms did not significantly improve prediction for variants in the voltage sensor and the pore region.

Ultimately, the predictive accuracy of SCN1A variants ranged from 86.2% to 92.9% for different functional domains, with a general accuracy of 90.5% (sensitivity = 92.6%, specificity = 86.6%) after multiple steps of optimization (Table 2). We then proposed a stepwise prediction strategy for SCN1A variants through the domain-oriented tool selection and cutoff resetting (Figure 2).
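The domain-oriented tool selection can be sketched as a lookup followed by a vote. The tool sets follow the combinations named in the text, and the cutoffs are copied from the optimized per-domain values in Table 2. Whether the combined classifiers reuse exactly these single-tool cutoffs is an assumption of this sketch, since Supplementary Figure S3 refers to cutoff shifting; the function and dictionary names are hypothetical.

```python
import operator

# Domain -> list of (tool, comparison, cutoff). A tool calls a variant
# 'damaging' when comparison(score, cutoff) is true. Cutoffs are taken from
# the optimized per-domain values in Table 2 (assumed to apply in combination).
DOMAIN_RULES = {
    "N, C-Terminus": [
        ("MASS", operator.gt, 3.565),
        ("SNAP2", operator.gt, 53),
        ("PP2-HVAR", operator.gt, 0.963),
    ],
    "Domain linker": [
        ("MutPred2", operator.gt, 0.805),
        ("SNAP2", operator.gt, 57),
        ("FATHMM-W", operator.lt, -3.84),
    ],
    "S1-S3": [
        ("SNAP2", operator.gt, 62),
        ("MutPred2", operator.gt, 0.859),
        ("MASS", operator.gt, 2.91),
    ],
    # Combination gave no significant gain here, so the best single tool is used.
    "Voltage sensor": [("SNAP2", operator.gt, 32)],
    "Pore region": [("PROVEAN", operator.le, -2.197)],
}


def predict_damaging(domain, scores):
    """scores: dict mapping tool name to its raw score for one variant."""
    calls = [compare(scores[tool], cutoff) for tool, compare, cutoff in DOMAIN_RULES[domain]]
    if len(calls) == 1:
        return calls[0]
    # Three tools: damaging when at least two of them agree.
    return sum(calls) >= 2
```

A variant is first assigned to its Nav1.1 functional domain, and only the scores of the tools selected for that domain are needed for the call.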

Figure 2

Step-by-step optimizations for predicting the damaging effect of SCN1A missense variants. Acc, accuracy.

Validation of the proposed prediction strategy

To test the effectiveness of the proposed prediction strategy, we performed five-fold cross-validation on the SCN1A data set. As shown in Table 3, the stepwise prediction strategy for SCN1A achieved high predictive performance when averaged over the five rounds: accuracy (90.2% ± 2.1%), sensitivity (93.5% ± 3.3%), specificity (84.4% ± 4.5%), and MCC (0.788 ± 0.045). The model trained on all variants of the SCN1A data set, with an accuracy of 90.5% and an MCC of 0.792, was chosen as the final classifier. In addition, two genes homologous to SCN1A, SCN2A and KCNQ2, were selected for further performance validation, which also helped to explore the applicability of the proposed strategy to other genes. For SCN2A, 49 likely pathogenic variants from HGMD and 283 likely benign variants from ExAC were retrieved. Similarly, 70 likely pathogenic variants and 172 likely benign variants were collected for KCNQ2. SCN2A and KCNQ2 likewise showed high performance measures in five-fold cross-validation (Table 3).
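
The cross-validation summary can be sketched as follows. The code is a minimal illustration, not the authors' pipeline: `stratified_folds` is a hypothetical helper that splits variants into five folds while preserving the pathogenic/benign ratio, and `summarize` computes the mean and sample standard deviation of the per-fold SCN1A accuracies reported in Table 3. Whether the published "±" values are sample or population standard deviations is not stated, so the sample form is an assumption.

```python
import random
import statistics


def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds, keeping class proportions similar."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin into the folds.
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds


def summarize(values):
    """Return (mean, sample standard deviation) of per-fold measures."""
    return statistics.mean(values), statistics.stdev(values)


# Per-fold SCN1A accuracies (%) as reported in Table 3.
scn1a_acc = [90.9, 88.1, 91.5, 88.1, 92.6]
mean_acc, sd_acc = summarize(scn1a_acc)
print(f"accuracy = {mean_acc:.1f} +/- {sd_acc:.1f}")

# Toy labels (1 = pathogenic, 0 = benign) to show the fold composition.
toy_labels = [1] * 10 + [0] * 20
toy_folds = stratified_folds(toy_labels, k=5)
```

The mean reproduces the reported 90.2%; the standard deviation comes out close to the reported 2.1%, and the small difference is presumably due to rounding of the per-fold values.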

Table 3

Five-fold cross-validation of the model's predictive ability on the SCN1A, SCN2A, and KCNQ2 data sets

         SCN1A data set (n = 861)                              SCN2A data set (n = 332)                              KCNQ2 data set (n = 242)
Test     Acc (%)     Sens (%)     Spec (%)    MCC              Acc (%)     Sens (%)     Spec (%)    MCC              Acc (%)     Sens (%)     Spec (%)    MCC
1        90.9        91.2         90.5        0.806            88.2        88.0         88.3        0.734            92.6        95.2         91.5        0.838
2        88.1        92.0         81.0        0.738            83.5        72.0         88.3        0.603            95.6        85.7         100.0       0.898
3        91.5        98.2         79.4        0.815            91.8        92.0         91.7        0.811            94.1        95.2         93.6        0.868
4        88.1        90.3         84.1        0.741            85.9        68.0         93.3        0.648            94.1        85.7         97.9        0.861
5        92.6        95.6         87.3        0.838            87.1        88.0         86.7        0.712            92.6        95.2         91.5        0.838
Average  90.2 ± 2.1  93.5 ± 3.3   84.4 ± 4.5  0.788 ± 0.045    87.3 ± 3.0  81.6 ± 10.8  89.7 ± 2.7  0.702 ± 0.080    93.8 ± 1.2  91.4 ± 5.2   94.9 ± 3.9  0.861 ± 0.024

Acc, accuracy; Sens, sensitivity; Spec, specificity; MCC, Matthews correlation coefficient.

Additional validation was performed on SCN1A variants with confirmed functional damage and on variants from familial epilepsy. Forty-seven variants supported by in vitro functional studies and 58 variants associated with familial epilepsy were used as pathogenic sets, while the quartile of SCN1A missense variants with the highest MAFs in the ExAC database (n = 76) was used as the benign data set. Using our stepwise strategy, 44 of the 47 pathogenic variants supported by electrophysiological measurements were predicted to be pathogenic, with higher accuracy (90.2%) and MCC (0.802) than any single algorithm or ensemble method (Table 4). The three exceptions were T808S, T1174S, and R1575C, which are located outside the voltage sensor and pore region (at DIIS2, the DII–DIII linker and DIVS2, respectively). For the variants from familial epilepsy, the strategy was also superior to any single algorithm or ensemble method, with an accuracy of 85.8% (Table 4).
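
The reported measures can be checked from the underlying confusion counts. The sketch below is an illustration under one stated assumption: the number of correctly classified benign variants (67 of 76) is inferred here from the reported specificity of 88.2% and is not given in the text. The true-positive counts (44 of 47 in vitro, and 48 of 58 familial, the latter inferred from the reported sensitivity of 82.8%) follow the same logic. The function name `binary_metrics` is hypothetical.

```python
from math import sqrt


def binary_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, specificity (all in %) and MCC from counts."""
    acc = 100.0 * (tp + tn) / (tp + fn + tn + fp)
    sens = 100.0 * tp / (tp + fn)
    spec = 100.0 * tn / (tn + fp)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else float("nan")
    return acc, sens, spec, mcc


# In vitro set: 47 pathogenic (44 detected), 76 benign (67 assumed correct).
in_vitro = binary_metrics(tp=44, fn=3, tn=67, fp=9)

# Familial set: 58 pathogenic (48 assumed detected), same 76 benign variants.
familial = binary_metrics(tp=48, fn=10, tn=67, fp=9)

for name, (acc, sens, spec, mcc) in (("in vitro", in_vitro), ("familial", familial)):
    print(f"{name}: Acc={acc:.1f} Sens={sens:.1f} Spec={spec:.1f} MCC={mcc:.3f}")
```

Under these inferred counts, the computed values agree with the "Proposed strategy" entries for the two SCN1A validation sets in Table 4.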

Table 4

Validation of the proposed strategy in SCN1A variants with functional confirmation and from familial epilepsy, and in variants of SCN2A and KCNQ2

                       In vitro test data set (n = 123)      Familial data set (n = 134)           SCN2A data set (n = 332)              KCNQ2 data set (n = 242)
Algorithms             Acc (%)  Sens (%)  Spec (%)  MCC      Acc (%)  Sens (%)  Spec (%)  MCC      Acc (%)  Sens (%)  Spec (%)  MCC      Acc (%)  Sens (%)  Spec (%)  MCC
SIFT                   65.0     95.7      46.1      0.443    65.7     91.4      46.1      0.405    75.9     63.3      78.1      0.327    59.1     81.4      50.0      0.290
MASS                   61.8     89.4      44.7      0.356    63.4     87.9      44.7      0.351    57.5     83.7      53.0      0.260    61.6     82.9      52.9      0.328
PP2-HDIV               65.0     95.7      46.1      0.443    66.4     93.1      46.1      0.427    54.8     91.8      48.4      0.289    52.1     92.9      35.5      0.288
PP2-HVAR               69.9     95.7      53.9      0.506    70.9     93.1      53.9      0.494    56.0     91.8      49.8      0.298    61.1     91.4      48.8      0.378
PROVEAN                61.0     95.7      39.5      0.390    63.4     94.8      39.5      0.394    55.4     89.8      49.5      0.281    76.0     92.9      69.2      0.563
SNAP2                  65.9     95.7      47.4      0.453    64.9     87.9      47.4      0.375    63.6     67.3      62.9      0.218    66.9     91.4      57.0      0.443
MutPred2               65.9     95.7      47.4      0.453    67.2     93.1      47.4      0.438    59.6     93.9      53.7      0.338    76.9     95.7      69.2      0.589
I-Mutant 2.0           43.1     85.1      17.1      0.029    47.8     87.9      17.1      0.070    25.0     79.6      15.5      −0.05    39.3     80.0      22.7      0.029
FATHMM-U               75.6     70.2      78.9      0.487    70.1     58.6      78.9      0.385    80.4     51.0      85.5      0.325    69.4     52.9      76.2      0.282
FATHMM-W               42.3     100.0     6.6       0.162    46.3     98.3      6.6       0.116    15.1     100.0     35.3      0.020    40.9     98.6      17.4      0.217
CADD                   56.1     97.9      30.3      0.345    59.1     98.2      30.3      0.365    50.6     93.9      43.1      0.271    49.6     95.7      30.8      0.285
CADD-Opt(a)            71.5     95.7      56.6      0.528    67.4     82.1      56.6      0.390    85.2     0         100.0     NA       74.0     24.3      94.2      0.266
CADD-Opt(b)            77.2     93.6      67.1      0.595    74.2     83.9      67.1      0.507    88.9     36.7      97.9      0.474    86.4     57.1      98.3      0.657
Eigen                  51.2     97.9      22.4      0.278    52.3     92.9      22.4      0.206    50.9     95.9      43.1      0.286    52.5     90.0      37.2      0.271
Eigen-Opt(a)           74.8     87.2      67.1      0.529    72.7     80.4      67.1      0.470    86.7     10.2      100.0     0.297    76.9     34.3      94.2      0.371
Eigen-Opt(b)           78.0     89.4      71.1      0.588    79.5     82.3      71.1      0.531    90.1     53.1      96.4      0.565    88.8     72.9      95.3      0.720
M-CAP                  40.2     100.0     2.7       0.102    43.8     100.0     2.7       0.107    22.1     98.0      9.0       0.09     29.8     100.0     59.5      0.042
M-CAP-Opt(a)           83.6     95.7      76.0      0.699    80.0     85.5      76.0      0.607    90.5     55.1      96.8      0.614    81.5     71.4      85.7      0.592
M-CAP-Opt(b)           86.1     93.6      81.3      0.730    82.3     83.6      81.3      0.644    93.0     69.4      97.1      0.732    90.8     71.4      98.8      0.774
Proposed strategy(c)   90.2     93.6      88.2      0.802    85.8     82.8      88.2      0.711    86.4     77.6      88.0      0.564    84.7     60.0      94.8      0.609
Proposed strategy(d)                                                                               87.3     81.6      89.7      0.702    93.8     91.4      94.9      0.861

The data in bold highlights the best single tool for prediction.

(a) Individualized optimization of cutoff value for the SCN1A, SCN2A and KCNQ2 data sets. (b) Individualized optimization of cutoff value with sub-regional stratification for the SCN1A, SCN2A, and KCNQ2 data sets. (c) Performance of the model trained using the SCN1A data set. (d) Performance of self-trained models using the SCN2A and KCNQ2 data sets, respectively.

To explore the extensibility and reliability of the strategy for genes with similar function and homologous functional domains, we tested the model trained on the SCN1A data set on variants in SCN2A and KCNQ2. This model outperformed the traditional tools and ensemble methods, with accuracies of 86.4% and 84.7% for SCN2A and KCNQ2, respectively (Table 4). However, better performance was achieved with the self-trained models of SCN2A and KCNQ2 (Table 3), for which the accuracies were 87.3% and 93.8%, respectively. Marked improvement was observed in predictions for variants in the N, C-terminus and domain linkers of SCN2A and for variants in the pore regions of KCNQ2 (Supplementary Tables S6 and S7).

We also evaluated the recently developed ensemble tools M-CAP, CADD and Eigen on the same data sets for comparison with our strategy. When these tools were used conventionally, without individualization, their accuracy for variants in these channel genes ranged from 22.1% to 43.8% for M-CAP, 49.6% to 59.1% for CADD and 50.9% to 52.5% for Eigen (Table 4). The tools generally showed high sensitivities but low specificities. Remarkable improvements in predictive performance were observed when they were individualized for each gene, and sub-regional stratification brought further improvement (Table 4), indicating the applicability of the individualized strategy.
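
The effect of low specificity on overall accuracy can be made explicit: accuracy is the average of sensitivity and specificity weighted by the numbers of pathogenic and benign variants. The sketch below is a worked illustration using the CADD values from Table 4 and the data set sizes given in the text; it assumes that every variant in each set received a score, which the source does not state. The function name `accuracy_from_rates` is hypothetical.

```python
def accuracy_from_rates(sens_pct, spec_pct, n_pathogenic, n_benign):
    """Overall accuracy (%) implied by sensitivity, specificity and class sizes."""
    correct = sens_pct / 100.0 * n_pathogenic + spec_pct / 100.0 * n_benign
    return 100.0 * correct / (n_pathogenic + n_benign)


# CADD without individualization (sensitivity, specificity from Table 4).
cadd_in_vitro = accuracy_from_rates(97.9, 30.3, n_pathogenic=47, n_benign=76)
cadd_scn2a = accuracy_from_rates(93.9, 43.1, n_pathogenic=49, n_benign=283)
cadd_kcnq2 = accuracy_from_rates(95.7, 30.8, n_pathogenic=70, n_benign=172)

print(f"{cadd_in_vitro:.1f} {cadd_scn2a:.1f} {cadd_kcnq2:.1f}")
```

Because benign variants outnumber pathogenic ones in all three sets, a specificity near 30–40% holds the accuracy close to 50% even when sensitivity exceeds 90%.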

Possible prediction of functional alterations and phenotype severity

Missense mutations in SCN1A result in diverse functional alterations that contribute to the pathogenesis of epilepsy [27–30]. These alterations have been categorized as complete loss of function (LOF), partial LOF (pLOF), gain of function (GOF), and gain-and-loss of function (G-LOF), according to the altered current and channel kinetics [7, 16]. We therefore explored the possible link between in silico prediction and the functional outcome of SCN1A missense mutations. The 47 variants with functional data from electrophysiological measurements were subgrouped according to their functional alteration. Prediction scores of the 47 pathogenic variants and 305 benign variants were obtained using SNAP2, which had the highest accuracy, MCC, and AUC after optimization. The mean prediction score of the mutations differed significantly from that of the benign variants (P < 0.05, one-way ANOVA). However, the post hoc LSD test did not show differences in prediction scores between the four functional subgroups. We then ranked the subgroups by their theoretical impairment of channel function as GOF (associated with milder phenotypes), G-LOF, pLOF and LOF (associated with severe phenotypes) [7]. Spearman correlation analysis showed a weak positive correlation (r = 0.408, P = 0.004, n = 47) between SNAP2 prediction scores and the ranked channel function alterations (Figure 3A).
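
For readers who want to reproduce this kind of analysis, the following is a minimal sketch of Spearman's rank correlation (Pearson correlation of average ranks, with ties sharing the mean rank). Note that Spearman's coefficient measures monotonic association between ranks. The rank coding GOF = 1, G-LOF = 2, pLOF = 3, LOF = 4 follows the ordering described above; the scores in the example are made up for illustration and are not data from the study.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical example: functional rank (1 = GOF ... 4 = LOF) vs. made-up scores.
functional_rank = [1, 1, 2, 2, 3, 3, 4, 4]
toy_scores = [35, 60, 48, 72, 55, 80, 77, 91]
rho = spearman(functional_rank, toy_scores)
print(f"rho = {rho:.3f}")
```

A significance test (the P value reported in the text) would require an additional step, for example a permutation test or a statistics library, which is omitted here.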

Figure 3

Potential associations between predictive score and functional alteration or phenotypic severity. SCN1A variants with confirmed functional alteration (A) or documented phenotype (B) were analyzed. Predictive scores were generated by the representative tool SNAP2. According to Spearman's rank correlation analysis, the predictive scores correlated positively with the ranks of functional alteration (r = 0.408, P = 0.004) and phenotypic severity (r = 0.670, P < 0.001).

Missense mutations in the SCN1A gene are associated with a spectrum of epilepsy-related disorders. However, a clear genotype-phenotype association has not been established. Here we explored the possible correlation between phenotype severity and genetic variants in a quantitative model of in silico prediction. A total of 541 mutations were divided into three groups according to the associated phenotypes, which were ranked by clinical severity [7]. FS and GEFS+ were assigned to the mild group, while partial epilepsy and partial epilepsy with febrile seizures plus were assigned to the moderate group. The severe group included Dravet syndrome and other forms of epileptic encephalopathy. With the SNAP2 algorithm, the mean prediction scores of epilepsy-associated variants differed significantly from those of benign variants (P < 0.05, one-way ANOVA), and there was a strong positive correlation between in silico prediction scores and phenotypic severity (r = 0.670, P < 0.001, n = 846, Spearman's rank correlation) (Figure 3B). The post hoc LSD test indicated that the prediction scores of both the mild and moderate subgroups differed significantly from those of the severe group (P < 0.05).

Discussion

In this study, we evaluated the predictive efficacy of 10 commonly used in silico algorithms for SCN1A variants. The performance of these algorithms for the specific gene SCN1A was less satisfactory than the general performance previously claimed for all genes. Retraining with stepwise approaches, including cutoff resetting, domain-based stratification and combination of predicting tools, increased the accuracy rates and MCC scores. The individualizing strategy with sub-regional stratification also improved the performance of ensemble algorithms such as M-CAP, CADD and Eigen, which was validated in two completely independent test sets, SCN2A and KCNQ2. This study highlights the need for individualized optimization of each gene with molecular sub-regional stratification in practice.

To date, many prediction tools have been developed, most of them for application to all variants throughout the genome. Training on all variants, irrespective of differences between genes and of functional complexity, may lead to underfitting: the pooled data lose the detail needed to capture the structure–function relation of a specific gene. In this respect, these all-purpose models may be too simple to fit variants in a specific gene well. When used conventionally without optimization, the algorithms generated relatively high error rates, with low specificities and low MCC scores, for SCN1A missense variants. None of the algorithms reached the 90% accuracy for pathogenicity classification of SCN1A variants that was proposed by the American College of Medical Genetics and Genomics (ACMG) [25]. The results are consistent with a previous study that showed high error rates in classifications derived in silico [30]. Thus, off-the-shelf application of these tools is not sufficient for predicting the damaging effect of variants in a specific gene such as SCN1A.

New strategies and optimized ensemble algorithms such as M-CAP have been developed. M-CAP combines classical tools (including SIFT, PolyPhen-2 and CADD) in a novel model and outperforms the single algorithms, with high sensitivity [20]. However, when used conventionally without individualized optimization, M-CAP showed low specificity, with an accuracy of only 40.2% for the SCN1A variants supported by in vitro functional studies and 43.8% for variants associated with familial epilepsy (Table 4). Previously, adjusting the threshold of M-CAP yielded better performance for missense variants of sodium channel (SCN) genes [30]. Further improvement was obtained with an individualized predictor called the SCN index, which was based on a weighted combination of M-CAP and several classical algorithms using logistic regression. However, it appeared to be less efficient for variants in KCNQ2 and KCNQ3, with an accuracy of less than 80% [30]. These findings suggest the need for individualized optimization.

How to optimize the predicting tools is still a major challenge in practice. Generally, proteins consist of several structural domains that are associated with distinct functional roles and potentially have different sensitivities to genetic defects. Our previous study indicated that the functional domains of SCN1A are associated with distinct functional impairments [16], which are related to clinical phenotype [7, 26]. In this study, prediction of SCN1A variants was improved by individualizing the cutoffs (Table 1). Further improvement was achieved when the prediction was individualized according to the locations of variants in functional domains (Table 2). Notably, the prediction accuracy was particularly high for variants in some regions (Table 2, Supplementary Tables S6 and S7). In the validation data set of functionally characterized SCN1A variants, the actual prediction accuracy was 100% for variants in the voltage sensor and pore regions. Similar results were obtained for variants in KCNQ2 (Supplementary Table S7). These results suggest that stratification by functional domains should be a critical consideration in predicting the damaging effect of genetic variants.

In this study, we used the prediction scores as a mathematical description of genetic variants and explored the relationship between prediction scores and the functional defects and phenotypic severity of SCN1A variants. Spearman correlation analysis showed positive correlations between the SNAP2 prediction scores and the ranked functional defects and phenotype severity. This study also identified 40 SCN1A variants that overlapped between the mutation databases and the SNP database. These overlapping variants did not differ significantly from benign missense variants in MAF, and their prediction scores were intermediate (Supplementary Figure S1). Therefore, it appears appropriate to set an intermediate ‘possible pathogenic’ category for such variants in practical use of the prediction algorithms.

In practical application, several additional points should be considered.

First, how can the proposed strategy be generalized to other genes? The strategy is a process for calibrating underfitting algorithms. Individualized calibration with molecular sub-regional stratification clearly improved the performance of existing in silico algorithms, suggesting that the approach is generally applicable. However, how to individualize for each gene is a major task. The present study demonstrated that individualization with stratification by molecular sub-regions, such as the pore region, voltage sensor and domain linkers, was applicable to SCN1A, SCN2A and KCNQ2, which encode voltage-gated ion channels and share similar functional sub-regions. However, genes differ in their structure–function relations. Our recent studies showed that several genes, such as ARHGEF9 [31], DEPDC5 and GRIN1 (unpublished data), have completely different molecular sub-regions with distinct pathogenic significance. Therefore, individualization with stratification by molecular sub-regions is likely to be a general principle, but determining the pathogenically significant sub-regions of each gene requires further detailed work.

On the other hand, this approach needs training data of variants that are known to be disease-causing or benign and thus may be difficult to apply to uncharacterized genes. The present study demonstrated that the model trained on the SCN1A data set outperformed the traditional and ensemble methods for variants of SCN2A and KCNQ2 (Table 4), suggesting that a model trained on a gene with known disease-causing and benign variants may be applicable to genes with similar function and homologous functional domains. However, it should be noted that better performance was achieved with the self-trained models for SCN2A and KCNQ2 variants. Thus, we propose first defining the sub-molecular pathogenic implications of each gene and then training the model accordingly. When the data set of known disease-causing or benign variants is insufficient, a model trained on data from a gene with similar function and homologous functional domains may be tried.

Second, the balance between sensitivity and specificity is a practical issue. This study showed that several algorithms, such as M-CAP, have advantages in sensitivity and should be given priority in variant filtering. In evaluating the pathogenicity of variants, however, accuracy should be emphasized, as suggested by ACMG [25].

Third, the performance measures may be influenced by the selection of variants used to build the pathogenic and benign test sets. The effects of genetic variants vary and potentially follow a continuous distribution, with overlap between pathogenic and benign variants, as shown by the distribution of the prediction scores (Figure 1). Using a clear-cut indicator, such as MAF, to select variants at the two extremes for the pathogenic and benign test sets may produce a ‘false’ high accuracy for a prediction model [30]. Similarly, a clear-cut damaging/benign prediction appeared to be difficult for some variants in practical application, and an intermediate ‘possible pathogenic’ set appeared appropriate.

Fourth, it is not known exactly why the current predicting tools perform so differently. These tools employ evolutionary, sequence or structural information to characterize residue substitutions, and predictions are obtained via mathematical, rule-based, or statistical learning methods [1, 32]. The present study demonstrated that MutPred2, PP2-HVAR and SNAP2, which are sequence–structure based [33–35], achieved relatively higher discriminative power for SCN1A variants. PROVEAN is mainly based on sequence information [36] and showed higher accuracy for variants in the pore regions of SCN1A, which may be explained in part by the high sequence conservation of this region. Our study emphasizes applicable strategies for calibrating the models with molecular sub-regional stratification while continuing to use the in silico tools currently available. However, it will be necessary to develop sensitive algorithms for specific functional domains, and the related individualized software, in the future.

In conclusion, since each gene is unique in functional role, function–structure relation and sensitivity to variation, individualized optimization of prediction for variants in a specific gene is required. We provided an individualized strategy with molecular sub-regional stratification for SCN1A, which was also applicable to the voltage-gated channel genes SCN2A and KCNQ2. The principle of sub-regional stratification is potentially applicable to genes with similar significant sub-regions. Further detailed work is required to determine the pathogenically significant sub-regions of each gene, which would help to develop individualized software for variant prediction in the future.

Key Points

  • Individualized optimization with molecular sub-regional stratification increased the predictive performance of current algorithms for variants in SCN1A.

  • The proposed strategy was applicable to voltage-gated channel genes SCN2A and KCNQ2, suggesting that the principle of sub-regional stratification is potentially applicable to genes with similar pathogenically significant sub-regions.

  • Prediction scores correlated positively with the degree of functional defects and the severity of clinical phenotypes and could serve as a mathematical description of genotypes.

Funding

This work was supported by the project ‘Omics-based precision medicine of epilepsy’ entrusted by the Key Research Project of the Ministry of Science and Technology of China (grant 2016YFC0904400); the National Natural Science Foundation of China (grants 81501125, 81571273 and 81571274); the Internet Medical Innovation Platform Project of the Health Commission of Guangdong Province (biomedical large database of genetic pathogenicity analysis); and the Science and Technology Project of Guangzhou (grants 201804020046, 201508020011, 201604020161 and 201607010002). The funders had no role in study design, data collection, data analysis, or the decision to prepare or publish the manuscript.

Bin Tang is a research fellow at the Institute of Neuroscience and the Second Affiliated Hospital of Guangzhou Medical University.

Bin Li is an MD candidate at the Institute of Neuroscience and the Second Affiliated Hospital of Guangzhou Medical University.

Liang-Di Gao is a research fellow at the Institute of Neuroscience and the Second Affiliated Hospital of Guangzhou Medical University.

Na He is a physician at the Institute of Neuroscience and Department of Neurology of the Second Affiliated Hospital of Guangzhou Medical University.

Xiao-Rong Liu is a professor at the Institute of Neuroscience and the Second Affiliated Hospital of Guangzhou Medical University.

Yue-Sheng Long is a professor at the Institute of Neuroscience and the Second Affiliated Hospital of Guangzhou Medical University.

Yang Zeng is a research fellow at the Institute of Neuroscience and the Second Affiliated Hospital of Guangzhou Medical University.

Yong-Hong Yi is a professor at the Institute of Neuroscience and Department of Neurology of the Second Affiliated Hospital of Guangzhou Medical University.

Tao Su is an associate professor at the Institute of Neuroscience and the Second Affiliated Hospital of Guangzhou Medical University.

Wei-Ping Liao is the director of the Institute of Neuroscience and the Second Affiliated Hospital of Guangzhou Medical University and an Associate Editor of Seizure. His main research is in genetics. He has received the Ambassador for Epilepsy award, among others.

References

1. Rodrigues C, Santos-Silva A, Costa E, et al. Performance of in silico tools for the evaluation of UGT1A1 missense variants. Hum Mutat 2015;36:1215–25.

2. Leong IU, Stuckey A, Lai D, et al. Assessment of the predictive accuracy of five in silico prediction tools, alone or in combination, and two metaservers to classify long QT syndrome gene mutations. BMC Med Genet 2015;16:34.

3. Kerr ID, Cox HC, Moyes K, et al. Assessment of in silico protein sequence analysis in the clinical classification of variants in cancer risk genes. J Community Genet 2017;8:87–95.

4. Chen YJ, Shi YW, Xu HQ, et al. Electrophysiological differences between the same pore region mutation in SCN1A and SCN3A. Mol Neurobiol 2015;51:1263–70.

5. Meisler MH, Kearney JA. Sodium channel mutations in epilepsy and other neurological disorders. J Clin Invest 2005;115:2010–7.

6. Mulley JC, Scheffer IE, Petrou S, et al. SCN1A mutations and epilepsy. Hum Mutat 2005;25:535–42.

7. Meng H, Xu HQ, Yu L, et al. The SCN1A mutation database: updating information and analysis of the relationships among genotype, functional alteration, and phenotype. Hum Mutat 2015;36:573–80.

8. Ohmori I, Ouchida M, Ohtsuka Y, et al. Significant correlation of the SCN1A mutations and severe myoclonic epilepsy in infancy. Biochem Biophys Res Commun 2002;295:17–23.

9. Claes L, Del-Favero J, Ceulemans B, et al. De novo mutations in the sodium-channel gene SCN1A cause severe myoclonic epilepsy of infancy. Am J Hum Genet 2001;68:1327–32.

10. Escayg A, MacDonald BT, Meisler MH, et al. Mutations of SCN1A, encoding a neuronal sodium channel, in two families with GEFS+2. Nat Genet 2000;24:343–5.

11. Scheffer IE, Berkovic SF. Generalized epilepsy with febrile seizures plus. A genetic disorder with heterogeneous clinical phenotypes. Brain 1997;120(Pt 3):479–90.

12. Marini C, Scheffer IE, Nabbout R, et al. The genetics of Dravet syndrome. Epilepsia 2011;52(Suppl 2):24–9.

13. Scheffer IE, Zhang YH, Jansen FE, et al. Dravet syndrome or genetic (generalized) epilepsy with febrile seizures plus? Brain Dev 2009;31:394–400.

14. Wei F, Yan LM, Su T, et al. Ion channel genes and epilepsy: functional alteration, pathogenic potential, and mechanism of epilepsy. Neurosci Bull 2017;33:455–77.

15. He N, Lin ZJ, Wang J, et al. Evaluating the pathogenic potential of genes with de novo variants in epileptic encephalopathies. Genet Med 2019;21:17–27.

16. Liao WP, Shi YW, Long YS, et al. Partial epilepsy with antecedent febrile seizures and seizure aggravation by antiepileptic drugs: associated with loss of function of Na(v) 1.1. Epilepsia 2010;51:1669–78.

17. Catterall WA. From ionic currents to molecular mechanisms. Neuron 2000;26:13–25.

18. Stenson PD, Mort M, Ball EV, et al. The human gene mutation database: towards a comprehensive repository of inherited mutation data for medical research, genetic diagnosis and next-generation sequencing studies. Hum Genet 2017;136:665–77.

19. Lek M, Karczewski KJ, Minikel EV, et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 2016;536:285–91.

20. Jagadeesh KA, Paggi JM, Ye JS, et al. S-CAP extends pathogenicity prediction to genetic variants that affect RNA splicing. Nat Genet 2019;51:755–63.

21. Jagadeesh KA, Wenger AM, Berger MJ, et al. M-CAP eliminates a majority of variants of uncertain significance in clinical exomes at high sensitivity. Nat Genet 2016;48:1581–6.

22. Rentzsch P, Witten D, Cooper GM, et al. CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res 2019;47:D886–D894.

23. Ionita-Laza I, McCallum K, Xu B, et al. A spectral approach integrating functional genomic annotations for coding and noncoding variants. Nat Genet 2016;48:214–20.

24. Ray P, Le Y, Riou B, et al. Statistical evaluation of a biomarker. Anesthesiology 2010;112:1023–40.

25. Richards S, Aziz N, Bale S, et al. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet Med 2015;17:405–24.

26. Kanai K, Hirose S, Oguni H, et al. Effect of localization of missense mutations in SCN1A on epilepsy phenotype severity. Neurology 2004;63:329–34.

27. Escayg A, Goldin AL. Sodium channel SCN1A and epilepsy: mutations and mechanisms. Epilepsia 2010;51:1650–8.

28. Waxman SG. Channel, neuronal and clinical function in sodium channelopathies: from genotype to phenotype. Nat Neurosci 2007;10:405.

29. Steinlein OK. Genetic mechanisms that underlie epilepsy. Nat Rev Neurosci 2004;5:400.

30. Holland KD, Bouley TM, Horn PS. Comparison and optimization of in silico algorithms for predicting the pathogenicity of sodium channel variants in epilepsy. Epilepsia 2017;58:1190–8.

31. Wang JY, Zhou P, Wang J, et al. ARHGEF9 mutations in epileptic encephalopathy/intellectual disability: toward understanding the mechanism underlying phenotypic variation. Neurogenetics 2018;19:9–16.

32. Tang H, Thomas PD. Tools for predicting the functional impact of nonsynonymous genetic variation. Genetics 2016;203:635–47.

33. Pejaver V, Mooney SD, Radivojac P. Missense variant pathogenicity predictors generalize well across a range of function-specific prediction challenges. Hum Mutat 2017;38:1092–108.

34. Adzhubei IA, Schmidt S, Peshkin L, et al. A method and server for predicting damaging missense mutations. Nat Methods 2010;7:248–9.

35. Hecht M, Bromberg Y, Rost B. Better prediction of functional effects for sequence variants. BMC Genomics 2015;16:S1.

36. Choi Y, Chan AP. PROVEAN web server: a tool to predict the functional effect of amino acid substitutions and indels. Bioinformatics 2015;31:2745–7.

Author notes

Bin Tang and Bin Li contributed equally to this work.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
