Summary

Validation of phenotyping models using Electronic Health Record (EHR) data conventionally requires gold-standard case and control labels. The labeling process requires clinical experts to retrospectively review patients’ medical charts and is therefore labor-intensive and time-consuming. For some disease conditions, identifying gold-standard controls is prohibitive because routine clinical assessments are performed only for selected patients who are deemed to possibly have the condition. To build a model for phenotyping patients in EHRs, the most readily accessible data are often for a cohort consisting of a set of gold-standard cases and a large number of unlabeled patients. Here, we propose methods for assessing model calibration and discrimination using such “positive-only” EHR data, without requiring gold-standard controls, provided that the labeled cases are representative of all cases. For model calibration, we propose a novel statistic that aggregates differences between model-free and model-based estimated numbers of cases across risk subgroups, which asymptotically follows a Chi-squared distribution. We additionally demonstrate that the calibration slope can also be estimated using such “positive-only” data. We propose consistent estimators for discrimination measures and derive their large sample properties. We demonstrate the performance of the proposed methods through extensive simulation studies and apply them to Penn Medicine EHRs to validate two preliminary models for predicting the risk of primary aldosteronism.

1. Introduction

Much effort has been devoted to the development of automated phenotyping algorithms using electronic health record (EHR) data, which are classification rules based on features extracted from patients’ EHRs, both structured and unstructured, to infer whether a patient has a specific phenotype (Shivade and others, 2013; Pathak and others, 2013; Hong and others, 2019; Yu and others, 2016). Discrimination and calibration are the two main aspects of validating phenotyping algorithms. Discrimination is the ability to separate patients with the specific phenotype from those without it. Commonly used metrics include true positive rate (TPR), false positive rate (FPR), positive predictive value (PPV), negative predictive value (NPV), and area under the ROC curve (AUC) (Claesen and others, 2015). Even when a phenotyping algorithm has good discrimination, its probability estimates can be unreliable; for example, they can be systematically higher or lower than the truth. The agreement between observed and predicted phenotype probabilities is called calibration. Model calibration is especially important when the actual probability of a patient having the phenotype is needed in downstream inference (Hong and others, 2019; Wang and others, 2020).

Conventionally, the development and validation of EHR-based phenotyping algorithms rely on a curated data set in which every patient is labeled accurately with regard to presence (case) or absence (control) of the phenotype of interest. For example, conventional Hosmer–Lemeshow-type calibration tests are based on the differences between observed and expected numbers of cases across patient subgroups (Hosmer Jr and others, 2013; Hosmer and others, 1988; Tsiatis, 1980; Windmeijer, 1990; Song and others, 2014). Accurate assignment of case and control labels usually requires domain experts to retrospectively review patients’ medical charts and is therefore labor-intensive and time-consuming. This limits the achievable sample size and compromises phenotyping accuracy. In addition, for many diseases, since patients are usually actively screened based on clinical suspicion, the EHR data are often insufficient for identifying a large group of gold-standard controls. Therefore, frequently, the most readily accessible EHR data are for an incomplete set of gold-standard cases and a large number of unlabeled patients whose case and control status are unknown (Elkan and Noto, 2008). Hereafter, we refer to the data for such an incompletely labeled EHR cohort as “positive-only” data.

We previously proposed a maximum likelihood approach to developing phenotyping models using positive-only data within an anchor variable framework (Zhang and others, 2019). This approach greatly reduces the dependency on gold-standard labeling since it does not require labeled controls. It relies on the specification of an anchor variable (Halpern and others, 2014) for the target phenotype to label the set of cases instead of labor-intensive manual chart review. An anchor variable is a binary variable specified by domain experts as highly informative of patients’ true phenotype status. The patients with positive anchor variable values form a random subset of all cases in the population because of two unique properties of the anchor variable. First, an anchor variable has perfect positive predictive value, i.e., a positive anchor indicates the presence of the phenotype, while a negative anchor is nondeterministic of the patient’s phenotype status. Second, an anchor variable has constant sensitivity, i.e., conditional on phenotype presence, the anchor being positive or negative is independent of the phenotyping model predictors. For example, a pathologic diagnosis of cancer can in most scenarios serve as a positive anchor because of its high PPV and its imperfect sensitivity, the latter due to variability in practice or documentation, variability in diagnostic categories, or data incompleteness (Zhang and others, 2019). Domain knowledge and a small set of chart reviews are typically exploited for anchor variable specification and validation (Halpern and others, 2014; Halpern and others, 2016); however, this typically requires less upfront effort from clinical experts than labor-intensive chart review for identifying gold-standard cases and controls.

In this project, motivated by the goal of avoiding labor-intensive labeling, we propose statistical methods for validating phenotyping models using positive-only EHR data within an anchor variable framework. To overcome the challenge caused by the absence of gold-standard labels for the unlabeled patients, we first propose a model-free estimator for the number of cases in each risk subgroup among the unlabeled patients, taking advantage of the two defining characteristics of the anchor variable. The model-free and model-based (i.e., “expected”) estimated numbers of cases should be close when the phenotyping model is well calibrated; the extent to which these two estimates disagree therefore reflects the degree of miscalibration of the phenotyping model. We then construct a novel calibration statistic quantifying this disagreement across risk subgroups of the unlabeled patients. We also demonstrate how to estimate the calibration slope using positive-only data with our previously proposed maximum likelihood approach (Zhang and others, 2019).

The article is organized as follows. We first present our statistic for assessing model calibration using positive-only data when the phenotype prevalence is known or unknown, respectively. We derive the asymptotic distribution of the proposed statistic and show that it follows a |$\chi^2$| distribution when the fitted model is well calibrated. We then show how to estimate the calibration slope using positive-only data. We next present estimators of discrimination performance measures, including TPR, FPR, PPV, NPV, and AUC, and establish their asymptotic properties. We conduct extensive simulation studies to assess the type I error rate of the proposed calibration statistic, demonstrate its power for detecting poor calibration, and assess the statistical consistency of the proposed estimators for discrimination performance measures. Lastly, we apply the proposed validation methods to assess the calibration and discrimination performance of the preliminary phenotyping models for primary aldosteronism that were derived from Penn Medicine EHR data (Zhang and others, 2019).

2. Methods

2.1. The anchor variable framework

Let |$Y_i$| denote the binary label for the phenotype of interest for patient |$i$|⁠, e.g., an indicator for the patient being a case (⁠|$Y_i=1$|⁠) or control (⁠|$Y_i=0$|⁠). Let |${\bf X}_i$| denote a |$p$|-dimensional vector of predictors for |$Y_i$|⁠, which includes 1 to allow the intercept term, and |$S_i$| denote the binary anchor variable, which takes values |$S_i=1$| (positive) or |$S_i=0$| (negative). The positive-only data consist of observations on |${\bf X}$| and |$S$| for a random sample of |$N$| patients, |$\{({\bf X}_i, S_i), i=1, \dots,N\}$|⁠, where |$({\bf X}_i, S_i)$|⁠, |$i=1, \ldots, N$|⁠, are assumed to be independent and identically distributed (i.i.d). Due to the i.i.d assumption, from here on, we ignore the subindex |$_i$| whenever there is no confusion. The perfect PPV of the anchor variable, i.e., |${\rm PPV}= p(Y=1|S=1)=1$|⁠, implies that |$Y$| is known to be |$1$| for those with |$S=1$|⁠. However, |$Y$| can be either |$0$| or |$1$| when |$S$| is |$0$|⁠. Thus the anchor-positive patients, those with |$({\bf X}, S=1)$|⁠, are automatically labeled as cases with |$Y=1$|⁠, and the anchor-negative patients, those with |$({\bf X}, S=0)$|⁠, are a mixture of cases and controls. The anchor variable also has a constant sensitivity, explicitly written as
|$p(S=1|Y=1,{\bf X})=p(S=1|Y=1)=c,$|
where |$c$| is a constant between |$0$| and |$1$|. That is, given |$Y=1$|, |$S$| and |${\bf X}$| are independent. By Bayes rule, it is easy to show that
|$c=p(S=1)/p(Y=1).$| (2.1)
Let |$P({\bf X};\boldsymbol{\beta})=p(Y=1|{\bf X}; \boldsymbol{\beta})$| denote the phenotyping model of interest, which can be any practically reasonable parametric model to relate |$Y$| and |${\bf X}$|⁠. As an illustration, we consider
|$P({\bf X};\boldsymbol{\beta})=\frac{\exp({\bf X}^{\rm T}\boldsymbol{\beta})}{1+\exp({\bf X}^{\rm T}\boldsymbol{\beta})}.$| (2.2)
Let |$q$| be the phenotype prevalence, i.e., |$q=\int P({\bf X};\boldsymbol{\beta}){\rm d}F({\bf X})$|, where |$F({\bf X})$| denotes the cumulative distribution function of |${\bf X}$|. As illustrated in Zhang and others (2019), the two anchor properties imply |$p(S=1|{\bf X})=cP({\bf X};\boldsymbol{\beta})$|, so the parameters |$(\boldsymbol{\beta},c)$| can be estimated simultaneously by maximizing the likelihood function for the positive-only data,
|$L(\boldsymbol{\beta},c)=\prod_{i=1}^{N}\{cP({\bf X}_i;\boldsymbol{\beta})\}^{S_i}\{1-cP({\bf X}_i;\boldsymbol{\beta})\}^{1-S_i}.$|
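To make this estimation step concrete, the following R sketch fits |$(\boldsymbol{\beta},c)$| by direct maximization of the likelihood above, assuming the logistic model (2.2); the function name and setup are illustrative and not part of the software released with the paper.

```r
# Minimal sketch: maximum likelihood for (beta, c) from positive-only data,
# assuming the logistic model (2.2). Illustrative only.
fit_positive_only <- function(S, X) {
  X <- cbind(1, X)                          # prepend intercept column
  negloglik <- function(par) {
    beta <- par[-length(par)]
    cc   <- plogis(par[length(par)])        # constrain sensitivity c to (0, 1)
    pS1  <- cc * plogis(drop(X %*% beta))   # p(S = 1 | X) = c * P(X; beta)
    -sum(S * log(pS1) + (1 - S) * log(1 - pS1))
  }
  opt <- optim(rep(0, ncol(X) + 1), negloglik, method = "BFGS")
  list(beta = opt$par[seq_len(ncol(X))], c = plogis(opt$par[ncol(X) + 1]))
}
```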
2.2. Model calibration using positive-only EHR

The classical Hosmer–Lemeshow goodness-of-fit test (Hosmer and others, 1988) is widely applied when the case and control status are fully observed and the model is logistic. Specifically, the test can be summarized in three steps. First, it sorts all subjects by predicted risk. Second, it divides the sorted subjects into |$G$| risk subgroups. Third, it enumerates the observed and expected numbers of cases within each subgroup and calculates the statistic
|$T_{HL}=\sum_{g=1}^{G}\frac{(O_g-E_g)^2}{N_g\bar p_g(1-\bar p_g)},$|
where |$O_g$| and |$E_g=N_g\bar p_g$| are the observed and expected numbers of cases in subgroup |$g$|, respectively, |$\bar p_g$| is the average predicted risk in the |$g$|th subgroup, and |$N_g$| is the number of subjects in subgroup |$g$|. The test statistic |$T_{HL}$| asymptotically follows a |$\chi ^{2}$| distribution with |$G-2$| degrees of freedom.
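For reference, a compact R sketch of the three steps above, assuming fully observed labels `y` and fitted probabilities `phat`; this is the classical test for labeled data, not the positive-only method developed below.

```r
# Classical Hosmer-Lemeshow test with G equal-size risk subgroups.
hosmer_lemeshow <- function(y, phat, G = 10) {
  grp <- cut(phat, quantile(phat, probs = seq(0, 1, length.out = G + 1)),
             include.lowest = TRUE)         # sort and divide by predicted risk
  O  <- tapply(y, grp, sum)                 # observed cases per subgroup
  Ng <- tapply(y, grp, length)              # subgroup sizes
  pg <- tapply(phat, grp, mean)             # average predicted risk per subgroup
  T_HL <- sum((O - Ng * pg)^2 / (Ng * pg * (1 - pg)))
  c(statistic = T_HL, p.value = pchisq(T_HL, df = G - 2, lower.tail = FALSE))
}
```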
Since the true phenotype status |$Y$| is not observed for the unlabeled patients in the positive-only data, the observed number of cases |$O_g$| cannot be obtained, and the Hosmer–Lemeshow test becomes infeasible. Hereby we propose a new method that assesses calibration of a phenotyping model using positive-only data. We first divide the unlabeled patients into |$K$| pre-defined risk regions. Let |$p\{Y=1|P({\bf X};\boldsymbol{\beta}) \in I_l, S=0\}$| denote the conditional probability of an unlabeled patient being a case in risk region |$I_l, l=1,...,K$|. Noting that |$p\{P({\bf X};\boldsymbol{\beta}) \in I_l|Y=1,S=0\}=p\{P({\bf X};\boldsymbol{\beta}) \in I_l|S=1\}$| because |${\bf X}$| and |$S$| are independent conditional on |$Y=1$| and |$p(Y=1\mid S=1)=1$|, we can rewrite |$p\{Y=1|P({\bf X};\boldsymbol{\beta}) \in I_l, S=0\}$| as
|$p\{Y=1|P({\bf X};\boldsymbol{\beta}) \in I_l, S=0\}=\frac{p\{P({\bf X};\boldsymbol{\beta}) \in I_l|S=1\}\{q-p(S=1)\}}{p\{P({\bf X};\boldsymbol{\beta}) \in I_l, S=0\}}.$|
Let |$P({\bf X};\widehat{\boldsymbol{\beta}})$| denote the predicted disease risk. We can therefore construct a model-free nonparametric estimator of the conditional probability, |$\widehat p^n \{Y=1|P({\bf X};\widehat{\boldsymbol{\beta}}) \in I_l, S=0\}$|, as
|$\widehat p^n \{Y=1|P({\bf X};\widehat{\boldsymbol{\beta}}) \in I_l, S=0\}=\frac{N^{-1}\sum_{i=1}^N I\{P({\bf X}_i;\widehat{\boldsymbol{\beta}}) \in I_l, S_i=1\}}{N^{-1}\sum_{i=1}^N I(S_i=1)}\cdot\frac{q-N^{-1}\sum_{i=1}^N I(S_i=1)}{N^{-1}\sum_{i=1}^N I\{P({\bf X}_i;\widehat{\boldsymbol{\beta}}) \in I_l, S_i=0\}}.$| (2.3)
We further note that |$p\{Y=1|P({\bf X};\boldsymbol{\beta}) \in I_l, S=0\}$| can alternatively be expressed as a function of the parameters |$\boldsymbol{\beta}$| and |$c$|:
|$p\{Y=1|P({\bf X};\boldsymbol{\beta}) \in I_l, S=0\}=\frac{\int I\{P({\bf X};\boldsymbol{\beta}) \in I_l\}(1-c)P({\bf X};\boldsymbol{\beta})\,{\rm d}F({\bf X})}{\int I\{P({\bf X};\boldsymbol{\beta}) \in I_l\}\{1-cP({\bf X};\boldsymbol{\beta})\}\,{\rm d}F({\bf X})}.$|
Therefore a model-based parametric estimator of the same conditional probability can be obtained as
|$\widehat p^m\{Y=1|P({\bf X};\widehat{\boldsymbol{\beta}}) \in I_l, S=0\}=\frac{\sum_{i=1}^N I\{P({\bf X}_i;\widehat{\boldsymbol{\beta}}) \in I_l\}(1-\hat c)P({\bf X}_i;\widehat{\boldsymbol{\beta}})}{\sum_{i=1}^N I\{P({\bf X}_i;\widehat{\boldsymbol{\beta}}) \in I_l\}\{1-\hat c P({\bf X}_i;\widehat{\boldsymbol{\beta}})\}},$| (2.4)
where |$\hat{c}$| is an estimate of |$c$| obtained as |$\hat c=N^{-1}\sum_{i=1}^N I(S_i=1)/q$| following (2.1). The parametric estimate |$\widehat p^m\{Y=1|P({\bf X};\widehat{\boldsymbol{\beta}}) \in I_l, S=0\}$| and nonparametric estimate |$\widehat p^n \{Y=1|P({\bf X};\widehat{\boldsymbol{\beta}}) \in I_l, S=0\}$| should be close to each other under the null hypothesis that the prediction model is correct. Therefore, the extent to which these two estimates agree provides evidence for fit or lack-of-fit of the model.
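A small R sketch of the two group-wise estimates, under the reconstructed forms of (2.3) and (2.4) above; `phat` denotes the fitted risks |$P({\bf X}_i;\widehat{\boldsymbol{\beta}})$|, `S` the anchor values, and `q` the known prevalence (all names illustrative).

```r
# Nonparametric (2.3) and parametric (2.4) estimates per risk region.
group_estimates <- function(phat, S, q, breaks = seq(0, 1, by = 0.1)) {
  grp   <- cut(phat, breaks, include.lowest = TRUE)
  c_hat <- mean(S) / q                               # following (2.1)
  p_n <- sapply(levels(grp), function(g) {
    (mean(grp == g & S == 1) / mean(S)) *            # p{P in I_l | S = 1}
      (q - mean(S)) / mean(grp == g & S == 0)        # x (q - p(S=1)) / p{P in I_l, S=0}
  })
  p_m <- sapply(levels(grp), function(g) {
    idx <- grp == g
    sum((1 - c_hat) * phat[idx]) / sum(1 - c_hat * phat[idx])
  })
  data.frame(interval = levels(grp), nonparametric = p_n, parametric = p_m)
}
```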
 
Remark 2.1

Note that although the expression |$P({\bf X};\widehat{\boldsymbol{\beta}})$| appears in (2.3), it does not have to be a correct prediction model in order to yield a valid statistic |$\widehat p^n \{Y=1|P({\bf X};\widehat{\boldsymbol{\beta}}) \in I_l, S=0\}$|; it is only used as a way to group the covariates. In fact, we could have chosen a completely arbitrary function to group the covariates, and the statistic would still be valid. This is not the case in the construction of |$\widehat p^m\{Y=1|P({\bf X};\widehat{\boldsymbol{\beta}}) \in I_l, S=0\}$|, which will not be a consistent estimator of |$p\{Y=1|P({\bf X};\boldsymbol{\beta}) \in I_l, S=0\}$| if the prediction model |$P({\bf X};\boldsymbol{\beta})$| is wrong. In this sense, we name the two estimators nonparametric and parametric, respectively.

2.2.1. When |$q$| is known

There are various situations in EHRs where the phenotype prevalence |$q$| is known; for example, it may be an educated estimate from clinicians or obtained from other investigations. To assess the overall goodness-of-fit, we propose a novel statistic |$T^{q}$|. It aggregates the differences between the nonparametric and parametric estimates of the number of cases across |$K-1$| out of the |$K$| risk groups, in the same spirit as the Hosmer–Lemeshow statistic, taking into account that the difference in the |$K$|th interval is determined by the differences in the other |$K-1$| intervals. It additionally integrates the departure of the estimated phenotype prevalence from the known value, i.e., |$\widehat{q}-q$|, into the calibration assessment. Specifically, we construct the statistic |$T^q$| as
|$T^{q}=N\,{\bf U}_{K}^{\rm T}{\bf V}_{K}^{-1}{\bf U}_{K},$| (2.5)
where |${\bf U}_{K}$| is a length |$K$| vector whose |$l$|th element (|$l=1,...,K-1$|) is
|$U_{K,l}=\widehat p^n \{Y=1|P({\bf X};\widehat{\boldsymbol{\beta}}) \in I_l, S=0\}-\widehat p^m\{Y=1|P({\bf X};\widehat{\boldsymbol{\beta}}) \in I_l, S=0\},$|
and the |$K$|th element is |$U_{K,K} = \widehat q-q=N^{-1}\sum_{i=1}^N P({\bf X}_i;\widehat{\boldsymbol{\beta}})-q$|. |${\bf V}_{K}$| is the asymptotic variance of |${\bf U}_{K}$|. The test statistic |$T^{q}$| asymptotically follows a |$\chi^2$| distribution with |$K$| degrees of freedom under |$H_0$|, i.e., when the fitted model is correct. Detailed proofs and the estimation of |${\bf V}_{K}$| are included in Appendix B of the Supplementary material available at Biostatistics online.
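A sketch of this quadratic form in R, under the reconstruction of (2.5) above; `V` is assumed to be an estimate of the asymptotic variance of |$\sqrt{N}\,{\bf U}_K$| (obtained, e.g., analytically as in Appendix B or by a nonparametric bootstrap), and the |$N$| scaling follows that convention.

```r
# Quadratic-form calibration test: U is the length-K vector of differences
# (last element qhat - q), V an estimate of its asymptotic variance.
calibration_test <- function(U, V, N) {
  stat <- N * drop(t(U) %*% solve(V) %*% U)
  c(statistic = stat,
    p.value = pchisq(stat, df = length(U), lower.tail = FALSE))
}
```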

2.2.2. When |$q$| is unknown

When the phenotype prevalence |$q$| is unknown, we must estimate it in (2.3). To this end, we can form an estimator for |$q$| as |$\widehat q=N^{-1}\sum_{i=1}^N P({\bf X}_i;\widehat{\boldsymbol{\beta}})$|. Thus the nonparametric estimate |$\widehat p^n \{Y=1|P({\bf X};\widehat{\boldsymbol{\beta}}) \in I_l, S=0\}$| becomes
|$\widehat p^n \{Y=1|P({\bf X};\widehat{\boldsymbol{\beta}}) \in I_l, S=0\}=\frac{N^{-1}\sum_{i=1}^N I\{P({\bf X}_i;\widehat{\boldsymbol{\beta}}) \in I_l, S_i=1\}}{N^{-1}\sum_{i=1}^N I(S_i=1)}\cdot\frac{\widehat q-N^{-1}\sum_{i=1}^N I(S_i=1)}{N^{-1}\sum_{i=1}^N I\{P({\bf X}_i;\widehat{\boldsymbol{\beta}}) \in I_l, S_i=0\}}.$|
In the parametric estimate |$\widehat p^m \{Y=1|P({\bf X};\widehat{\boldsymbol{\beta}}) \in I_l, S=0\}$| in equation (2.4), |$c$| is estimated as |$\hat{c} = \sum_{i=1}^N I(S_i=1)/\sum_{i=1}^N P({\bf X}_i;\widehat{\boldsymbol{\beta}})$|. In this scenario, we use the calibration statistic |$T^{nq}$| to aggregate the differences between |$\widehat p^n \{Y=1|P({\bf X};\widehat{\boldsymbol{\beta}}) \in I_l, S=0\}$| and |$\widehat p^m \{Y=1|P({\bf X};\widehat{\boldsymbol{\beta}}) \in I_l, S=0\}$|,
|$T^{nq}=N\,{\bf U}_{K-1}^{\rm T}{\bf V}_{K-1}^{-1}{\bf U}_{K-1},$| (2.6)
where |${\bf U}_{K-1}$| is a length |$K-1$| vector whose |$l$|th element is
|$U_{K-1,l}=\widehat p^n \{Y=1|P({\bf X};\widehat{\boldsymbol{\beta}}) \in I_l, S=0\}-\widehat p^m\{Y=1|P({\bf X};\widehat{\boldsymbol{\beta}}) \in I_l, S=0\},$|
and |${\bf V}_{K-1}$| is the asymptotic variance of |${\bf U}_{K-1}$|. Thus the test statistic |$T^{nq}$| asymptotically follows a |$\chi^2$| distribution with |$K-1$| degrees of freedom when the fitted model is correct. Detailed proofs are included in Appendix C of the Supplementary material available at Biostatistics online.
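In R, the plug-in quantities for the unknown-prevalence case are one-liners; a sketch, assuming `phat` and `S` as before.

```r
# Plug-in prevalence and anchor sensitivity when q is unknown.
plugin_q_c <- function(phat, S) {
  q_hat <- mean(phat)              # qhat = N^{-1} * sum_i P(X_i; beta_hat)
  c_hat <- sum(S) / sum(phat)      # chat = sum_i I(S_i = 1) / sum_i P(X_i; beta_hat)
  c(q_hat = q_hat, c_hat = c_hat)
}
# T^{nq} is then the same quadratic form as T^q over the K-1 differences,
# referred to a chi-squared distribution with K-1 degrees of freedom.
```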
 
Remark 2.2

Strictly speaking, the nonparametric estimator |$\widehat p^n \{Y=1|P({\bf X};\widehat{\boldsymbol{\beta}}) \in I_l, S=0\}$| is not totally nonparametric once we replace |$q$| by |$N^{-1}\sum_{i=1}^N P({\bf X}_i;\widehat{\boldsymbol{\beta}})$|, in the sense that when the prediction model |$P({\bf X};\boldsymbol{\beta})$| is wrong, |$\widehat p^n \{Y=1|P({\bf X};\widehat{\boldsymbol{\beta}}) \in I_l, S=0\}$| is no longer a consistent estimator of |$ p\{Y=1|P({\bf X};\boldsymbol{\beta}) \in I_l, S=0\}$|. Fortunately, we are only interested in the deviation between the parametric and nonparametric estimates, and they remain valid estimators of the same quantity under |$H_0$| while deviating under the alternative.

We emphasize that how the risk subgroups |$I_l$|⁠, |$l=1,2, \ldots, K$|⁠, are formed does not affect the development and validity of the proposed calibration methods. In the following, we form the risk subgroups according to the fitted probabilities |$P({\bf X};\widehat {\boldsymbol{\beta}})$| in such a manner that the use of |$K=10$| risk subgroups results in cutpoints defined at the values |$k/10, k=1,2,...,9$|⁠, and the risk subgroups contain all patients with fitted probability between the adjacent cutpoints, e.g., the first group contains patients with |$P({\bf X}; \widehat{\boldsymbol{\beta}})\leqslant0.1$|⁠, the second group contains patients with |$0.1 < P({\bf X}; \widehat{\boldsymbol{\beta}}) \leqslant 0.2$|⁠, |$\dots$|⁠, and the tenth group contains those with |$0.9<P({\bf X};\widehat{\boldsymbol{\beta}})\leqslant1$|⁠.
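For example, the fixed-cutpoint grouping with |$K=10$| can be formed in R as below, with stand-in fitted probabilities for illustration.

```r
# Fixed cutpoints at k/10, k = 1, ..., 9; first group is [0, 0.1], last (0.9, 1].
phat <- runif(1000)  # stand-in fitted probabilities for illustration
grp  <- cut(phat, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)
table(grp)           # number of patients in each risk subgroup
```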

2.2.3. Estimation of calibration slope using positive-only EHR

The positive-only EHR data can also be analyzed to estimate alternative calibration metrics, such as the calibration slope. Let |$Z \equiv {\rm logit} \ P({\bf X}; \widehat{\boldsymbol{\beta}})={\bf X}^{\rm T}\widehat{\boldsymbol\beta}$| denote the logit-transformed predicted risk |$P({\bf X}; \widehat{\boldsymbol{\beta}})$|. The calibration slope is defined as the estimated log odds ratio parameter |$b$| in a logistic regression model that relates the true phenotype status |$Y$| to |$Z$|,
|${\rm logit}\ p(Y=1|Z)=a+bZ.$|
The calibration slope |$b$| should be equal to or close to |$1$| if the fitted model is correct. Following Zhang and others (2019), taking advantage of the two defining characteristics of the anchor variable, the calibration intercept |$a$|, slope |$b$|, and anchor sensitivity |$c$| can be estimated simultaneously by maximizing the likelihood function of the observed data,
|$L(a,b,c)=\prod_{i=1}^{N}\{c\,g(a+bZ_i)\}^{S_i}\{1-c\,g(a+bZ_i)\}^{1-S_i},\quad g(u)=\frac{e^u}{1+e^u}.$|
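A sketch of this estimation in R, under the reconstructed likelihood above; `phat` are the fitted risks from the phenotyping model (names illustrative).

```r
# Estimate calibration intercept a, slope b, and anchor sensitivity c
# from positive-only data by direct likelihood maximization.
fit_calibration_slope <- function(S, phat) {
  Z <- qlogis(phat)                                       # Z = logit P(X; beta_hat)
  negloglik <- function(par) {
    pS1 <- plogis(par[3]) * plogis(par[1] + par[2] * Z)   # c * expit(a + b Z)
    -sum(S * log(pS1) + (1 - S) * log(1 - pS1))
  }
  opt <- optim(c(0, 1, 0), negloglik, method = "BFGS")
  c(a = opt$par[1], b = opt$par[2], c = plogis(opt$par[3]))
}
```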

2.3. Discrimination performance assessment among positive-only EHR

To evaluate the discrimination performance of a well-calibrated model, we consider statistical measures including the true positive rate (TPR|$_v$|), false positive rate (FPR|$_v$|), positive predictive value (PPV|$_v$|), and negative predictive value (NPV|$_v$|) at a decision threshold |$v$|, as well as the area under the ROC curve (AUC). Given that anchor-positive patients are definitive cases by definition, we propose to assess model performance only in the unlabeled patients. All of these measures can be estimated using the positive-only data because of the two defining characteristics of the anchor variable. To be specific, TPR|$_v$| is defined as TPR|$_v\equiv p\{ P({\bf X}; \boldsymbol{\beta})>v|Y=1,S=0 \}$|. Recall that the anchor variable has the properties that |$S$| and |${\bf X}$| are independent given |$Y=1$| and that there are no false positives, so TPR|$_v=p\{P({\bf X}; \boldsymbol{\beta})>v|Y=1, S=1 \} = p\{P({\bf X}; \boldsymbol{\beta})>v|S=1 \}$|. This implies that TPR|$_v$| for the unlabeled patients can be estimated as |$\widehat{{\rm TPR}_v}=[N^{-1}\sum_{i=1}^N I\{P({\bf X}_i; \widehat{\boldsymbol{\beta}}) >v, S_i=1\}]/[N^{-1}\sum_{i=1}^N I(S_i=1)].$| Similarly, noting that |$Y=0$| implies |$S=0$|, FPR|$_v$| is defined and further rewritten as
|${\rm FPR}_v\equiv p\{P({\bf X}; \boldsymbol{\beta})>v|Y=0,S=0\}=\frac{p\{P({\bf X}; \boldsymbol{\beta})>v\}-q\,{\rm TPR}_v}{1-q},$|
and can be estimated as
|$\widehat{{\rm FPR}_v}=\frac{N^{-1}\sum_{i=1}^N I\{P({\bf X}_i; \widehat{\boldsymbol{\beta}})>v\}-\widehat q\;\widehat{{\rm TPR}_v}}{1-\widehat q}.$|

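A sketch of these two estimators in R (TPR as given in the text, FPR under the reconstruction above), with `phat`, `S`, and `q` as before.

```r
# Positive-only estimates of TPR_v and FPR_v at decision threshold v.
accuracy_at_threshold <- function(phat, S, q, v) {
  tpr <- mean(phat > v & S == 1) / mean(S)        # TPR_v = p{P(X; beta) > v | S = 1}
  fpr <- (mean(phat > v) - q * tpr) / (1 - q)     # subtract the case contribution
  c(TPR = tpr, FPR = fpr)
}
```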
Similar estimates of PPV|$_v$|, NPV|$_v$|, and AUC and their respective asymptotic properties are provided in Appendix D of the Supplementary material available at Biostatistics online.

3. Simulations

3.1. Simulation settings

We conducted extensive simulation studies to evaluate the validity and feasibility of the proposed Chi-squared statistics |$T^{q}$| and |$T^{nq}$| for model calibration among unlabeled patients, and reported their type I error rate and power at the 0.05 nominal significance level. We also evaluated the consistency and efficiency of the proposed estimators for discrimination measures using positive-only data.

In each Monte Carlo simulation, we first generated 9 independent predictors, with |$X_1$|, |$X_4$|, and |$X_7$| from the normal distribution |$N(5,10)$|, |$X_2$|, |$X_5$|, and |$X_8$| from the Bernoulli distribution with success rate |$0.5$|, and |$X_3$|, |$X_6$|, and |$X_9$| from the logit-transformed standard uniform distribution. We then generated the binary outcome variable |$Y$| based on a logistic model of the form
|${\rm logit}\ p(Y=1|{\bf X})=\beta_0+\sum_{j=1}^{9}\beta_j X_j.$| (3.1)

The regression coefficients were set at |$(\beta_1,\beta_2,\beta_3)=(0.2,0.4,0.6)$|, |$(\beta_4,\beta_5,\beta_6)=(-1.0,-1.4,1.8)$|, and |$(\beta_7,\beta_8,\beta_9)=(-2.0,2.4,2.8)$|, with |$(X_1,X_2,X_3)$|, |$(X_4,X_5,X_6)$|, and |$(X_7,X_8,X_9)$| resembling weak, moderate, and strong predictors, respectively. The intercept parameter |$\beta_0$| was selected so that the phenotype prevalence was fixed at |$20\%$|. We fixed the anchor sensitivity |$c$| at 0.5 and generated the anchor variable |$S$| as follows: (i) when |$Y=1$|, |$S$| takes the value |$1$| with probability |$c$| and (ii) when |$Y=0$|, |$S$| is always |$0$|. We ran 1000 Monte Carlo replicates for each setup.
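A sketch of this data-generating design in R; the intercept value below is illustrative (the paper tuned it so that the prevalence equals 20%), and |$N(5,10)$| is read here as mean 5, variance 10.

```r
# Simulated positive-only data following Section 3.1 (illustrative settings).
set.seed(1)
N <- 50000
X <- matrix(NA, N, 9)
X[, c(1, 4, 7)] <- rnorm(3 * N, mean = 5, sd = sqrt(10))  # continuous predictors
X[, c(2, 5, 8)] <- rbinom(3 * N, 1, 0.5)                  # binary predictors
X[, c(3, 6, 9)] <- qlogis(runif(3 * N))                   # logit of standard uniform
beta  <- c(0.2, 0.4, 0.6, -1.0, -1.4, 1.8, -2.0, 2.4, 2.8)
beta0 <- 5.3                        # illustrative; tuned so prevalence is roughly 20%
Y <- rbinom(N, 1, plogis(beta0 + drop(X %*% beta)))
S <- rbinom(N, 1, 0.5 * Y)          # anchor: sensitivity c = 0.5, perfect PPV
```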

3.2. Simulation results

As shown in Table 1, when the fitted model was correctly specified, the nonparametric and parametric estimates of the number of cases were nearly identical across all risk groups, with differences |$\leq$|1, indicating good calibration. The quantile plots of the statistics |$T^{q}$| and |$T^{nq}$| suggest that the distributions of the proposed statistics were well approximated by |$\chi^2$| distributions with |$K$| or |$K-1$| degrees of freedom when |$q$| was known or unknown, respectively (Figures 1 and 2). As shown in Table 3, the type I error rates for both |$T^{q}$| and |$T^{nq}$| were very close to the 0.05 nominal level, independent of the number of subgroups |$K$|.

Fig. 1. QQ plot of the calibration statistic |$T^{q}$| with |$K=5, 10, 12$| and |$N=50\,000$|. “df”: degrees of freedom.

Fig. 2. QQ plot of the calibration statistic |$T^{nq}$| with |$K=5, 10, 12$| and |$N=50\,000$|. “df”: degrees of freedom.

Table 1. Contingency table of nonparametric vs. parametric estimates of the number of cases in intervals of estimated probabilities. Sample size |$N=50\,000$|. Estimates are averaged over 1000 replications. “Correct”: fitted model correctly specified; “Minor”: fitted model missing weak predictors |$(X_1, X_2, X_3)$|; “Moderate”: fitted model missing moderate predictors |$(X_4, X_5, X_6)$|; “Severe”: fitted model missing weak and moderate predictors |$(X_1-X_6)$|; “Npara”: nonparametric; “Para”: parametric.

             Correct         Minor           Moderate        Severe
Interval   Npara   Para    Npara   Para    Npara   Para    Npara   Para
0.0-0.1       98     98      122    124      326    335      338    348
0.1-0.2      100    100      133    126      391    365      413    385
0.2-0.3      108    109      140    135      391    375      411    396
0.3-0.4      121    120      151    149      395    392      415    411
0.4-0.5      138    137      168    169      410    417      428    435
0.5-0.6      162    162      194    198      438    454      453    470
0.6-0.7      200    200      238    244      483    507      496    522
0.7-0.8      268    268      314    323      562    589      572    600
0.8-0.9      427    427      492    503      723    739      721    736
0.9-1.0     3483   3483     3198   3178     1331   1278     1239   1185
Table 3. Type I error rates of the proposed calibration tests when the estimated probabilities are partitioned into |$K=5,10,12$| groups at different sample sizes.

                          K=5     K=10    K=12
|$T^{q}$|   N=30 000     0.045   0.046   0.042
            N=50 000     0.042   0.047   0.045
            N=80 000     0.042   0.049   0.039
|$T^{nq}$|  N=30 000     0.041   0.048   0.045
            N=50 000     0.049   0.046   0.049
            N=80 000     0.048   0.048   0.049

To evaluate performance in terms of power, we deliberately fitted three misspecified models by omitting predictors from the model in (3.1) in order to demonstrate different levels of miscalibration: minor, which omitted the weak predictors |$(X_1,X_2,X_3)$|; moderate, which omitted the moderate predictors |$(X_4,X_5,X_6)$|; and severe, which omitted both the weak and moderate predictors (|$X_1$| through |$X_6$|). Poor calibration can arise for many different reasons; we use these three misspecified models as simple illustrative examples. As shown in the calibration plots (Figure 3), contingency table (Table 1), and calibration table (Table 2), the minor misspecified model is not poorly calibrated, while poor calibration is evident in the moderate and severe misspecified models. The level of disagreement between the nonparametrically and parametrically estimated numbers of cases, i.e., the indication of poor calibration, increased as the misspecification became more severe. Taking the probability interval 0.9–1.0 as an example, the difference between the nonparametric and parametric estimates of the number of cases increased from 20 under minor misspecification to 53 and 54 under moderate and severe misspecification, respectively.

The results in Table 4 also suggest that the fitted models were not poorly calibrated when only weak predictors were left out, with power less than 0.24 even when the sample size |$N$| was increased to 80 000. When moderate predictors were left out, our proposed calibration tests were able to flag the poor calibration. For example, at |$K=10$| and |$N=50\,000$|, the power was 0.69 and 0.73 when |$q$| was known, and 0.59 and 0.63 when |$q$| was unknown, for the models omitting moderate predictors and omitting both weak and moderate predictors, respectively. As expected, the power also improved with sample size. For example, when both weak and moderate predictors were missing from the model, at |$K=10$|, the power increased from 0.46 to 0.92 as the sample size increased from 30 000 to 80 000 when |$q$| was known, and from 0.38 to 0.87 when |$q$| was unknown (Table 4). Also as expected, knowing the phenotype prevalence |$q$| helped increase the power of the calibration test; for example, the power for detecting severe misspecification increased from 0.87 to 0.92 when |$N=80\,000$| and |$K=10$|. As illustrated in Table 4, the power of the proposed calibration tests |$T^q$| and |$T^{nq}$| did not rely much on the selection of the number of subgroups |$K$|, as the observed power was very similar across the different |$K$| values |$5, 10,$| and |$12$|. As shown in Table S1 of the Supplementary material available at Biostatistics online, the calibration slopes estimated using positive-only data were very close to their respective benchmarks, which were calculated using the fully observed |$Y$|, regardless of whether the fitted model was correct or misspecified.

Fig. 3. Calibration plot of the three misspecified models (from left to right: minor, moderate, severe) with |$N=50\,000$|.

Table 2. Calibration results for the fitted models with |$K=10$| and |$N=50\,000$|. “Correct”: fitted model correctly specified; “Minor”: fitted model missing weak predictors |$(X_1, X_2, X_3)$|; “Moderate”: fitted model missing moderate predictors |$(X_4, X_5, X_6)$|; “Severe”: fitted model missing weak and moderate predictors |$(X_1-X_6)$|. |$T^{q}$| and |$T^{nq}$| are averaged over 1000 replications; the p-value is calculated based on the averaged statistics |$T^{q}$| and |$T^{nq}$|.

                                        Correct          Minor            Moderate         Severe
                                    Stat.  p-value   Stat.  p-value   Stat.  p-value   Stat.  p-value
When |$q$| is known (|$T^{q}$|)     9.66     0.47    12.12     0.28    23.28     0.01    23.99     0.01
When |$q$| is unknown (|$T^{nq}$|)  8.64     0.47    10.88     0.28    19.56     0.02    20.21     0.02
Table 4. Power of the proposed calibration tests for detecting model misspecification at different numbers of subgroups |$K$| and sample sizes |$N$|. For model misspecification: “Minor”: fitted model missing weak predictors |$(X_1,X_2,X_3)$|; “Moderate”: fitted model missing moderate predictors |$(X_4,X_5,X_6)$|; “Severe”: fitted model missing weak and moderate predictors |$(X_1-X_6)$|.

                             Minor                Moderate             Severe
                  K=     5     10     12       5     10     12       5     10     12
|$T^{q}$|   N=30 000   0.10   0.09   0.10    0.46   0.43   0.43    0.47   0.46   0.44
            N=50 000   0.14   0.12   0.11    0.70   0.69   0.68    0.73   0.73   0.71
            N=80 000   0.20   0.17   0.16    0.91   0.92   0.91    0.93   0.92   0.92
|$T^{nq}$|  N=30 000   0.10   0.09   0.08    0.36   0.33   0.30    0.38   0.38   0.33
            N=50 000   0.14   0.12   0.12    0.58   0.59   0.57    0.64   0.63   0.63
            N=80 000   0.24   0.20   0.19    0.81   0.85   0.84    0.85   0.87   0.87

To evaluate the consistency of the proposed estimators for discrimination performance measures based on positive-only data, we compared these estimates to their benchmarks, which were calculated using the true phenotype status. As shown in Table 5, when the data-generating model (3.1) was fitted to each simulated data set, the accuracy measure estimators |$\widehat{{\rm TPR}_v}$|, |$\widehat{{\rm FPR}_v}$|, |$\widehat{{\rm PPV}_v}$|, |$\widehat{{\rm NPV}_v}$|, and |$\widehat{\rm AUC}$| were very close to their respective benchmarks. The negligibly small biases indicate the statistical consistency of the proposed estimators. To assess the asymptotic variance as an approximation of the true variance, we compared the average asymptotic (ASE) and empirical (ESE) standard errors based on 1000 replications. Some differences were observed with |$N=10\,000$|, as shown in Table S2 of the Supplementary material available at Biostatistics online; however, the differences became negligible when the sample size was increased to 80 000 (Table S2). It appears that relatively large sample sizes are needed for the asymptotic variances to closely approximate the true variances. We therefore recommend using bootstrapping to estimate standard errors in practice.
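Given the above recommendation, a minimal R sketch of the bootstrap, reusing the hypothetical `accuracy_at_threshold()` helper sketched in Section 2.3.

```r
# Bootstrap standard errors for the positive-only accuracy estimates.
bootstrap_se <- function(phat, S, q, v, B = 1000) {
  stats <- replicate(B, {
    idx <- sample(length(S), replace = TRUE)          # resample patients
    accuracy_at_threshold(phat[idx], S[idx], q, v)
  })
  apply(stats, 1, sd)                                 # empirical SE per measure
}
```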

Table 5. Discrimination performance measures evaluated among unlabeled patients when the true phenotype status is observed (benchmark) or with positive-only data, for |$N=10\,000$| and |$N=80\,000$|. Results are based on the mean over 1000 replications.

                       |$Y$| observed (Benchmark)               Positive-only
           Cutoff   PPV    NPV    TPR    FPR    AUC      PPV    NPV    TPR    FPR    AUC
N=10 000     0.2   0.677  0.994  0.959  0.059  0.990    0.678  0.995  0.962  0.059  0.991
             0.4   0.789  0.989  0.913  0.032           0.792  0.989  0.918  0.031
             0.6   0.868  0.981  0.855  0.017           0.873  0.982  0.860  0.016
             0.8   0.935  0.971  0.766  0.007           0.939  0.971  0.769  0.006
N=80 000     0.2   0.675  0.995  0.961  0.059  0.990    0.675  0.995  0.961  0.059  0.990
             0.4   0.789  0.989  0.916  0.032           0.789  0.989  0.916  0.031
             0.6   0.870  0.982  0.857  0.016           0.871  0.982  0.858  0.016
             0.8   0.938  0.971  0.765  0.007           0.938  0.971  0.766  0.006

4. Data example

4.1. Preliminary phenotyping models for primary aldosteronism

Primary aldosteronism (PA) is the most common cause of secondary hypertension, accounting for 5–10|$\%$| of hypertensive patients (Rossi and others, 1996; Denolle and others, 1993; Shigematsu and others, 1997). PA is severely under-detected (Mulatero and others, 2016). In order to efficiently identify patients with PA from the Penn Medicine EHR for further research, we previously developed two PA phenotyping models (Zhang and others, 2019). To do this, we first specified two anchor variables for PA, |$S_1$| and |$S_2$|. |$S_1$| indicated whether the patient was included in an existing expert-curated PA research registry. Every patient included in this registry is a definitive PA case (Wachtel and others, 2016), because each underwent a diagnostic procedure, adrenal vein sampling, which was performed only on patients confirmed to have primary aldosteronism. |$S_2$| was specified by adding to the |$S_1$|-positive set those patients with laboratory test orders for adrenal vein sampling cortisol or aldosterone testing, which is performed only as part of the adrenal vein sampling procedure.

The data set contained 6319 patients who had orders for a PA screening laboratory test, with 149 (2.4|$\%$|) cases labeled by |$S_1$| and 196 (3.1|$\%$|) labeled by |$S_2$|. For each anchor, |$S_1$| and |$S_2$|, we developed a corresponding logistic regression phenotyping model in the form of (2.2), with predictors including demographics, laboratory results, encounter metadata, and diagnosis codes selected from variables available at the time of screening, as well as some variables extracted from unstructured clinical text. The detailed forms of the phenotyping models are included in Supplementary Tables 6 and 7 of Zhang and others (2019). The labeling sensitivity of |$S_1$| and |$S_2$| was estimated as 0.56 (95|$\%$| CI 0.46–0.65) and 0.61 (95|$\%$| CI 0.54–0.69), and the PA prevalence as 4|$\%$| (95|$\%$| CI 3|$\%$|–5|$\%$|) and 5|$\%$| (95|$\%$| CI 4|$\%$|–6|$\%$|), respectively. In this study, we assess the calibration and discrimination performance of the developed models among the unlabeled patients.

4.2. Validation of PA phenotyping models

To validate the two positive-only EHR phenotyping models for PA, we tested their calibration and evaluated their discrimination performance among the unlabeled patients. All results are based on 1000 bootstrap replicates. As shown in Table 6, we observed good agreement between the nonparametric and parametric estimates of PA cases for both anchors, with difference |$\leq$|3 for all intervals except for 0.9–1.0, which had 106 nonparametric vs. 99 parametric PA cases when using anchor |$S_1$|. As shown in Table 7, when the PA prevalence was treated as unknown, the calibration statistic |$T^{nq}$| yielded p-values of 0.78 (|$T^{nq}=5.60$|) and 0.53 (|$T^{nq}=8.05$|) for the |$S_1$|- and |$S_2$|-based models with the number of subgroups |$K$| taken to be 10, indicating good calibration. The PA prevalence in our patient population is around |$5\%$| based on historical information. We therefore also applied the calibration test |$T^{q}$|, which indicated excellent goodness-of-fit of the fitted models (|$T^{q}=5.84$| and |$5.42$|; p-value |$=0.83$| and |$0.86$| for the |$S_1$|- and |$S_2$|-based models, respectively). We then evaluated their discrimination performance using the proposed estimators. For both anchors, the models achieved AUC |$>0.98$|. The PPV and TPR were estimated as |$0.83$| and |$0.75$| at the decision threshold 0.5 for the |$S_1$|-based phenotyping model, and |$0.81$| and |$0.79$| for the |$S_2$|-based model (Table 8). Charts for patients with |$p(Y=1|{\bf X};\widehat{\boldsymbol\beta})>0.5$| were reviewed by clinicians. According to the chart review results, the PPVs for the models based on |$S_1$| and |$S_2$| were estimated at |$0.78$| and |$0.77$|, respectively. Interestingly, the estimated PPVs based on the |$S_1$| model were slightly higher than those based on the |$S_2$| model, while the estimated TPRs were slightly lower. This might be due to the lower prevalence of PA cases among the patients unlabeled by |$S_2$|. The standard errors based on the |$S_2$| model were generally smaller for the corresponding estimates, which is reasonable because |$S_2$| labeled more cases.

Table 6. Contingency table of nonparametric vs. parametric estimates of PA cases in intervals of estimated probabilities for the two PA phenotyping models based on anchor variables |$S_1$| and |$S_2$|, respectively.

              Anchor |$S_1$|    Anchor |$S_2$|
Interval     Npara    Para     Npara    Para
0.0-0.1         12      11         7       7
0.1-0.2          8       6         4       4
0.2-0.3          3       6         6       3
0.3-0.4          7       5         1       3
0.4-0.5          7       6         5       3
0.5-0.6          8       9         3       5
0.6-0.7          0       5         3       3
0.7-0.8          8      10         5       5
0.8-0.9         10      13         6       7
0.9-1.0        106      99        85      84
Table 7. Calibration results for the PA phenotyping models built upon the two anchors when |$q$| is assumed known (|$5\%$|) or unknown, with the number of subgroups |$K$| taken to be 10.

                                      Anchor |$S_1$|      Anchor |$S_2$|
                                     Stat.   p-value     Stat.   p-value
When |$q$| is known (|$T^{q}$|)      5.84      0.83       5.42      0.86
When |$q$| is unknown (|$T^{nq}$|)   5.60      0.78       8.05      0.53
Table 8. Estimated accuracy measures and their associated empirical standard errors (in parentheses) for identifying PA patients using our proposed method for positive-only data. Results are based on the mean over 1000 bootstrap replicates.

                            Anchor |$S_1$|                                   Anchor |$S_2$|
Cutoff     PPV      NPV      TPR      FPR      AUC        PPV      NPV      TPR      FPR      AUC
0.1       0.413    0.997    0.890    0.030    0.983      0.395    0.998    0.906    0.028    0.986
         (0.124)  (0.001)  (0.041)  (0.005)  (0.008)    (0.072)  (0.001)  (0.025)  (0.004)  (0.005)
0.2       0.559    0.997    0.850    0.016               0.533    0.997    0.872    0.015
         (0.128)  (0.001)  (0.054)  (0.003)             (0.077)  (0.001)  (0.030)  (0.002)
0.3       0.666    0.996    0.815    0.009               0.639    0.997    0.844    0.010
         (0.122)  (0.001)  (0.064)  (0.002)             (0.074)  (0.001)  (0.034)  (0.002)
0.4       0.755    0.995    0.782    0.006               0.732    0.996    0.817    0.006
         (0.110)  (0.001)  (0.071)  (0.001)             (0.071)  (0.001)  (0.039)  (0.001)
0.5       0.828    0.994    0.748    0.003               0.810    0.996    0.787    0.004
         (0.093)  (0.002)  (0.078)  (0.001)             (0.063)  (0.001)  (0.042)  (0.001)
0.6       0.882    0.993    0.707    0.002               0.871    0.995    0.753    0.002
         (0.078)  (0.002)  (0.087)  (0.001)             (0.056)  (0.001)  (0.048)  (0.001)
0.7       0.915    0.992    0.660    0.001               0.906    0.994    0.708    0.001
         (0.065)  (0.002)  (0.098)  (0.001)             (0.052)  (0.001)  (0.054)  (0.001)
0.8       0.922    0.991    0.601    0.001               0.919    0.993    0.649    0.001
         (0.058)  (0.002)  (0.107)  (0.001)             (0.047)  (0.001)  (0.060)  (0.001)
0.9       0.899    0.989    0.518    0.001               0.915    0.991    0.571    0.001
         (0.068)  (0.003)  (0.120)  (0.001)             (0.053)  (0.002)  (0.066)  (0.001)

5. Discussion

Traditional methods for model validation depend on a fully labeled validation set. Our proposed methods greatly alleviate the need for labor-intensive manual labeling, because they do not require labeled controls and rely on the specification of an anchor variable to label the set of cases in positive-only data. Collectively, the anchor variable framework offers a highly efficient approach to EHR phenotyping as well as a cost-effective data resource for model validation. As anchor variables continue to be developed for a wide variety of phenotypes, we expect that the methods for model validation proposed here, together with the maximum likelihood approach for model development proposed previously (Zhang and others, 2019), will not only advance accurate EHR phenotyping algorithm development but also facilitate generalizability, since a well-chosen anchor variable may be more easily transferred across institutions.

Reasons for poor calibration can be complicated in real-life problems. In this work, we omitted predictors from the data-generating model to induce ill-calibrated models, and used the resulting simulation studies to demonstrate the validity of our proposed test statistics in terms of size and power. Our results showed that the test statistics are reasonably sensitive to the degree of miscalibration. More complicated settings, such as plasmode simulations leveraging existing data sets, can be studied in future research.

Unlike the Hosmer–Lemeshow test, whose result depends markedly on the number of risk subgroups, our proposed calibration tests are robust to the number of subgroups because the statistics |$T^q$| and |$T^{nq}$| follow |$\chi^2$| distributions as long as the variance-covariance matrix |${\bf V}_K$| or |${\bf V}_{K-1}$| is well approximated. Thus sparse cells should be avoided in |${\bf U}_{K}$| or |${\bf U}_{K-1}$|, which is why we proposed grouping based on fixed cutpoints. In initial attempts at grouping based on percentiles of the fitted probabilities, we observed that the expected number of cases can be very low in the lower percentile intervals, especially when the phenotype prevalence is low. This could lead to empty cells in |${\bf U}_{K}$| or |${\bf U}_{K-1}$|, compromising the estimation of |${\bf V}_K$| or |${\bf V}_{K-1}$| and thereby degrading the performance of the proposed calibration tests. We recommend that the within-group expected number of cases be taken into consideration when selecting the number of subgroups |$K$|. Similar to the Hosmer–Lemeshow goodness-of-fit test, the power of our proposed calibration tests can be low when the sample size is not large. In such situations, we recommend examining the contingency table rather than relying solely on the p-value of the calibration test; as shown in Table 1, poor calibration can be revealed by the disagreement between the nonparametric and parametric estimates of the number of cases.

The proposed calibration statistics asymptotically follow |$\chi^2$| distributions when the fitted model is correctly specified, provided that the anchor variable is well defined, i.e., the anchor variable has perfect PPV and constant sensitivity. Thus, with a validated anchor variable, small p-values of the proposed calibration tests indicate that the model is a poor fit. Noting that the calibration results can be affected by either model misspecification or invalid anchor variable specification, further research is needed on extensions that discriminate between the two causes.

6. Software

Software in the form of R code, together with a sample input dataset and complete documentation, is available at https://github.com/zhanglingjiao/Calibration-paper or upon request from the corresponding authors ([email protected] or [email protected]).

7. Supplementary material

Supplementary material is available online at http://biostatistics.oxfordjournals.org.

8. Acknowledgments

Conflict of Interest: None declared.

References

Claesen, M., Davis, J., De Smet, F. and De Moor, B. (2015). Assessing binary classifiers using only positive and unlabeled data.

Elkan, C. and Noto, K. (2008). Learning classifiers from only positive and unlabeled data. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, Nevada, USA. ACM. pp. 213–220.

Halpern, Y., Choi, Y., Horng, S. and Sontag, D. (2014). Using anchors to estimate clinical state without labeled data. In: AMIA Annual Symposium Proceedings, American Medical Informatics Association, 2014, 606–615.

Halpern, Y., Horng, S., Choi, Y. and Sontag, D. (2016). Electronic medical record phenotyping using the anchor and learn framework. Journal of the American Medical Informatics Association 23, 731–740.

Hong, C., Liao, K. P. and Cai, T. (2019). Semi-supervised validation of multiple surrogate outcomes with application to electronic medical records phenotyping. Biometrics 75, 78–89.

Hosmer, D. W., Lemeshow, S. and Klar, J. (1988). Goodness-of-fit testing for the logistic regression model when the estimated probabilities are small. Biometrical Journal 30, 911–924.

Hosmer, D. W. Jr, Lemeshow, S. and Sturdivant, R. X. (2013). Applied Logistic Regression, Vol. 398. John Wiley & Sons.

Mulatero, P., Monticone, S., Burrello, J., Veglio, F., Williams, T. A. and Funder, J. (2016). Guidelines for primary aldosteronism: uptake by primary care physicians in Europe. Journal of Hypertension 34, 2253–2257.

Denolle, T., Chatellier, G., Julien, J., Battaglia, C., Luo, P. and Plouin, P. (1993). Left ventricular mass and geometry before and after etiologic treatment in renovascular hypertension, aldosterone-producing adenoma, and pheochromocytoma. American Journal of Hypertension 6, 907–913.

Pathak, J., Kho, A. N. and Denny, J. C. (2013). Electronic health records-driven phenotyping: challenges, recent advances, and perspectives. Journal of the American Medical Informatics Association 20(e2), e206–e211.

Rossi, G. P., Sacchetto, A., Visentin, P., Canali, C., Graniero, G. R., Palatini, P. and Pessina, A. C. (1996). Changes in left ventricular anatomy and function in hypertension and primary aldosteronism. Hypertension 27, 1039–1045.

Shigematsu, Y., Hamada, M., Okayama, H., Hara, Y., Hayashi, Y., Kodama, K., Kohara, K. and Hiwada, K. (1997). Left ventricular hypertrophy precedes other target-organ damage in primary aldosteronism. Hypertension 29, 723–727.

Shivade, C., Raghavan, P., Fosler-Lussier, E., Embi, P. J., Elhadad, N., Johnson, S. B. and Lai, A. M. (2013). A review of approaches to identifying patient phenotype cohorts using electronic health records. Journal of the American Medical Informatics Association 21, 221–230.

Song, M., Kraft, P., Joshi, A. D., Barrdahl, M. and Chatterjee, N. (2014). Testing calibration of risk models at extremes of disease risk. Biostatistics 16, 143–154.

Tsiatis, A. A. (1980). A note on a goodness-of-fit test for the logistic regression model. Biometrika 67, 250–251.

Wachtel, H., Zaheer, S., Shah, P. K., Trerotola, S. O., Karakousis, G. C., Roses, R. E., Cohen, D. L. and Fraker, D. L. (2016). Role of adrenal vein sampling in primary aldosteronism: impact of imaging, localization, and age. Journal of Surgical Oncology 113, 532–537.

Wang, L., Schnall, J., Small, A., Hubbard, R. A., Moore, J. H., Damrauer, S. M. and Chen, J. (2020). Case contamination in electronic health records based case-control studies. Biometrics.

Windmeijer, F. A. G. (1990). The asymptotic distribution of the sum of weighted squared residuals in binary choice models. Statistica Neerlandica 44, 69–78.

Yu, S., Chakrabortty, A., Liao, K. P., Cai, T., Ananthakrishnan, A. N., Gainer, V. S., Churchill, S. E., Szolovits, P., Murphy, S. N., Kohane, I. S. and others. (2016). Surrogate-assisted feature extraction for high-throughput phenotyping. Journal of the American Medical Informatics Association 24, e143–e149.

Zhang, L., Ding, X., Ma, Y., Muthu, N., Ajmal, I., Moore, J. H., Herman, D. and Chen, J. (2019). A maximum likelihood approach for electronic health record phenotyping using positive and unlabeled patients. Journal of the American Medical Informatics Association 27(1), 119–126.
