Summary

Divide-and-conquer (DAC) is a commonly used strategy to overcome the challenges of extraordinarily large data, by first breaking the dataset into a series of data blocks and then combining results from the individual blocks to obtain a final estimate. Various DAC algorithms have been proposed to fit sparse predictive regression models in the |$L_1$| regularization setting. However, many existing DAC algorithms remain computationally intensive when the sample size and the number of candidate predictors are both large. In addition, no existing DAC procedure provides inference for quantifying the accuracy of risk prediction models. In this article, we propose a screening and one-step linearization infused DAC (SOLID) algorithm to fit sparse logistic regression to massive datasets, by integrating the DAC strategy with a screening step and sequences of linearization. This enables us to maximize the likelihood with only the selected covariates and to perform penalized estimation via a fast approximation to the likelihood. To assess the accuracy of a predictive regression model, we develop a modified cross-validation (MCV) that utilizes the side products of SOLID, substantially reducing the computational burden. Compared with existing DAC methods, the MCV procedure is the first to make inference on accuracy. Extensive simulation studies suggest that the proposed SOLID and MCV procedures substantially outperform the existing methods with respect to computational speed and achieve statistical efficiency similar to that of the full sample-based estimator. We also demonstrate that the proposed inference procedure provides valid interval estimators. We apply the proposed SOLID procedure to develop and validate a classification model for disease diagnosis using narrative clinical notes based on electronic medical record data from Partners HealthCare.

1. Introduction

Large-scale healthcare data, including those from electronic medical records (EMR) and insurance claims, open tremendous opportunities for novel discoveries. For example, risk prediction models can be developed using EMR data and incorporated into the healthcare system as clinical decision support. However, developing risk prediction models using massive and rich datasets is computationally challenging since often both the sample size, |$n_0$|⁠, and the number of candidate predictors, |$p$|⁠, are large. To develop an accurate yet clinically interpretable risk model in such a high dimensional setting, we may fit regularized regression models with various penalties that can simultaneously remove non-informative predictors and estimate the effects of the informative predictors (e.g., Tibshirani, 1996; Fan and Li, 2001; Zou and Hastie, 2005; Zou, 2006). However, when |$n_0$| is extraordinarily large, directly fitting a sparse predictive model is not computationally feasible.

One strategy to overcome the computational difficulty is to employ a divide-and-conquer (DAC) scheme. Standard DAC procedures divide the full sample into |$K$| subsets, solve the optimization problem within each subset, and then combine the subset-specific estimates to form a final DAC estimate. A wide range of DAC algorithms have been proposed to fit various prediction models, but we focus our discussion on |$L_1$|-type penalized regression due to its advantage in interpretability. Chen and Xie (2014) proposed a DAC algorithm to fit penalized GLM. The algorithm obtains a sparse GLM estimate for each subset and then combines the subset-specific estimates by majority voting and averaging. Wang and others (2014) proposed a median selection subset aggregation estimator by first fitting |$L_1$| penalized linear regression in each data block to select informative features, and then performing a DAC linear regression with only the selected features. Lee and others (2017) proposed a one-shot de-biased lasso approach, obtaining the de-biased lasso estimate (Van de Geer and others, 2014) at each data block and then simply averaging the local de-biased lasso estimates. Tang and others (2016) also obtained the de-biased lasso estimate of each data block, but then used a robust meta-analysis-type approach through the confidence distribution approach (Xie and others, 2011) to combine the local estimates. He and others (2016) proposed a sparse meta-analysis approach to integrate information across sites for the low-dimensional setting with small |$p$|⁠. While the above algorithms are effective in reducing the computational burden compared to fitting a penalized regression model to the full data, they remain computationally intensive because they require |$K$| penalized estimation procedures. It is of great interest to further reduce the computational burden when fitting extraordinarily large data. Recently, building on the work of Wang and Leng (2007), Wang and others (2019) proposed an efficient DAC algorithm to fit sparse Cox regression by using a one-step linear approximation followed by a least square approximation (LSA), without requiring a penalized Cox model to be fit on all |$K$| sets. Although this approach can be adapted for logistic regression, it is not computationally efficient when |$p$| is large.

After a risk prediction model is developed, a critical follow-up step is to evaluate its prediction performance in order to determine whether it is feasible to adopt in practice. A wide range of accuracy parameters, such as the receiver operating characteristic (ROC) curve, have been proposed as performance measures (Pepe, 2003). Making inference about the accuracy parameters with an exceedingly large dataset is also computationally challenging, especially if the accuracy estimators are non-smooth functionals. Under standard settings with moderate |$n_0$| and |$p$|⁠, to make inference about the prediction accuracy of a fitted model, cross-validation (CV) is often used to correct for overfitting bias, and resampling methods are used for variance estimation (e.g., Tian and others, 2007; Uno and others, 2007). Both CV and resampling become computationally infeasible for exceedingly large datasets.

In this article, we propose a screening and one-step linearization infused DAC (SOLID) procedure for efficiently estimating a sparse prediction model. The SOLID procedure attains computational and statistical efficiency by integrating DAC with an initial screening for an active set of predictors and sequences of one-step linear approximations. The main advantage of the SOLID procedure compared with existing DAC procedures is that it substantially increases computational efficiency when |$n_0$| and |$p$| are both large. To efficiently make inference about the accuracy of the prediction model, we develop a modified cross-validation (MCV), which achieves the same asymptotic efficiency as the standard CV. The MCV procedure directly utilizes side products from the SOLID procedure to estimate the corresponding accuracy parameters without re-applying the full SOLID procedure in each random split, while still adjusting for overfitting. Specifically, during the SOLID procedure, summary statistics such as the score function and the Hessian matrix are obtained for each data block. At the |$k$|th fold of a |$K$|-fold MCV, instead of applying the full SOLID procedure to the |$K-1$| remaining data blocks (leaving out the |$k$|th block), the model estimate is approximated by combining the already available summary statistics of those |$K-1$| blocks. To the best of our knowledge, the proposed MCV procedure is the first effort to make inference on accuracy and to adjust for overfitting under the DAC framework.

The rest of the article is organized as follows. In Section 2, we detail the SOLID procedure and the MCV. We also propose an efficient standard error estimator for the non-smooth accuracy functionals. In Section 3, we present simulation results demonstrating the superiority of the proposed methods compared to existing methods. In Section 4, we apply SOLID and MCV to Partners HealthCare EMR data to assign ICD codes of depression to patient visits. We conclude with discussions in Section 5.

2. Methods

2.1. Notations and settings

Let |$Y$| denote the binary outcome, and |${\bf X}$| denote the |$(p+1)\times 1$| vector of predictors, where we include 1 as the first element of |${\bf X}$|⁠. Suppose the full data consist of |$n_0$| independent and identically distributed realizations of |${\bf D} = (Y,{\bf X}^{{\sf \scriptscriptstyle{T}}})^{{\sf \scriptscriptstyle{T}}}$|⁠, denoted by |$\mathscr{D}_{\sf \scriptscriptstyle +} = \{{\bf D}_i, i = 1,..., n_0\}$|⁠. We denote the index set for the full data by |$\Omega_{\sf \scriptscriptstyle +} = \{1,..., n_0\}$|⁠. We randomly partition |$\mathscr{D}_{\sf \scriptscriptstyle +}$| into |$K$| subsets with the |$k$|th subset denoted by |$\mathscr{D}_k = \{{\bf D}_i, i \in \Omega_k\}$|⁠. Without loss of generality, we assume that |$n = n_0/K$| is an integer and that the index set for the subset |$k$| is |$\Omega_k = \{(k-1)n+1,..., kn\}$|⁠. For any index set |$\Omega$|⁠, we denote the size of |$\Omega$| by |$n_{\Omega}$| with |$n_{\Omega}=n_0$| if |$\Omega=\Omega_{\sf \scriptscriptstyle +}$| and |$n_{\Omega}=n$| if |$\Omega=\Omega_k$|⁠.
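As a concrete illustration, a minimal R sketch of this random partition is given below; the data frame dat and the choice K = 100 are illustrative assumptions, not part of the formal setup.

    # Random partition of the full data into K equally sized blocks (assumes n0 is a multiple of K);
    # `dat` and K = 100 are illustrative.
    set.seed(1)
    n0 <- nrow(dat); K <- 100; n <- n0 / K
    perm  <- sample(n0)                          # random permutation of row indices
    Omega <- split(perm, rep(1:K, each = n))     # Omega[[k]] corresponds to the index set Omega_k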

To develop a risk prediction model for |$Y$| based on |${\bf X}$|⁠, we assume that
(2.1) |$P(Y=1\mid {\bf X}) = g(\boldsymbol{\beta}_0^{{\sf \scriptscriptstyle{T}}}{\bf X}), \qquad g(x) = e^x/(1+e^x),$|
where |$\boldsymbol{\beta}_0$| is sparse with the size of the active set |$\mathcal{A} = \{\jmath: \boldsymbol{\beta}_{0\jmath} \ne 0\}$|⁠, denoted by |$q = \mbox{Card}(\mathcal{A})$|⁠. We allow diverging |$p$| with |$\lim_{n_0\to\infty}\log(p)/\log(n_0) = \nu$| for some |$\nu \in (0,1/2)$| but assume that |$q$| is fixed for simplicity. With the full data |$\mathscr{D}_{\sf \scriptscriptstyle +}$|⁠, an efficient sparse estimator for |$\boldsymbol{\beta}_0$| under (2.1) can be obtained using the maximizer of the adaptive LASSO penalized log-likelihood (Zou, 2006),
(2.2) |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}} = \mathop{\mbox{argmax}}_{\boldsymbol{\beta}}\left\{\widehat{\ell}_{\Omega_{\sf \scriptscriptstyle +}}(\boldsymbol{\beta}) - \lambda_{\Omega_{\sf \scriptscriptstyle +}}\sum_{\jmath=1}^p |\beta_{\jmath}|/|\widetilde{\beta}_{\Omega_{\sf \scriptscriptstyle +},\jmath}|^{\gamma}\right\},$|
where for any index set |$\Omega$|⁠,
(2.3) |$\widehat{\ell}_{\Omega}(\boldsymbol{\beta}) = n_{\Omega}^{-1}\sum_{i\in\Omega}\left[Y_i\boldsymbol{\beta}^{{\sf \scriptscriptstyle{T}}}{\bf X}_i - \log\{1+\exp(\boldsymbol{\beta}^{{\sf \scriptscriptstyle{T}}}{\bf X}_i)\}\right],$|
|$\widetilde{\boldsymbol{\beta}}_{\Omega} = (\widetilde{\beta}_{\Omega,1},..., \widetilde{\beta}_{\Omega,p})^{{\sf \scriptscriptstyle{T}}}= \mathop{\mbox{argmax}}_{\boldsymbol{\beta}}{\widehat{\ell}_{\Omega}(\boldsymbol{\beta})}$|⁠, |$\lambda_{\Omega_{\sf \scriptscriptstyle +}}\ge 0$| is a tuning parameter, and |$\gamma>2\nu/(1-\nu)$|⁠.
Although alternative estimators based on other penalty functions, such as the adaptive elastic net (Zou and Zhang, 2009) and the smoothly clipped absolute deviation (Fan and Li, 2001), can attain similar performance, we focus on the adaptive LASSO penalized estimator for simplicity. By setting the ridge penalty to 0, the adaptive elastic net reduces to the adaptive LASSO, and the results of Zou and Zhang (2009) indicate that, when |$n^{\frac{1}{2}}_0\lambda_{\Omega_{\sf \scriptscriptstyle +}} \to 0$| and |$n_0^{(1-\nu)(1+\gamma)/2} \lambda_{\Omega_{\sf \scriptscriptstyle +}} \to \infty$|⁠, |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}$| achieves variable selection consistency, i.e., the estimated active set |$\widehat{{\cal A}}_{\sf \scriptscriptstyle +}=\{\jmath:\widehat{\beta}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +},\jmath}\ne 0 \}$| satisfies |$P(\widehat{{\cal A}}_{\sf \scriptscriptstyle +} = \mathcal{A}) \to 1$|⁠, and the oracle property holds in that the asymptotic distribution of |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}^{\scriptscriptstyle{{\cal A}}}$| is the same as that of the maximum likelihood estimator (MLE) constrained on the parameter space |$\{\boldsymbol{\beta}: \beta_j = 0, j \notin \mathcal{A}\}$|⁠. That is, |$n^{\frac{1}{2}}_0(\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}^{\scriptscriptstyle{{\cal A}}} - \boldsymbol{\beta}_0^{\scriptscriptstyle{{\cal A}}})$| converges in distribution to |$\mathcal{N}\left(\textbf{0},\mathbb{A}^{\scriptscriptstyle{{\cal A}}}(\boldsymbol{\beta}_0)^{-1}\right)$|⁠, where for any set |$\mathcal{A}$|⁠, |$\mathbf{G}^{\mathcal{A}}$| denotes the subvector of |$\mathbf{G}$| corresponding to |$\mathcal{A}$| if |$\mathbf{G}$| is a vector and |$\mathbf{G}^{\mathcal{A}} = [G_{ll'}]_{l \in \mathcal{A},l' \in \mathcal{A}}$| if |$\mathbf{G}$| is a matrix, |$\mathbb{A}(\boldsymbol{\beta}) = E\left[g(\boldsymbol{\beta}^{{\sf \scriptscriptstyle{T}}}{\bf X})\{1-g(\boldsymbol{\beta}^{{\sf \scriptscriptstyle{T}}}{\bf X})\}{\bf X}{\bf X}^{{\sf \scriptscriptstyle{T}}}\right]$| is the information matrix under (2.1), and, for any index set |$\Omega$|⁠, its empirical counterpart and the corresponding score function are |$\widehat{\mathbb{A}}_{\Omega}(\boldsymbol{\beta}) = n_{\Omega}^{-1}\sum_{i\in\Omega} g(\boldsymbol{\beta}^{{\sf \scriptscriptstyle{T}}}{\bf X}_i)\{1-g(\boldsymbol{\beta}^{{\sf \scriptscriptstyle{T}}}{\bf X}_i)\}{\bf X}_i{\bf X}_i^{{\sf \scriptscriptstyle{T}}}$| and |$\widehat{{\bf U}}_{\Omega}(\boldsymbol{\beta}) = n_{\Omega}^{-1}\sum_{i\in\Omega}\{Y_i - g(\boldsymbol{\beta}^{{\sf \scriptscriptstyle{T}}}{\bf X}_i)\}{\bf X}_i$|⁠.

The oracle property holds when the asymptotic distribution of the estimator is the same as the asymptotic distribution of the MLE on the true support. When |$n_0$| is exceedingly large, obtaining |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}$| is not computationally feasible. Our first goal is to employ SOLID to obtain a computationally efficient estimator that achieves the same efficiency as the full sample estimator |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}$|⁠.
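For concreteness, the following R sketch computes a full-sample adaptive LASSO fit of the form (2.2); the use of glmnet, a ridge initial estimator, and cross-validated tuning (rather than the BIC criterion of Section 2.4) are illustrative choices and not the authors' implementation.

    library(glmnet)
    # Full-sample adaptive LASSO logistic regression in the spirit of (2.2).
    # X: n0 x p matrix of predictors (no intercept column), y: 0/1 outcome.
    init <- glmnet(X, y, family = "binomial", alpha = 0, lambda = 1e-4)  # rough initial estimate
    w    <- 1 / abs(as.numeric(coef(init))[-1])                          # adaptive weights, gamma = 1
    fit  <- glmnet(X, y, family = "binomial", alpha = 1, penalty.factor = w)
    cv   <- cv.glmnet(X, y, family = "binomial", penalty.factor = w)
    beta_full <- as.numeric(coef(fit, s = cv$lambda.min))                # (p+1)-vector, intercept first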

Our second goal is to develop procedures for making inference about the accuracy of |$\pi_{\scriptscriptstyle \widehat{\boldsymbol{\beta}}}^{\sf \scriptscriptstyle{new}} = g(\widehat{\boldsymbol{\beta}}^{{\sf \scriptscriptstyle{T}}}{\bf X}^{\sf \scriptscriptstyle{new}})$| in predicting the outcome |$Y^{\sf \scriptscriptstyle{new}}$| with covariates |${\bf X}^{\sf \scriptscriptstyle{new}}$| for a future observation. Here, we use a few commonly used accuracy measures as examples, although our procedure can be applied to a broad class of accuracy parameters. For any threshold value |$c$|⁠, the expected true positive rate (TPR) and false positive rate (FPR) of |$I(\pi_{\scriptscriptstyle \widehat{\boldsymbol{\beta}}}^{\sf \scriptscriptstyle{new}} \ge c)$| in classifying |$Y^{\sf \scriptscriptstyle{new}}$| are respectively defined as |$\text{TPR}(c)=E\{\text{TPR}(c; \widehat{\boldsymbol{\beta}})\}$| and |$\text{FPR}(c) = E\{\text{FPR}(c; \widehat{\boldsymbol{\beta}})\}$|⁠, where the expectation is taken over the distribution of |$\widehat{\boldsymbol{\beta}}$|⁠, |$\text{TPR}(c; \boldsymbol{\beta}) = P(\pi_{\scriptscriptstyle \boldsymbol{\beta}}^{\sf \scriptscriptstyle{new}} \ge c \mid Y^{\sf \scriptscriptstyle{new}} = 1)$|⁠, and |$\text{FPR}(c; \boldsymbol{\beta}) = P(\pi_{\scriptscriptstyle \boldsymbol{\beta}}^{\sf \scriptscriptstyle{new}} \ge c \mid Y^{\sf \scriptscriptstyle{new}} = 0)$|⁠.

One may summarize the trade-off between the TPR and FPR functions based on the ROC curve, |$\text{ROC}(u)=\text{TPR}\left\{\text{FPR}^{-1}(u)\right\}.$| Additionally, the area under the ROC curve, |$\text{AUC}=\int_0^1 \text{ROC}(u){\rm d}u$|⁠, summarizes the overall prediction performance of |$\pi_{\scriptscriptstyle \widehat{\boldsymbol{\beta}}}^{\sf \scriptscriptstyle{new}}$|⁠. With a selected threshold |$c$|⁠, it is also often desirable to evaluate the positive predictive value (PPV) and negative predictive value (NPV) of the binary rule |$I(\pi_{\scriptscriptstyle \widehat{\boldsymbol{\beta}}}^{\sf \scriptscriptstyle{new}}\ge c)$|⁠, defined as |$\text{PPV}(c)=E\{\text{PPV}(c; \widehat{\boldsymbol{\beta}})\}$| and |$\text{NPV}(c)=E\{\text{NPV}(c; \widehat{\boldsymbol{\beta}})\}$|⁠, where |$\text{PPV}(c; \boldsymbol{\beta}) = P(Y^{\sf \scriptscriptstyle{new}}=1|\pi_{\scriptscriptstyle \boldsymbol{\beta}}^{\sf \scriptscriptstyle{new}} \ge c)$| and |$\text{NPV}(c; \boldsymbol{\beta})=P(Y^{\sf \scriptscriptstyle{new}}=0|\pi_{\scriptscriptstyle \boldsymbol{\beta}}^{\sf \scriptscriptstyle{new}}< c).$| For simplicity, we will use |$\text{TPR}(c)$| to illustrate the proposed method in this section, but note that estimation procedures can be similarly formed for other accuracy measures. We propose an efficient MCV procedure to construct bias-corrected point and interval estimators for |$\text{TPR}$|⁠.
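To make these definitions concrete, the following minimal R sketch computes empirical counterparts of the TPR, FPR, PPV, NPV, and the AUC on a validation set; the vectors pihat and y and the coarse threshold grid are illustrative assumptions, not part of the proposed procedure.

    # Empirical accuracy measures for predicted probabilities `pihat` and 0/1 outcomes `y`
    # on a validation set (illustrative names).
    TPR <- function(c, pihat, y) mean(pihat >= c & y == 1) / mean(y == 1)
    FPR <- function(c, pihat, y) mean(pihat >= c & y == 0) / mean(y == 0)
    PPV <- function(c, pihat, y) mean(y[pihat >= c] == 1)
    NPV <- function(c, pihat, y) mean(y[pihat <  c] == 0)
    # ROC curve on a coarse threshold grid, and the AUC via the rank (Mann-Whitney) identity
    cs  <- quantile(pihat, probs = seq(0.99, 0.01, by = -0.01))
    roc <- cbind(fpr = sapply(cs, FPR, pihat, y), tpr = sapply(cs, TPR, pihat, y))
    npos <- sum(y == 1); nneg <- sum(y == 0)
    auc  <- (sum(rank(pihat)[y == 1]) - npos * (npos + 1) / 2) / (npos * nneg)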

2.2. SOLID algorithm for approximating |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}$|

Our SOLID algorithm is motivated by the LSA proposed in Wang and Leng (2007). We first note that |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}$| can be approximated by

|$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}^{\sf\scriptscriptstyle lin} = \mathop{\mbox{argmin}}_{\boldsymbol{\beta}}\left\{\tfrac{1}{2}(\boldsymbol{\beta}-\widetilde{\boldsymbol{\beta}}_{\Omega_{\sf \scriptscriptstyle +}})^{{\sf \scriptscriptstyle{T}}}\widehat{\mathbb{A}}_{\Omega_{\sf \scriptscriptstyle +}}(\widetilde{\boldsymbol{\beta}}_{\Omega_{\sf \scriptscriptstyle +}})(\boldsymbol{\beta}-\widetilde{\boldsymbol{\beta}}_{\Omega_{\sf \scriptscriptstyle +}}) + \lambda_{\Omega_{\sf \scriptscriptstyle +}}\sum_{\jmath=1}^p |\beta_{\jmath}|/|\widetilde{\beta}_{\Omega_{\sf \scriptscriptstyle +},\jmath}|^{\gamma}\right\}.$|

Following similar arguments given in Caner and Zhang (2014), one may show that |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}^{\sf\scriptscriptstyle lin}$| achieves the same variable selection consistency as |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}$| and that |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}^{\sf\scriptscriptstyle lin,{\scriptscriptstyle{\cal A}}}$| has the same limiting distribution as |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}^{\scriptscriptstyle{{\cal A}}}$|⁠. This suggests that an estimator can recover the distribution of |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}$| if we can construct accurate DAC approximations to |$\widetilde{\boldsymbol{\beta}}_{\Omega_{\sf \scriptscriptstyle +}}$| and |$\widehat{\mathbb{A}}_{\Omega_{\sf \scriptscriptstyle +}}(\widetilde{\boldsymbol{\beta}}_{\Omega_{\sf \scriptscriptstyle +}})$|⁠. To this end, we propose the SOLID procedure, which requires two main steps: (i) screening for an active set of predictors; (ii) constructing a linearized adaptive LASSO estimator restricted to the active set. In both steps, linearization via the LSA is used.
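The following R helper sketches the LSA device under the notation above: the quadratic approximation is rewritten as a penalized least squares problem on a pseudo design built from a Cholesky factor of the information matrix. The use of glmnet, the unpenalized intercept, and glmnet's lambda scale are illustrative assumptions rather than the paper's implementation.

    library(glmnet)
    # LSA helper: approximate the adaptive LASSO using only an initial estimate `btilde`
    # and an information-matrix estimate `Ahat` (both for the (p+1)-vector, intercept first).
    lsa_alasso <- function(btilde, Ahat, lambda, gamma = 1) {
      R  <- chol(Ahat)                        # Ahat = t(R) %*% R
      yp <- as.numeric(R %*% btilde)          # so ||yp - R b||^2 = (b - btilde)' Ahat (b - btilde)
      w  <- c(0, 1 / abs(btilde[-1])^gamma)   # intercept unpenalized; adaptive weights otherwise
      fit <- glmnet(R, yp, family = "gaussian", alpha = 1, lambda = lambda,
                    penalty.factor = w, intercept = FALSE, standardize = FALSE)
      as.numeric(coef(fit))[-1]               # drop glmnet's own (fixed-at-zero) intercept
    }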

The screening step consists of four sub-steps:

(1.1) use subset |$\mathscr{D}_1$| to obtain a ridge-penalized logistic regression estimator, denoted by |$\widetilde{\boldsymbol{\beta}}_{\Omega_1}^{\sf \scriptscriptstyle rid}$|⁠;
(1.2) obtain the one-step linear approximation to |$\widetilde{\boldsymbol{\beta}}_{\Omega_{\sf \scriptscriptstyle +}}$| as |$\widetilde{\boldsymbol{\beta}}_{\Omega_{\sf \scriptscriptstyle +}}^{\sf\scriptscriptstyle lin,1} = \widetilde{\boldsymbol{\beta}}_{\Omega_1}^{\sf \scriptscriptstyle rid} + \widehat{\mathbb{A}}_{\sf \scriptscriptstyle{DAC}}(\widetilde{\boldsymbol{\beta}}_{\Omega_1}^{\sf \scriptscriptstyle rid})^{-1}\widehat{{\bf U}}_{\sf \scriptscriptstyle{DAC}}(\widetilde{\boldsymbol{\beta}}_{\Omega_1}^{\sf \scriptscriptstyle rid})$|⁠, where |$\widehat{{\bf U}}_{\sf \scriptscriptstyle{DAC}}(\boldsymbol{\beta}) = K^{-1}\sum_{k=1}^K \widehat{{\bf U}}_{\Omega_k}(\boldsymbol{\beta})$| and |$\widehat{\mathbb{A}}_{\sf \scriptscriptstyle{DAC}}(\boldsymbol{\beta}) = K^{-1}\sum_{k=1}^K \widehat{\mathbb{A}}_{\Omega_k}(\boldsymbol{\beta})$|⁠;
(1.3) apply the LSA for screening, i.e., obtain the adaptive LASSO estimator |$\widehat{\boldsymbol{\beta}}_{\sf\scriptscriptstyle screen}$| based on the quadratic approximation of the log-likelihood around |$\widetilde{\boldsymbol{\beta}}_{\Omega_{\sf \scriptscriptstyle +}}^{\sf\scriptscriptstyle lin,1}$| with tuning parameter |$\lambda_{\sf\scriptscriptstyle screen}$|⁠;
(1.4) screen for an active set, |$\widehat{{\cal A}} = \{\jmath: \widehat{\beta}_{\sf\scriptscriptstyle screen,\jmath} \ne 0\}$|⁠, and obtain |$\widehat{\boldsymbol{\beta}}_{\sf\scriptscriptstyle screen}^{\odot {\scriptscriptstyle \widehat{\cal A}}} = \widehat{\boldsymbol{\beta}}_{\sf\scriptscriptstyle screen} \odot {\bf I}(\widehat{{\cal A}})$|⁠, where |${\bf I}(\widehat{{\cal A}}) = [1, I(1 \in \widehat{{\cal A}}), \cdots, I(p \in \widehat{{\cal A}})]^{{\sf \scriptscriptstyle{T}}}$| and |$\odot$| denotes the elementwise product (a schematic R sketch of steps (1.1)-(1.4) is given after this list).
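The sketch below renders the screening step schematically; the names Omega, X, y, and lambda_screen, and the reuse of the lsa_alasso() helper from the previous sketch, are assumptions.

    library(glmnet)
    expit <- function(u) 1 / (1 + exp(-u))
    Xi <- cbind(1, X)                                            # design matrix with intercept
    score_block <- function(idx, b) crossprod(Xi[idx, ], y[idx] - expit(Xi[idx, ] %*% b)) / length(idx)
    info_block  <- function(idx, b) { ph <- as.numeric(expit(Xi[idx, ] %*% b))
      crossprod(Xi[idx, ] * (ph * (1 - ph)), Xi[idx, ]) / length(idx) }
    # (1.1) ridge estimator from the first block
    b_rid <- as.numeric(coef(glmnet(X[Omega[[1]], ], y[Omega[[1]]], family = "binomial",
                                    alpha = 0, lambda = 1e-4)))
    # (1.2) one-step update combining block-wise score functions and information matrices
    U_dac <- Reduce(`+`, lapply(Omega, score_block, b = b_rid)) / length(Omega)
    A_dac <- Reduce(`+`, lapply(Omega, info_block,  b = b_rid)) / length(Omega)
    b_lin1 <- as.numeric(b_rid + solve(A_dac, U_dac))
    # (1.3)-(1.4) LSA-based adaptive LASSO around b_lin1, then keep the non-zero components
    b_screen <- lsa_alasso(b_lin1, A_dac, lambda = lambda_screen)
    A_hat <- which(b_screen[-1] != 0)                            # estimated active set (predictor indices)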

The post-screening linearized adaptive LASSO fitting step consists of two main sub-steps:

(2.1) obtain the DAC approximated initial estimator, |$\widetilde{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{DAC}} = \widetilde{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{DAC}}^{\scriptscriptstyle [M]}$|⁠, where |$\widetilde{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{DAC}}^{\scriptscriptstyle [M]}$| is obtained iteratively by letting |$\widetilde{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{DAC}}^{\scriptscriptstyle [0]} = \widehat{\boldsymbol{\beta}}_{\sf\scriptscriptstyle screen}^{\odot {\scriptscriptstyle \widehat{\cal A}}}$| and, for |$m=1,\ldots,M$|⁠, |$\widetilde{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{DAC}}^{\scriptscriptstyle [m]} = \widetilde{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{DAC}}^{\scriptscriptstyle [m-1]} + \widehat{\mathbb{I}}_{\sf \scriptscriptstyle{DAC}}^{\odot {\scriptscriptstyle \widehat{\cal A}}}(\widetilde{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{DAC}}^{\scriptscriptstyle [m-1]})\,\widehat{{\bf U}}_{\sf \scriptscriptstyle{DAC}}(\widetilde{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{DAC}}^{\scriptscriptstyle [m-1]})$|⁠, where |$\widehat{\mathbb{I}}_{\sf \scriptscriptstyle{DAC}}^{\odot {\scriptscriptstyle \widehat{\cal A}}}(\boldsymbol{\beta})$| is the |$(p+1)\times (p+1)$| matrix whose submatrix corresponding to |$\widehat{{\cal A}}$| is |$\widehat{\mathbb{A}}^{\scriptscriptstyle \widehat{\mathcal{A}}}_{\sf \scriptscriptstyle{DAC}}(\boldsymbol{\beta} )^{-1}$| and whose remaining elements are 0;
(2.2) obtain the final SOLID estimator, |$\widehat{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{SOLID}}$|⁠, by applying the LSA-based adaptive LASSO to the quadratic approximation of the log-likelihood around |$\widetilde{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{DAC}}$|⁠, with all elements outside |$\widehat{{\cal A}}$| constrained to 0 (a schematic R sketch of steps (2.1) and (2.2) follows the next paragraph).

The SOLID estimator always takes the value 0 for elements deemed 0 in the screening step and only requires estimating |$\boldsymbol{\beta}$| over the elements in |$\widehat{{\cal A}}$|⁠. The linearization in step (2.2) allows us to solve the penalized regression using a pseudo likelihood based on only |$\mbox{Card}(\widehat{{\cal A}})$| pseudo data points, which is substantially smaller than the subset sample size |$n$| entering each subset likelihood. Compared to solving the full-sample problem (2.2) directly, the computation cost of SOLID is therefore substantially lower when |$n_0\gg p$| and |$p$| is not small.
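Continuing the previous sketch, steps (2.1) and (2.2) may be rendered as follows; M, lambda_solid, and the reuse of the helpers defined above are assumptions, and the final refit is only a schematic version of the LSA-based adaptive LASSO restricted to the active set.

    act <- c(1, A_hat + 1)                       # positions of the intercept and the active predictors
    b_dac <- b_screen * 0; b_dac[act] <- b_screen[act]
    for (m in seq_len(M)) {                      # (2.1) Newton-type DAC updates on the active set
      U <- Reduce(`+`, lapply(Omega, score_block, b = b_dac)) / length(Omega)
      A <- Reduce(`+`, lapply(Omega, info_block,  b = b_dac)) / length(Omega)
      b_dac[act] <- b_dac[act] + solve(A[act, act], U[act])
    }
    # (2.2) refit the LSA-based adaptive LASSO on the active components only
    b_solid <- b_dac * 0
    b_solid[act] <- lsa_alasso(b_dac[act], A[act, act], lambda = lambda_solid)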

Following similar arguments as given in Wang and others (2019), Zou and Zhang (2009), and Cui and others (2017), one may show that the proposed SOLID estimator attains the same oracle properties as those of |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}$| under the regularity conditions given in Zou and Zhang (2009), provided that |$K = o\left(n^{\frac{1}{2}}_0\right)$|⁠. In the screening step, the one-step estimator |$\widetilde{\boldsymbol{\beta}}_{\Omega_{\sf \scriptscriptstyle +}}^{\sf\scriptscriptstyle lin,1}$| has a convergence rate of |$\|\widetilde{\boldsymbol{\beta}}_{\Omega_{\sf \scriptscriptstyle +}}^{\sf\scriptscriptstyle lin,1}-\boldsymbol{\beta}_0\|_{\infty} = O_p\{p/n_{\Omega_1}+\sqrt{\log(p)/n_0}\}$|⁠, which is slower than the rate |$\|\widetilde{\boldsymbol{\beta}}_{\Omega_{\sf \scriptscriptstyle +}}-\boldsymbol{\beta}_0\|_{\infty}=O_p\{\sqrt{\log(p)/n_0}\}$| when |$d_1\equiv\lim_{n_0\to \infty} \log(n_{\Omega_1})/\log(n_0) < (1+\nu)/2$|⁠. Nevertheless, when |$\lambda_{\sf\scriptscriptstyle screen} =O\{(n_0^{-1/2}+p/n_{\Omega_1})^{(1-\nu)(1+\gamma)}\}$|⁠, we may follow similar arguments as given in Zou and Zhang (2009) to show that |$P(\widehat{{\cal A}}=\mathcal{A}) \to 1$| as |$n_0\to \infty$| and |$\|\widehat{\boldsymbol{\beta}}_{\sf\scriptscriptstyle screen}^{\scriptscriptstyle{{\cal A}}} - \boldsymbol{\beta}_0^{\scriptscriptstyle{{\cal A}}}\|_2 = O_p(p/n_{\Omega_1}+n_0^{-1/2})$|⁠. In addition, when |$M$| is sufficiently large, |$\widehat{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{SOLID}}$| attains the oracle properties in that |$P(\widehat{\cal A}_{\sf \scriptscriptstyle{SOLID}}={\cal A})\rightarrow1$| and |$\sqrt{n_0}(\widehat{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{SOLID}}^{\scriptscriptstyle{{\cal A}}} - \boldsymbol{\beta}_0^{\scriptscriptstyle{{\cal A}}})$| and |$\sqrt{n_0}(\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}^{\mathcal{A}}- \boldsymbol{\beta}_0^{\scriptscriptstyle{{\cal A}}})$| have the same limiting distribution, where |$\widehat{{\cal A}}_{\sf \scriptscriptstyle{SOLID}} = \{j: \widehat{\beta}_{{\sf \scriptscriptstyle{SOLID}},j}\ne 0\} \subseteq \widehat{{\cal A}}$|⁠.

 
Remark 1

We use the ridge estimator in (1.1) instead of the standard MLE since |$n_{\Omega_1}$| may not be substantially larger than |$p$|⁠, in which case the ridge estimator is more stable than MLE. In practice, we find that both MLE and the ridge estimator work well when |$p \ll n_{\Omega_1}$| (e.g., |$p=50$| and |$n_{\Omega_1} = 10^4$|⁠), but the ridge estimator is much more stable otherwise (e.g., |$p=500$| and |$n_{\Omega_1} = 10^4$|⁠).

 
Remark 2

Although, theoretically, |$\widehat{{\cal A}}$| from the screening step can already consistently estimate |$\mathcal{A}$|⁠, we follow up with a refined estimate in steps (2.1) and (2.2) since |$\widehat{{\cal A}}$| is obtained based on the one-step estimator |$\widetilde{\boldsymbol{\beta}}_{\Omega_{\sf \scriptscriptstyle +}}^{\sf\scriptscriptstyle lin,1}$|⁠, which has a sub-optimal convergence rate. In practice, one may also choose a slightly conservative |$\lambda_{\sf\scriptscriptstyle screen}$| to make sure that the screening step does not remove any true signals.

 
Remark 3

Step (2.1) requires |$M$| iterations but can be carried out quickly because it is constrained to the set of predictors identified by |$\widehat{{\cal A}}$|⁠, whose size is substantially smaller than |$p$|⁠. In practice, one may choose |$M$| such that step (2.1) converges, or such that |$(p/n_{\Omega_1}+n_0^{-1/2})^{2^M} = o( n_0^{-1/2})$|⁠, i.e., |$M > \log\{1/(d_1-\nu)\}/\log(2)-1$| if |$p = c \times n_0^{\nu}$| and |$n_{\Omega_1} = c \times n_0^{d_1}$| for some constant |$c$|⁠. When either condition holds, |$\widetilde{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{DAC}}^{\scriptscriptstyle{{\cal A}}}$| is asymptotically equivalent to the solution to |$\widehat{{\bf U}}_{\sf \scriptscriptstyle{DAC}}^{\scriptscriptstyle{{\cal A}}}(\boldsymbol{\beta}^{\scriptscriptstyle{{\cal A}}}) = {\bf 0}$|⁠.

2.3. Modified cross-validation for model evaluation

It is well known that the apparent accuracy estimators (APP), obtained by comparing the predicted values with the observed outcomes in the training data, are subject to overfitting bias in finite samples. CV is commonly used to reduce this bias. However, in the standard CV procedure, the regression coefficients are repeatedly re-estimated in each random split, which substantially increases the computational burden when both |$n_0$| and |$p$| are large. To reduce the time cost, we propose the following procedure to approximate the standard CV estimator without repeatedly fitting the model within each data split,

|$\widehat{\text{TPR}}_{\sf \scriptscriptstyle{MCV}}(c) = K^{-1}\sum_{k=1}^K \widehat{\text{TPR}}_{\Omega_k}(c; \widehat{\boldsymbol{\beta}}^{\scriptscriptstyle \widehat{\mathcal{A}}} _{{\sf \scriptscriptstyle{DAC}},{\sf \scriptscriptstyle{-k}}}), \quad\mbox{where}\quad \widehat{\text{TPR}}_{\Omega_k}(c; \boldsymbol{\beta}) = \frac{\sum_{i \in \Omega_k} Y_i I\{g(\boldsymbol{\beta}^{{\sf \scriptscriptstyle{T}}}{\bf X}_i)\ge c\}}{\sum_{i \in \Omega_k} Y_i}$|

is the empirical TPR evaluated on the held-out |$k$|th data block.

We obtain |$\widehat{\boldsymbol{\beta}}^{\scriptscriptstyle \widehat{\mathcal{A}}} _{{\sf \scriptscriptstyle{DAC}},{\sf \scriptscriptstyle{-k}}}$| by repeating steps (1.2) to (2.2) of the SOLID procedure while leaving out the |$k$|th data block, as follows:

(MCV 1) obtain the one-step linear approximation |$\widetilde{\boldsymbol{\beta}}_{{\sf \scriptscriptstyle{-k}}}^{\sf\scriptscriptstyle lin,1}$| as in step (1.2), with |$\widehat{{\bf U}}_{\sf \scriptscriptstyle{DAC}}$| and |$\widehat{\mathbb{A}}_{\sf \scriptscriptstyle{DAC}}$| replaced by their leave-one-block-out counterparts,
where |$ \widehat{{\bf U}}_{{\sf \scriptscriptstyle{DAC}},{\sf \scriptscriptstyle{-k}}}(\boldsymbol{\beta}) = (K-1)^{-1}\sum_{k'=1,k'\ne k}^K \widehat{{\bf U}}_{\Omega_{k'}}(\boldsymbol{\beta}) \quad \mbox{and}\quad \widehat{\mathbb{A}}_{{\sf \scriptscriptstyle{DAC}},{\sf \scriptscriptstyle{-k}}}(\boldsymbol{\beta}) = (K-1)^{-1}\sum_{k'=1,k'\ne k}^K \widehat{\mathbb{A}}_{\Omega_{k'}}(\boldsymbol{\beta})$|⁠. Note that |$\widehat{{\bf U}}_{\Omega_{k'}}$| and |$\widehat{\mathbb{A}}_{\Omega_{k'}}$| for |$k' = 1, \ldots, K$| are side products of the SOLID procedure and are readily available.
(MCV 2) apply the LSA for screening as in step (1.3), with |$\widetilde{\boldsymbol{\beta}}_{{\sf \scriptscriptstyle{-k}}}^{\sf\scriptscriptstyle lin,1}$| in place of |$\widetilde{\boldsymbol{\beta}}_{\Omega_{\sf \scriptscriptstyle +}}^{\sf\scriptscriptstyle lin,1}$|⁠, to obtain |$\widehat{\boldsymbol{\beta}}_{\sf\scriptscriptstyle screen,-k}$| and the active set |$\widehat{\cal A}_{-k}= \{\jmath: \widehat{\beta}_{\sf\scriptscriptstyle screen,-k,\jmath} \ne 0\}.$|

(MCV 3) iteratively calculate the DAC approximated initial estimator as in step (2.1): let |${\widetilde{\boldsymbol{\beta}}}_{{\sf \scriptscriptstyle{DAC}},{\sf \scriptscriptstyle{-k}}}^{\scriptscriptstyle [0]} = {\widehat{\boldsymbol{\beta}}}_{\sf\scriptscriptstyle screen,-k}^{\odot {\scriptscriptstyle \widehat{\cal A}}_{-k}}$| and update for |$M$| iterations with |$\widehat{{\bf U}}_{{\sf \scriptscriptstyle{DAC}},{\sf \scriptscriptstyle{-k}}}$| and |$\widehat{\mathbb{A}}_{{\sf \scriptscriptstyle{DAC}},{\sf \scriptscriptstyle{-k}}}$| in place of |$\widehat{{\bf U}}_{\sf \scriptscriptstyle{DAC}}$| and |$\widehat{\mathbb{A}}_{\sf \scriptscriptstyle{DAC}}$|⁠.

Let |$\widetilde{\boldsymbol{\beta}}_{{\sf \scriptscriptstyle{DAC}},{\sf \scriptscriptstyle{-k}}}^{\scriptscriptstyle \widehat{\mathcal{A}}}=\widetilde{\boldsymbol{\beta}}_{{\sf \scriptscriptstyle{DAC}},{\sf \scriptscriptstyle{-k}}}^{[M]}$|⁠.

(MCV 4) obtain the final estimator |$\widehat{\boldsymbol{\beta}}^{\scriptscriptstyle \widehat{\mathcal{A}}} _{{\sf \scriptscriptstyle{DAC}},{\sf \scriptscriptstyle{-k}}}$| as in step (2.2), using the data blocks other than |$\Omega_{k}$|⁠.

It is worth mentioning that the major time cost of the SOLID procedure lies in the calculation of |$\widetilde{\boldsymbol{\beta}}_{\Omega_1}^{\sf \scriptscriptstyle rid}$| in step (1.1) and the calculation of |$\widehat{{\bf U}}_{\Omega_{k}}$| and |$\widehat{\mathbb{A}}_{\Omega_k}$| for |$k=1, \ldots, K$| in step (1.2). In the MCV procedure, we reuse |$\widetilde{\boldsymbol{\beta}}_{\Omega_1}^{\sf \scriptscriptstyle rid}$|⁠, |$\widehat{{\bf U}}_{\Omega_{k}}$|⁠, and |$\widehat{\mathbb{A}}_{\Omega_k}$| as the side products from the SOLID procedure, thereby substantially reducing the time cost.
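The following schematic R sketch illustrates this reuse of side products; for brevity it performs a single Newton-type refinement with the leave-one-block-out summaries instead of the full repeat of steps (1.2) through (2.2), and the threshold c0 as well as the objects from the earlier sketches (Omega, Xi, y, b_dac, act, expit, score_block, info_block) are assumptions.

    U_list <- lapply(Omega, score_block, b = b_dac)
    A_list <- lapply(Omega, info_block,  b = b_dac)
    U_tot  <- Reduce(`+`, U_list);  A_tot <- Reduce(`+`, A_list)
    tpr_mcv <- mean(sapply(seq_along(Omega), function(k) {
      U_k <- (U_tot - U_list[[k]]) / (length(Omega) - 1)      # leave-block-k-out score
      A_k <- (A_tot - A_list[[k]]) / (length(Omega) - 1)      # leave-block-k-out information
      b_k <- b_dac; b_k[act] <- b_k[act] + solve(A_k[act, act], U_k[act])
      idx <- Omega[[k]]                                       # evaluate on the held-out block
      ph  <- as.numeric(expit(Xi[idx, ] %*% b_k))
      mean(ph >= c0 & y[idx] == 1) / mean(y[idx] == 1)        # empirical TPR at threshold c0
    }))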

2.4. Tuning and standard errors calculation

The tuning parameter |$\lambda_{\Omega{_{\sf \scriptscriptstyle +}}}$| is chosen by minimizing the Bayesian information criteria (BIC) of the fitted model. Specifically, for any given tuning parameter |$\lambda_{\Omega_{\sf \scriptscriptstyle +}}$| with its corresponding estimate of |$\boldsymbol{\beta}$|⁠, |$\widehat{\boldsymbol{\beta}}_{\lambda_{\Omega_{\sf \scriptscriptstyle +}}}$|⁠, the BIC is defined as
(2.4) |$\text{BIC}_{\lambda_{\Omega_{\sf \scriptscriptstyle +}}} = -2\, n_0\, \widehat{\ell}_{\Omega_{\sf \scriptscriptstyle +}}(\widehat{\boldsymbol{\beta}}_{\lambda_{\Omega_{\sf \scriptscriptstyle +}}}) + \log(n_0)\, df_{\lambda_{\Omega_{\sf \scriptscriptstyle +}}},$|
where |$df_{\lambda_{\Omega_{\sf \scriptscriptstyle +}}}=\sum_{\jmath = 1}^p I(\widehat{\beta}_{\lambda_{\Omega_{\sf \scriptscriptstyle +}}, \jmath} \ne 0)$|⁠. With the LSA, we may further approximate |$\text{BIC}_{\lambda_{\Omega_{\sf \scriptscriptstyle +}}}$| by
(2.5) |$\text{BIC}_{\lambda_{\Omega_{\sf \scriptscriptstyle +}}} \approx n_0\,(\widehat{\boldsymbol{\beta}}_{\lambda_{\Omega_{\sf \scriptscriptstyle +}}}-\widetilde{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{DAC}})^{{\sf \scriptscriptstyle{T}}}\widehat{\mathbb{A}}_{\sf \scriptscriptstyle{DAC}}(\widetilde{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{DAC}})(\widehat{\boldsymbol{\beta}}_{\lambda_{\Omega_{\sf \scriptscriptstyle +}}}-\widetilde{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{DAC}}) + \log(n_0)\, df_{\lambda_{\Omega_{\sf \scriptscriptstyle +}}}.$|
We may thus estimate the variance–covariance matrix for |$n^{\frac{1}{2}}_0(\widehat{\boldsymbol{\beta}}^{\scriptscriptstyle \widehat{\mathcal{A}}}_{\sf \scriptscriptstyle{DAC}}-\boldsymbol{\beta}_0^{\scriptscriptstyle \widehat{\mathcal{A}}})$| using |$\widehat{\mathbb{A}}^{\scriptscriptstyle \widehat{\mathcal{A}}}(\widehat{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{DAC}})^{-1}$|⁠. For |$\jmath \in \widehat{{\cal A}}$|⁠, a |$(1-\alpha)\times 100\%$| confidence interval for |$\beta_{0\jmath}$| can be calculated accordingly. However, deriving the explicit variance estimate for the accuracy parameter |$\widehat{\text{TPR}}_{\sf \scriptscriptstyle{MCV}}$| is challenging since its asymptotic variance involves unknown derivative functions. On the other hand, while resampling methods such as the bootstrap are often used to overcome such challenges, they are overly time consuming under the present setting. We therefore propose an approximate standard error for both |$\widehat{\boldsymbol{\beta}}^{\scriptscriptstyle \widehat{\mathcal{A}}}_{\sf \scriptscriptstyle{DAC}}$| and |$\widehat{\text{TPR}}_{\sf \scriptscriptstyle{MCV}}$| by leveraging many quantities already calculated for the |$K$| independent partitions. Specifically, we propose to approximate the standard error of |$\widehat{\boldsymbol{\beta}}^{\scriptscriptstyle \widehat{\mathcal{A}}}_{\sf \scriptscriptstyle{DAC}}$| by
(2.6) |$\widehat{\rm{se}}\big(\widehat{\beta}_{{\sf \scriptscriptstyle{DAC}},\jmath}\big) = \Big[\{K(K-1)\}^{-1}\textstyle\sum_{k=1}^K\big(\widehat{\beta}_{{\sf \scriptscriptstyle{DAC}},{\sf \scriptscriptstyle{k}},\jmath}-\overline{\beta}_{{\sf \scriptscriptstyle{DAC}},\jmath}\big)^2\Big]^{1/2}, \qquad \jmath \in \widehat{{\cal A}},$|
where |$\overline{\boldsymbol{\beta}}^{\scriptscriptstyle \widehat{{\cal A}}}_{\sf \scriptscriptstyle{DAC}}=K^{-1}\sum_{k=1}^K{\widehat{\boldsymbol{\beta}}}^{\scriptscriptstyle \widehat{{\cal A}}}_{{\sf \scriptscriptstyle{DAC}},{\sf \scriptscriptstyle{k}}}$| and |${\widehat{\boldsymbol{\beta}}}^{\scriptscriptstyle \widehat{{\cal A}}}_{{\sf \scriptscriptstyle{DAC}},{\sf \scriptscriptstyle{k}}}$| denotes the block-|$k$| specific estimator.
It is worth mentioning that although the apparent |$\widehat{\text{TPR}}_{\sf \scriptscriptstyle{APP}}$| is subject to overfitting, it can be shown that one may approximate the standard errors of |$\widehat{\text{TPR}}_{\sf \scriptscriptstyle{MCV}}$| by that of |$\widehat{\text{TPR}}_{\sf \scriptscriptstyle{APP}}$| following similar arguments as given in Tian and others (2007). We thus propose to approximate the standard error of |$\widehat{\text{TPR}}_{\sf \scriptscriptstyle{MCV}}$| by
(2.7) |$\widehat{\rm{se}}\big(\widehat{\text{TPR}}_{\sf \scriptscriptstyle{MCV}}\big) = \Big[\{K(K-1)\}^{-1}\textstyle\sum_{k=1}^K\big(\widehat{\text{TPR}}^{\scriptscriptstyle \widehat{{\cal A}}}_{\sf \scriptscriptstyle{APP,k}}-\overline{\text{TPR}}^{\scriptscriptstyle \widehat{{\cal A}}}_{\sf \scriptscriptstyle{APP}}\big)^2\Big]^{1/2},$|
where |$\overline{\text{TPR}}^{\scriptscriptstyle \widehat{{\cal A}}}_{\sf \scriptscriptstyle{APP}}=K^{-1}\sum_{k=1}^K\widehat{\text{TPR}}^{\scriptscriptstyle \widehat{{\cal A}}}_{\sf \scriptscriptstyle{APP,k}}$|⁠, and |$\widehat{\text{TPR}}^{\scriptscriptstyle \widehat{\mathcal{A}}}_{\sf \scriptscriptstyle{APP,k}}$| is the apparent TPR of |$\widehat{\boldsymbol{\beta}}^{\scriptscriptstyle \widehat{\mathcal{A}}}_{{\sf \scriptscriptstyle{DAC}},{\sf \scriptscriptstyle{k}}}$|⁠, built and validated using the |$k$|th data block.
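A minimal R sketch of this between-block standard error approximation is given below; beta_blocks (a K x q matrix of block-specific active-set estimates) and tpr_app_blocks (the length-K vector of block-specific apparent TPRs) are assumed inputs with illustrative names.

    K <- nrow(beta_blocks)
    beta_bar <- colMeans(beta_blocks)
    se_beta  <- sqrt(colSums(sweep(beta_blocks, 2, beta_bar)^2) / (K * (K - 1)))  # cf. (2.6)
    tpr_bar  <- mean(tpr_app_blocks)
    se_tpr   <- sqrt(sum((tpr_app_blocks - tpr_bar)^2) / (K * (K - 1)))           # cf. (2.7)
    # e.g., an approximate 95% interval: tpr_mcv +/- 1.96 * se_tpr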

3. Simulations

3.1. Simulation settings

We conducted extensive simulations to evaluate the performance of the proposed SOLID and MCV algorithms for developing and evaluating risk prediction models, and to compare them with existing approaches. Throughout, we let |$n_0 = 10^6$| and consider |$p=50$|⁠, |$500$|⁠, and |$1000$|⁠. We set |$K=100$| for |$p=50$| and |$500$|⁠, and |$K=50$| for |$p=1000$|⁠. We determine the rate of |$p$| as |$p = 2 \times n^{\nu}$|⁠, which gives |$\nu = 0.17, 0.33,$| and |$0.38$| for |$p=50, 500,$| and |$1000$|⁠, respectively. We generated |$Y$| from logistic models with three choices of |$\boldsymbol{\beta}_0$| (settings I–III) reflecting different degrees of sparsity and signal strength; the distinct coefficient values are listed in the footnote of Table 1, and an additional setting (IV) is described in the Supplementary Materials available at Biostatistics online.

We generate |${\bf X}_{-1}$|⁠, the covariate vector excluding the intercept, from |$N(\textbf{0}_p, \mathbb{V})$| with |$\mathbb{V}=0.8\,\mathbb{I}_{p\times p} + 0.2\,\mathbf{1}\mathbf{1}^{{\sf \scriptscriptstyle{T}}}$|⁠, where |$\mathbb{I}_{p\times p}$| is the identity matrix and |$\mathbf{1}$| is the |$p$|-vector of ones.
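A minimal R sketch of this data-generating mechanism follows; the coefficient vector beta0 and the reduced sample size are placeholders and do not correspond to settings I–III.

    library(MASS)
    # Covariates with exchangeable correlation 0.2 and a logistic outcome as in (2.1)
    n0 <- 1e5; p <- 50                                     # n0 reduced from 10^6 for a quick run
    V  <- 0.8 * diag(p) + 0.2                              # unit variances, 0.2 off-diagonal
    X  <- mvrnorm(n0, mu = rep(0, p), Sigma = V)
    beta0 <- c(-2, 1, 0.8, 0.4, 0.2, 0.1, rep(0, p - 5))   # placeholder: intercept first, then p coefficients
    y  <- rbinom(n0, 1, plogis(cbind(1, X) %*% beta0))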

For model estimation, we compare the SOLID estimator with four existing estimators: (i) the full sample-based adaptive LASSO estimator |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}$|⁠; (ii) the majority voting-based DAC scheme for GLM proposed by Chen and Xie (2014), denoted by |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle \sf Chen}$|⁠; (iii) the de-biased LASSO-based DAC scheme for GLM proposed by Tang and others (2016), denoted by |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle \sf Tang}$|⁠; (iv) the sparse meta-analysis approach proposed by He and others (2016), denoted by |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle \sf SMA}$|⁠. For all methods except SMA, we set |$\gamma=1$| for a fair comparison. When implementing SMA, we set |$\gamma =0$| so that no initial estimator is needed, due to the high computational cost of the SMA procedure. For model evaluation, we compare the MCV estimator with the external estimator (EXT), obtained by evaluating the fitted models on a large external validation set of size |$10^6$|⁠, and with the apparent estimator (APP).

For any |$\widehat{\boldsymbol{\beta}}\in\{\widehat{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{DAC}}, \widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}, \widehat{\boldsymbol{\beta}}_{\scriptscriptstyle \sf Chen}, \widehat{\boldsymbol{\beta}}_{\scriptscriptstyle \sf Tang}, \widehat{\boldsymbol{\beta}}_{\scriptscriptstyle \sf SMA}\}$|⁠, we report (i) the average computation time for |$\widehat{\boldsymbol{\beta}}$|⁠; (ii) the global mean squared error (GMSE), defined as |$(\widehat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0)^{{\sf \scriptscriptstyle{T}}}\mathbb{V}(\widehat{\boldsymbol{\beta}}-\boldsymbol{\beta}_0)$|⁠; (iii) the bias of each individual coefficient; (iv) the mean squared error (MSE) of each individual coefficient; (v) the empirical probability of |$\jmath\not\in\widehat{{\cal A}}$|⁠; (vi) the empirical coverage level of the 95% normal confidence interval (CI). For the model evaluation comparison, we report (i) the average computation time for calculating the ROC curve using EXT, APP, and MCV; (ii) the average difference (DIFF) between the APP and EXT accuracy estimates, and that between the MCV and EXT accuracy estimates; (iii) the empirical standard errors of the EXT, APP, and MCV accuracy estimates; (iv) the average approximated standard error (ASE) of the MCV accuracy estimate as derived in equation (2.7); (v) the coverage level of the 95% normal confidence intervals of the APP and MCV estimates. Taking the TPR as an example, we define "coverage" as the proportion of times that the confidence interval of |$\widehat{\text{TPR}}_{\sf \scriptscriptstyle{APP}}$| or |$\widehat{\text{TPR}}_{\sf \scriptscriptstyle{MCV}}$| contains |$\overline{\text{TPR}}_{\sf \scriptscriptstyle{EXT}}$|⁠, where |$\overline{\text{TPR}}_{\sf \scriptscriptstyle{EXT}}$| is defined as the average of |$\widehat{\text{TPR}}_{\sf \scriptscriptstyle{EXT}}$| taken over both the repeated samples and |$\widehat{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{SOLID}}$| across the 1000 simulations. Note that |$\overline{\text{TPR}}_{\sf \scriptscriptstyle{EXT}}(c)$| is essentially a Monte Carlo estimator of the population parameter |$\text{TPR}(c) = E\{\text{TPR}(c; \widehat{\boldsymbol{\beta}})\}$|⁠, where |$\text{TPR}(c; \boldsymbol{\beta}) = P(\pi_{\scriptscriptstyle \boldsymbol{\beta}}^{\sf \scriptscriptstyle{new}} \ge c \mid Y^{\sf \scriptscriptstyle{new}} = 1) $|⁠.
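For a single simulated dataset, the GMSE and coverage summaries may be computed as in the following sketch, assuming bhat, se_bhat (e.g., from (2.6)), beta0, and V are available; the handling of the intercept is an assumption.

    d    <- bhat[-1] - beta0[-1]                    # drop the intercept so the dimension matches V
    gmse <- as.numeric(t(d) %*% V %*% d)            # (bhat - beta0)' V (bhat - beta0)
    ci_lo <- bhat - 1.96 * se_bhat
    ci_hi <- bhat + 1.96 * se_bhat
    covered <- (beta0 >= ci_lo) & (beta0 <= ci_hi)  # per-coefficient 95% CI coverage indicator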

The statistical performance is evaluated based on 1000 simulated datasets for each setting. Each simulation is performed on a single core of a cluster node with an Intel® Xeon® E5-2697 v3 processor @ 2.60 GHz.

3.2. Simulation results

We first conduct sensitivity analyses of the number of iterations |$M$| to examine its impact on the proposed estimator |$\widehat{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{SOLID}}$|⁠. Across all settings, we found that it suffices to let |$M=3$| for |$p \le 500$| and |$M=6$| for |$p=1000$|⁠. Hence, we summarize below the results for |$\widehat{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{SOLID}}$| with |$M=3$| for |$p=50$| and |$500$|⁠, and |$M=6$| for |$p=1000$|⁠, unless noted otherwise. These choices ensure that |$M > \log\{1/(d_1-\nu)\}/\log(2)-1$|⁠, as required by the theoretical results.

We summarize in Table 1 the performance measures of the different regression estimators for settings I–III, and in the Supplementary Materials available at Biostatistics online the performance for setting IV. There are substantial differences in computation time across methods, and the proposed SOLID algorithm is substantially faster than all competing DAC methods. The relative performance with respect to computation time is similar across the three sets of |$\boldsymbol{\beta}_0$|⁠, and the gains in computational efficiency are even larger for larger |$p$|⁠. For example, when |$p = 1000$|⁠, the SOLID algorithm is at least 10 times faster than the competing DAC methods.

Table 1.

Comparisons among the proposed SOLID algorithm, the standard adaptive LASSO (FULL), the sparse meta-analysis approach (SMA), and the algorithms proposed by Tang and others (2016) and Chen and Xie (2014), with respect to average computing time in minutes, global mean square error (GMSE) |$\times 10^5$|⁠, bias |$\times 10^3$|⁠, and coverage probability (CP) |$\times 10^2$| of each individual coefficient. For individual |$\beta_j$|'s, we only present the results for the six unique values, with |$(b_1, b_2, b_3,b_4,b_5,b_6)$| taking values |$(1,0.8,0.4,0.2,0.1,0)$| in settings (I) and (II) and |$(0.5,0.4,0.2,0.1,0.05,0)$| in setting (III).

                                       Bias                                      CP
Setting  p     Method  Time   GMSE     b1    b2    b3    b4    b5    b6          b1    b2    b3    b4    b5
I        50    FULL    8.97   5.11     2.78  2.63  2.50  2.51  2.72  0.00        94.0  94.3  95.2  95.0  93.2
I        50    SOLID   0.06   5.54     3.02  2.74  2.53  2.51  2.59  0.00        95.6  95.6  94.9  95.3  95.7
I        50    SMA     0.24   156      15.9  14.2  8.74  7.43  5.10  1.25        0.0   0.0   22.7  56.8  63.6
I        50    Tang    7.01   44.2     3.18  3.06  2.69  2.60  2.71  2.57        93.7  93.9  94.6  95.3  94.7
I        50    Chen    7.01   5.93     3.07  2.96  2.55  2.48  2.61  0.00        93.3  93.0  94.9  94.9  94.9
I        500   FULL    103    5.42     2.83  2.87  2.51  2.58  2.63  0.00        95.1  95.3  94.8  94.7  94.3
I        500   SOLID   1.66   5.33     2.78  2.66  2.51  2.54  2.61  0.00        94.6  95.6  94.3  94.5  92.5
I        500   SMA     21.8   2927     91.1  74.3  40.3  23.7  15.6  0.07        0.0   0.0   0.0   0.0   0.0
I        500   Tang    94.0   429      3.01  2.98  2.76  2.65  2.71  2.58        93.1  95.0  94.3  94.5  94.8
I        500   Chen    92.9   5.60     2.90  2.81  2.58  2.55  2.54  0.00        93.1  94.8  93.5  94.3  94.8
I        1000  FULL    539    5.11     2.87  2.73  2.30  2.32  2.39  0.00        87.6  88.0  96.8  92.2  94.0
I        1000  SOLID   9.41   5.53     2.89  2.73  2.64  2.70  2.71  0.00        94.2  94.2  93.4  95.0  94.2
I        1000  SMA     100    10365    180   145   77.0  42.8  25.6  0.01        0.0   0.0   0.0   0.0   0.0
I        1000  Tang    200    862      3.12  2.90  2.64  2.47  2.48  2.50        93.7  94.2  95.5  95.3  95.8
I        1000  Chen    206    5.34     3.01  2.77  2.53  2.37  2.36  0.00        94.7  92.6  94.5  94.7  95.0
II       50    FULL    9.41   12.8     3.30  2.79  2.79  2.86  2.79  0.00        96.0  94.4  94.8  95.1  93.9
II       50    SOLID   0.06   13.0     3.14  2.99  2.98  2.70  2.80  0.00        95.1  95.2  94.9  95.0  94.8
II       50    SMA     0.33   80.4     9.01  7.32  4.14  3.22  3.04  0.00        33.6  51.3  84.0  91.8  90.3
II       50    Tang    7.08   59.1     3.90  3.56  2.91  2.88  2.88  2.75        88.9  91.2  93.9  94.9  95.6
II       50    Chen    7.06   18.8     3.86  3.52  2.89  2.81  2.79  0.00        88.7  91.5  94.8  94.7  95.3
II       500   FULL    111    13.0     2.97  3.06  2.62  2.73  2.96  0.00        94.0  94.5  94.5  94.1  93.1
II       500   SOLID   1.82   14.4     3.03  3.05  2.88  2.73  2.88  0.00        94.4  93.3  94.8  95.5  92.8
II       500   SMA     25.8   6546     90.4  72.2  36.0  17.6  10.6  0.00        0.0   0.0   0.0   0.0   12.7
II       500   Tang    83.4   504      3.73  3.49  2.92  2.85  2.78  2.75        88.7  91.6  94.2  94.6  94.7
II       500   Chen    83.1   18.4     3.66  3.44  2.88  2.74  2.67  0.00        88.4  91.6  94.2  93.6  94.8
II       1000  FULL    303    13.3     3.30  3.21  2.66  2.44  3.09  0.00        92.2  97.7  94.8  92.8  92.8
II       1000  SOLID   8.80   14.6     3.53  3.22  2.60  2.73  2.67  0.00        94.6  95.5  95.5  94.6  91.1
II       1000  SMA     114    30202    194   156   77.5  39.3  21.2  0.00        0.0   0.0   0.0   0.0   0.0
II       1000  Tang    233    1001     3.72  3.52  2.96  2.79  2.83  2.86        89.5  92.1  93.8  92.8  95.1
II       1000  Chen    231    19.1     3.67  3.40  2.90  2.79  2.70  0.00        88.5  93.1  93.1  94.7  95.4
III      50    FULL    9.29   9.56     2.65  2.52  2.59  2.48  2.62  0.00        95.1  95.4  95.8  94.9  92.9
III      50    SOLID   0.08   9.49     2.65  2.49  2.58  2.47  2.53  0.00        95.1  95.7  95.6  94.9  93.3
III      50    SMA     0.31   21.3     4.03  3.52  2.64  2.60  2.80  0.00        81.1  86.0  93.1  93.0  93.2
III      50    Tang    7.33   44.1     2.96  2.70  2.47  2.59  2.50  2.61        93.0  94.2  94.1  94.4  95.4
III      50    Chen    7.34   476      6.62  7.24  7.59  7.87  50.0  0.00        48.7  39.0  27.1  28.5  0.0
III      500   FULL    149    10.5     2.41  2.50  2.30  2.60  3.03  0.00        95.3  94.1  95.1  95.6  92.2
III      500   SOLID   1.72   10.3     2.56  2.54  2.40  2.59  2.96  0.00        94.7  95.6  95.0  94.9  91.2
III      500   SMA     23.8   1145     37.1  29.1  14.9  7.59  6.06  0.00        0.0   0.0   0.0   35.1  68.0
III      500   Tang    89.7   423      2.88  2.76  2.56  2.71  2.46  2.61        92.9  95.1  94.7  94.5  94.2
III      500   Chen    87.9   476      6.69  7.20  7.81  7.73  50.0  0.00        43.1  38.3  25.3  33.8  0.0
III      1000  FULL    262    9.78     2.72  2.69  2.87  2.54  3.15  0.00        89.1  95.2  96.7  93.9  87.6
III      1000  SOLID   8.43   10.1     2.45  2.53  2.35  2.44  2.83  0.00        93.3  94.4  94.4  95.6  93.3
III      1000  SMA     100    5215     80.0  64.5  32.1  16.6  10.6  0.00        0.0   0.0   0.0   0.0   10.0
III      1000  Tang    222    845      2.79  2.80  2.60  2.60  2.63  2.57        93.1  93.6  93.6  94.7  95.5
III      1000  Chen    219    476      6.84  7.35  7.48  7.80  50.0  0.00        37.9  40.8  33.3  28.8  0.0

With respect to overall estimation efficiency, |$\widehat{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{SOLID}}$| attains a GMSE comparable to that of |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}$| across all settings, while |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle \sf SMA}$|⁠, |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle \sf Chen}$|⁠, and |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle \sf Tang}$| attain generally larger, and sometimes substantially larger, GMSEs. For example, in setting (I), both |$\widehat{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{SOLID}}$| and |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle \sf Chen}$| attained GMSEs comparable to that of |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}$|⁠. The SMA performs poorly for larger |$p$|⁠, as expected, since it was not designed for that regime. The de-biased LASSO-based DAC estimator |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle \sf Tang}$| also performs poorly due to substantially increased variability and biases for the zero coefficients, resulting in a large aggregated error in the highly sparse settings. In setting (III) with weaker signals, while |$\widehat{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{SOLID}}$| continues to attain MSEs and biases comparable to those of |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}$|⁠, all competing DAC methods perform poorly, with drastically larger GMSEs in the high dimensional setting. The |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle \sf Tang}$| estimator again suffers from larger biases for the zero coefficients. On the other hand, |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle \sf Chen}$| can efficiently identify zero coefficients but has large biases for the non-zero coefficients. It is worth mentioning that both penalized estimators exhibit a small amount of bias; such bias in weak signals is expected for shrinkage estimators (Pavlou and others, 2016). However, it is important to note that |$\widehat{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{SOLID}}$| and |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}$| perform nearly identically, suggesting that the SOLID procedure incurs negligible additional approximation error.

With respect to the empirical probability of |$\jmath\not\in\widehat{{\cal A}}$|⁠, all methods under comparison detect non-zero signals equally well, in that the percentage of non-zero estimates over the 1000 simulations is |$100\%$| whenever the true regression coefficient is not zero. On the other hand, |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle \sf Tang}$| has difficulty in estimating zero signals. The bootstrap was used to estimate the empirical coverage level of the 95% normal confidence interval for |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}$|⁠. All methods under comparison have coverage probabilities close to the 95% nominal level except for SMA, which is expected because SMA tends to produce biased estimates when |$p$| is relatively large.

We summarize in Table 2 the performance of the proposed procedures for making inference on accuracy parameters, comparing the MCV procedure with the apparent accuracy estimation. The external validation is included only as a benchmark and is not feasible in practical settings. For a |$100$|-fold CV, the computational cost of the proposed MCV procedure is slightly higher than that of the apparent estimation and external validation, but is substantially lower than that of the standard CV, confirming that leveraging pre-computed SOLID side products enables us to perform CV efficiently. In terms of accuracy estimation, the proposed MCV procedure produces substantially lower overfitting bias than the apparent estimation. The ROC curves obtained by the EXT, APP, and MCV procedures are displayed in Figure 1. As expected, the ROC curves obtained by EXT and MCV are nearly identical, while that obtained by APP is substantially different. In terms of standard error estimation, the approximate standard error is very close to the empirical standard error, suggesting that it estimates the true standard error well. With respect to coverage, the apparent estimates have coverage probabilities far from the 95% nominal level, while the coverage of the proposed MCV estimates is close to the nominal level.

Fig. 1. Point estimates and 95% confidence intervals of the top 20 non-zero beta coefficient estimates with the largest magnitudes, obtained by the proposed SOLID method and its competing methods.

Table 2.

Comparisons among the external validation (EXT), apparent validation (APP), and MCV with respect to average computing time, average difference from the external validation estimate (DIFF), empirical standard error (ESE), average approximate standard error (ASE), and coverage probability (CP) of the area under the curve (AUC) and sensitivity.

                      AUC                                        Sensitivity
Method  Time   DIFF      ESE       ASE       CP (%)      DIFF       ESE       ASE       CP (%)
EXT     7.5    --        0.00413   --        --          --         0.00320   --        --
APP     7.6    0.00927   0.00468   --        52.1        0.03055    0.01779   --        62.7
MCV     35.0   0.00006   0.00448   0.00422   92.3        -0.00122   0.01536   0.01447   92.9
Entries marked "--" are not reported for the corresponding method.

The proposed SOLID algorithm was implemented as an R software package solid, which is available at https://github.com/celehs/solid.

4. Data application

The EMR contains rich clinical information on patients, including medical history, vital signs, lab test results, and clinical notes. Accurately assigning ICD codes for individual patient visits is highly important on many levels, from ensuring the integrity of the billing process to creating an accurate record of patient medical history. However, the coding process is tedious, subjective, and currently relies largely on manual efforts from professional medical coders. The large volume of medical records makes manual ICD assignment a labor-intensive and costly process, signifying the need for methods to automate the coding process. Narrative features can be extracted from free text such as radiology reports via natural language processing (NLP). To explore the feasibility of automatic code assignment based on narrative text, we aim to develop an ICD classification model for depression based on high dimensional NLP features, and to evaluate its accuracy using EMR data from the Partners HealthCare Biobank (PHB).

The analysis data consisted of ICD codes and NLP features for |$1\,000\,000$| visits, randomly sampled from the |$2\,096\,563$| visits in PHB. The remaining |$1\,096\,563$| visits were reserved as a validation set for model evaluation. The response |$Y$| was the indicator of having at least one ICD code for depression at a visit, with a prevalence of about |$4\%$|⁠. A list of |$1410$| candidate medical concepts was extracted by performing named entity recognition on five articles related to depression from online sources, including Wikipedia, MedlinePlus, Medscape, Merck, and Mayo Clinic, following the strategy of Yu and others (2016). We then performed NLP on the EMR narrative notes to count the occurrences of these concepts in each visit. After quality control and pre-screening, we retained a total of 142 NLP features that appeared in at least 5% of the visits. Because the NLP counts tend to be zero-inflated and highly skewed, we applied the |$x \mapsto \log(x + 1)$| transformation to all features. It is worth mentioning that the visit-level observations are not independent and identically distributed, since measurements from the same patient are correlated. We therefore fitted a penalized generalized linear model with an independence working covariance for estimation and used the sandwich estimator for variance calculation. We set the effective sample size to the number of patients rather than the number of visits for tuning parameter selection, and let |$K = 100$| for the SOLID estimation. The full sample estimator could only be calculated with a very large memory capacity (120 GB) available.
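A small R sketch of this pre-processing, with illustrative object names nlp_counts and patient_id (not from the paper):

    nlp_features <- log1p(nlp_counts)               # log(1 + count) transform of the retained NLP features
    n_eff <- length(unique(patient_id))             # effective sample size: patients, not visits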

For regression coefficients, we compare |$\widehat{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{SOLID},j}$|⁠, |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}$|⁠, |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle \sf Chen}$|⁠, |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle \sf Tang}$|⁠, and |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle \sf SMA}$| with respect to computation time, point estimates, and confidence intervals. The proposed SOLID method takes 31 s, whereas all other methods take longer (34 min, 1532 s, 1476 s, and 116 s for |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}$|⁠, |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle \sf Chen}$|⁠, |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle \sf Tang}$|⁠, and |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle \sf SMA}$|⁠, respectively). The bootstrap was used to calculate the standard errors of |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}$|⁠. We present the point estimates for all methods and the confidence intervals for |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}$| and |$\widehat{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{SOLID},j}$| for the 20 predictors with the largest estimated effect sizes in Figure 1. Results for the remaining predictors are summarized in the Supplementary Materials available at Biostatistics online. As expected, for most of the top 20 predictors, the point estimates and confidence intervals of |$\widehat{\boldsymbol{\beta}}_{\sf \scriptscriptstyle{SOLID},j}$| are very close to those of |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}$|⁠. In contrast, the point estimates of |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle \sf Chen}$|⁠, |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle \sf Tang}$|⁠, and |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle \sf SMA}$| differ from those of |$\widehat{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}$|⁠.

We conduct model evaluation of the SOLID algorithm using MCV. As a benchmark, we evaluate the full-sample-based adaptive lasso estimator using external validation. We present the receiver operating characteristic curves in Figure 2. The AUCs of the proposed method and the full-sample-based adaptive lasso are 0.908 (95% CI [0.905, 0.911]) and 0.907 (95% CI [0.906, 0.909]), respectively. At FPR = 0.1, the TPRs of the proposed method and the full-sample-based adaptive lasso are 0.799 (95% CI [0.792, 0.807]) and 0.797 (95% CI [0.793, 0.800]), respectively; at FPR = 0.15, the TPRs are 0.864 (95% CI [0.858, 0.870]) and 0.860 (95% CI [0.857, 0.863]), respectively. These results suggest that the accuracy estimates of the SOLID algorithm obtained by the MCV procedure are very close to those of the full-sample-based procedure obtained by external validation.

Fig. 2. Averaged ROC curves of external validation (EXT), apparent validation (APP), and modified cross-validation (MCV) over 1000 simulations.
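As an illustration of the accuracy summaries reported above (and not of the MCV procedure itself), the AUC and the TPR at a fixed FPR can be computed from out-of-sample predicted probabilities in base R. The vectors y_val and p_hat below are hypothetical validation-set labels and predicted probabilities.

    # y_val: hypothetical 0/1 outcomes in the validation set
    # p_hat: hypothetical predicted probabilities from the fitted model

    # AUC via the rank (Mann-Whitney) formula
    auc <- function(y, p) {
      r  <- rank(p)
      n1 <- sum(y == 1)
      n0 <- length(y) - n1
      (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
    }

    # TPR at a given FPR: threshold taken as the (1 - FPR) quantile of scores among controls
    tpr_at_fpr <- function(y, p, fpr) {
      cutoff <- quantile(p[y == 0], probs = 1 - fpr)
      mean(p[y == 1] >= cutoff)
    }

    auc(y_val, p_hat)                      # area under the ROC curve
    tpr_at_fpr(y_val, p_hat, fpr = 0.10)   # TPR at FPR = 0.1
    tpr_at_fpr(y_val, p_hat, fpr = 0.15)   # TPR at FPR = 0.15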

5. Discussion

In this article, we propose a novel SOLID method for sparse risk prediction and also develop an efficient MCV procedure for model evaluation. The proposed SOLID algorithm for fitting the adaptive LASSO reduces the computational cost while maintaining the precision of estimation when |$n_0$| is extraordinarily large and |$p$| is large. The use of a screening step and one-step linearization infused DAC makes it feasible to obtain a SOLID estimator that attains precision equivalent to that of the full sample estimator with a many-fold reduction in computation time. The purpose of the screening steps (1.1)–(1.4) is to choose the active set, while that of the post-screening steps (2.1) and (2.2) is to obtain refined estimates. Specifically, although the active set estimate |$\widehat{{\cal A}}$| from the screening step consistently estimates |$\mathcal{A}$| asymptotically, it may not be sufficiently accurate since it is derived from a “rough” one-step estimator |$\widetilde{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}^{\sf\scriptscriptstyle lin,1}$|⁠, whose convergence rate |$\|\widetilde{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}^{\sf\scriptscriptstyle lin,1}-\boldsymbol{\beta}_0\|_2 = O_p(p/n_{\Omega_1})$| is slower than that of |$\widetilde{\boldsymbol{\beta}}_{\scriptscriptstyle\Omega_{\sf \scriptscriptstyle +}}$| when |$n_{\Omega_1}$| is not very large. However, this step is critical in improving the computational speed of SOLID, since step (2.1) can be performed much faster when constrained to the estimated active set |$\widehat{{\cal A}}$|⁠. The time costs of each step of the SOLID algorithm, compared with a direct DAC without screening, are summarized in the Supplementary Materials available at Biostatistics online. To examine the performance of variable selection in the screening procedure, we summarize the size of the estimated active set, the sensitivity (the proportion of truly active features correctly identified by the algorithm), and the specificity (the proportion of non-active features not selected by the algorithm) in the Supplementary Materials available at Biostatistics online.

The SOLID estimator substantially outperforms existing DAC procedures with respect to both computation time and estimation accuracy. One major difference between the SOLID algorithm and existing DAC algorithms is that we combine data from all subsets to calculate the information matrix |${\widehat A}_{\sf \scriptscriptstyle{DAC}}$|⁠, whereas other methods perform estimation on individual subsets and thus essentially rely on information matrices estimated from single subsets. When |$p$| is large, the approximation error of an information matrix estimated from a single subset is substantially larger than that of the aggregated counterpart. We find that the improved precision of |${\widehat A}_{\sf \scriptscriptstyle{DAC}}$| contributes substantially to the performance of the SOLID estimator, especially in settings with weaker signals. For example, Chen and Xie (2014) essentially use individual subsets to calculate the information matrix and the score and then combine estimates across the |$K$| parts via majority voting. Compared with their estimator, the SOLID estimator has a much smaller MSE when signals are weaker (simulation settings 2 and 3) and |$p$| is large.
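To make the role of the aggregated information matrix concrete, the following sketch shows how subset-level score vectors and logistic information matrices could be pooled and used in a single one-step Newton update of an initial estimator. This is only an illustration of the aggregation idea under hypothetical objects X, y, subset_id, and beta_init; it is not the implementation in the solid package.

    # pooled one-step update for logistic regression across K subsets
    expit <- function(z) 1 / (1 + exp(-z))
    p_dim <- ncol(X)
    U <- matrix(0, p_dim, 1)       # pooled score vector
    A <- matrix(0, p_dim, p_dim)   # pooled information matrix (analogue of A_DAC)
    for (k in unique(subset_id)) {
      Xk <- X[subset_id == k, , drop = FALSE]
      yk <- y[subset_id == k]
      mu <- expit(drop(Xk %*% beta_init))
      U  <- U + crossprod(Xk, yk - mu)               # X_k' (y_k - mu_k)
      A  <- A + crossprod(Xk * (mu * (1 - mu)), Xk)  # X_k' W_k X_k
    }
    beta_onestep <- beta_init + solve(A, U)          # one Newton step using pooled quantities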

For our numerical studies, we required |$K$| to be of order |$o(n_0^{1/2})$| and fixed it at |$100$| when |$p=50$| and |$p=500$|⁠, and at |$50$| when |$p=1000$|⁠. Here, we give a few practical guidelines for the choice of |$K$|⁠. The partition size |$K$| should be chosen so that the subset size |$n$| is reasonably larger than |$p$|⁠, say |$n > 5p$|⁠, while remaining large enough to benefit from distributed execution. In our numerical studies, we did not use parallel computing for calculating the subset-level statistics due to computing resource constraints. However, we highly recommend multicore and multi-node parallel computing to maximize the efficiency of the SOLID algorithm. In step (1.1) of the screening step, we used data in |$\Omega_1$| to obtain an initial estimator. When the event rate is low and |$p$| is large, one may combine several subsets to calculate the initial estimator to improve stability. When |$p$| is very large, the proposed SOLID algorithm may still face computational limitations even with |$n > p$|⁠. For such settings, additional refinements such as those adopted in sure independence screening (Fan and Lv, 2008) may be needed. Future research is required to determine what additional model assumptions could improve the screening step.
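As a rough aid for applying these guidelines, the bound on |$K$| implied by the two rules of thumb (per-subset size above |$5p$| and |$K$| below |$n_0^{1/2}$|) can be computed directly; the helper below is a heuristic sketch, not a function of the solid package.

    # heuristic upper bound on the number of subsets K:
    # keep the per-subset size n0 / K above 5 * p and K below sqrt(n0)
    choose_K_max <- function(n0, p, min_ratio = 5) {
      max(1, floor(min(n0 / (min_ratio * p), sqrt(n0))))
    }
    choose_K_max(n0 = 1e6, p = 142)  # upper bound; K = 100 used in Section 4 satisfies it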

Supplementary material

Supplementary material is available at http://biostatistics.oxfordjournals.org.

References

Caner, M. and Zhang, H. H. (2014). Adaptive elastic net for generalized methods of moments. Journal of Business & Economic Statistics 32, 30–47.

Chen, X. and Xie, M. (2014). A split-and-conquer approach for analysis of extraordinarily large data. Statistica Sinica 24, 1655–1684.

Cui, Y., Chen, X. and Yan, L. (2017). Adaptive lasso for generalized linear models with a diverging number of parameters. Communications in Statistics - Theory and Methods 46, 11826–11842.

Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96, 1348–1360.

Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70, 849–911.

He, Q., Zhang, H. H., Avery, C. L. and Lin, D. Y. (2016). Sparse meta-analysis with high-dimensional data. Biostatistics 17, 205–220.

Lee, J. D., Liu, Q., Sun, Y. and Taylor, J. E. (2017). Communication-efficient sparse regression. The Journal of Machine Learning Research 18(1), 115–144.

Pavlou, M., Ambler, G., Seaman, S., De Iorio, M. and Omar, R. Z. (2016). Review and evaluation of penalised regression methods for risk prediction in low-dimensional data with few events. Statistics in Medicine 35, 1159–1177.

Pepe, M. S. (2003). The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford: Oxford University Press.

Tang, L., Zhou, L. and Song, P. X.-K. (2016). Method of divide-and-combine in regularised generalised linear models for big data. arXiv preprint.

Tian, L., Cai, T., Goetghebeur, E. and Wei, L. J. (2007). Model evaluation based on the sampling distribution of estimated absolute prediction error. Biometrika 94, 297–311.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58, 267–288.

Uno, H., Cai, T., Tian, L. and Wei, L. J. (2007). Evaluating prediction rules for t-year survivors with censored regression models. Journal of the American Statistical Association 102, 527–537.

Van de Geer, S., Bühlmann, P., Ritov, Y., Dezeure, R. and others. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics 42, 1166–1202.

Wang, H. and Leng, C. (2007). Unified lasso estimation by least squares approximation. Journal of the American Statistical Association 102, 1039–1048.

Wang, X., Peng, P. and Dunson, D. B. (2014). Median selection subset aggregation for parallel inference. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D. and Weinberger, K. Q. (editors), Advances in Neural Information Processing Systems. Red Hook, NY: Curran Associates, Inc., pp. 2195–2203.

Wang, Y., Palmer, N., Di, Q., Schwartz, J., Kohane, I. and Cai, T. (2021). A fast divide-and-conquer sparse Cox regression. Biostatistics 22, 381–401.

Xie, M., Singh, K. and Strawderman, W. E. (2011). Confidence distributions and a unifying framework for meta-analysis. Journal of the American Statistical Association 106(493), 320–333.

Yu, S., Chakrabortty, A., Liao, K. P., Cai, T., Ananthakrishnan, A. N., Gainer, V. S., Churchill, S. E., Szolovits, P., Murphy, S. N., Kohane, I. S. and others. (2016). Surrogate-assisted feature extraction for high-throughput phenotyping. Journal of the American Medical Informatics Association 24(e1), e143–e149.

Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101, 1418–1429.

Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67, 301–320.

Zou, H. and Zhang, H. H. (2009). On the adaptive elastic-net with a diverging number of parameters. Annals of Statistics 37, 1733.
