Score Test for Assessing the Conditional Dependence in Latent Class Models and its Application to Record Linkage

TABLE 1

Mean and SD of the difference in Brier score (BS) for the Fellegi–Sunter (FS) model and the conditional dependence latent class models (Model 1: pseudo-$R^{2} \geq 1\%$; Model 2: pseudo-$R^{2} \geq 0.5\%$; Model 3: pseudo-$R^{2} \geq 0.1\%$) relative to the true model, and median number of interactions selected, in Scenarios I, II and III

True match prevalence	Model	Scenario I				Scenario II				Scenario III
		BS		Interactions		BS		Interactions		BS		Interactions
		Mean	SD	Total	Correct	Mean	SD	Total	Correct	Mean	SD	Total	Correct
0.01	FS	0.1269	0.00100			0.0002	0.00002			0.1298	0.00100
	Model 1	0.1269	0.00100	0	0	0.0002	0.00002	0	0	0.1298	0.00100	0	0
	Model 2	0.1261	0.00142	0	0	0.0002	0.00002	0	0	0.1276	0.00095	1	0
	Model 3	0.1253	0.00097	1	0	0.0002	0.00002	0	0	0.1276	0.00095	1	0
0.05	FS	0.0686	0.00076			0.0006	0.00002			0.0713	0.00075
	Model 1	0.0939	0.00068	1	0	0.0006	0.00002	0	0	0.0967	0.00066	1	0
	Model 2	0.0939	0.00068	1	0	0.0006	0.00002	0	0	0.0967	0.00066	1	0
	Model 3	0.0677	0.00661	5	2	0.0004	0.00002	1	1	0.0719	0.00342	5	2
0.1	FS	0.0118	0.00345			0.0009	0.00003			0.0085	0.00252
	Model 1	0.0573	0.00725	2	1	0.0009	0.00003	0	0	0.0577	0.00388	2	1
	Model 2	0.0334	0.00082	4	3	0.0009	0.00003	0	0	0.0476	0.00122	4	3
	Model 3	0.0351	0.03873	7	5	0.0007	0.00003	1	1	0.0110	0.00975	7	6
0.3	FS	0.0043	0.00007			0.0023	0.00005			0.0025	0.00007
	Model 1	0.0427	0.00089	2	1	0.0023	0.00005	0	0	0.0427	0.00282	2	1
	Model 2	0.0137	0.00056	5	4	0.0018	0.00005	1	1	0.0357	0.00095	4	3
	Model 3	0.0003	0.00002	7	6	0.0031	0.00019	4	3	0.0038	0.00288	7	6
0.5	FS	0.0033	0.00006			0.0025	0.00005			0.0013	0.00004
	Model 1	0.0328	0.00023	2	1	0.0017	0.00004	1	1	0.0334	0.00024	1	0
	Model 2	0.0181	0.01162	4	3	0.0017	0.00004	1	1	0.0288	0.00023	2	1
	Model 3	0.0002	0.00002	7	6	0.0030	0.00010	6	5	0.0194	0.00116	6	5
0.7	FS	0.0021	0.00005			0.0019	0.00006			0.0005	0.00002
	Model 1	0.0197	0.00019	2	1	0.0011	0.00004	1	1	0.0193	0.00020	1	0
	Model 2	0.0158	0.00038	4	3	0.0160	0.00024	2	1	0.0193	0.00020	1	0
	Model 3	0.0008	0.00011	7	6	0.0000	0.00000	7	6	0.0146	0.00062	4	3
0.9	FS	0.0014	0.00007			0.0012	0.00007			0.0002	0.00001
	Model 1	0.0245	0.00040	2	1	0.0458	0.00114	2	1	0.0002	0.00001	0	0
	Model 2	0.0129	0.00018	5	4	0.0158	0.00022	3	2	0.0015	0.00253	0	0
	Model 3	0.0002	0.00006	7	6	0.0000	0.00000	7	6	0.0055	0.00040	2	1
0.95	FS	0.0077	0.00050			0.0067	0.00061			0.0002	0.00001
	Model 1	0.0593	0.00060	2	1	0.0616	0.00069	2	1	0.0002	0.00001	0	0
	Model 2	0.0130	0.00081	4	3	0.0202	0.00039	3	2	0.0002	0.00001	0	0
	Model 3	0.0001	0.00003	7	6	0.0032	0.00331	7	6	0.0030	0.00007	1	0
0.99	FS	0.3975	0.00142			0.4009	0.00140			0.0001	0.00001
	Model 1	0.3989	0.00139	1	0	0.4012	0.00137	1	0	0.0001	0.00001	0	0
	Model 2	0.1289	0.00503	3	1	0.1232	0.00950	3	1	0.0001	0.00001	0	0
	Model 3	0.1401	0.00283	8	2	0.1330	0.00481	10	1	0.0001	0.00001	0	0

TABLE 2

Mean and SD of the difference in Brier score (BS) for the Fellegi–Sunter (FS) model and the conditional dependence latent class models (Model 1: pseudo-$R^{2} \geq 1\%$; Model 2: pseudo-$R^{2} \geq 0.5\%$; Model 3: pseudo-$R^{2} \geq 0.1\%$) relative to the true model, and median number of interactions selected, in Scenario IV

Match prevalence	Model	BS mean	BS SD	Interactions selected: total	Interactions selected: correct
0.01	FS	0.0062	0.00017
	Model 1	0.0008	0.00005	1	1
	Model 2	0.0048	0.00014	2	1
	Model 3	0.0091	0.00078	8	3
0.05	FS	0.0012	0.00004
	Model 1	0.0011	0.00097	1	1
	Model 2	0.0013	0.00004	4	3
	Model 3	0.0003	0.00002	7	6
0.10	FS	0.0016	0.00004
	Model 1	0.0011	0.00003	1	1
	Model 2	0.0013	0.00006	4	3
	Model 3	0.0003	0.00012	7	6
0.30	FS	0.0029	0.00006
	Model 1	0.0070	0.00011	2	1
	Model 2	0.0040	0.00008	4	3
	Model 3	0.0016	0.00007	7	6
0.50	FS	0.0042	0.00007
	Model 1	0.0060	0.00051	3	2
	Model 2	0.0046	0.00019	5	4
	Model 3	0.0002	0.00012	7	6
0.70	FS	0.0058	0.00009
	Model 1	0.0071	0.00010	3	2
	Model 2	0.0027	0.00169	5	4
	Model 3	0.0003	0.00011	6	5
0.90	FS	0.0089	0.00012
	Model 1	0.0091	0.00013	3	2
	Model 2	0.0082	0.00018	4	3
	Model 3	0.0008	0.00010	6	5
0.95	FS	0.0255	0.00035
	Model 1	0.0278	0.00029	2	1
	Model 2	0.0182	0.00026	5	2
	Model 3	0.0105	0.00985	10	5
0.99	FS	0.0656	0.00043
	Model 1	0.0683	0.00041	1	0
	Model 2	0.0683	0.00041	1	0
	Model 3	0.0622	0.00085	7	2

In Scenario I, with conditional dependence in both the match and non-match classes, the FS model performs reasonably well when the true match prevalence is between 0.3 and 0.95: on average, its BS is only slightly higher than that of the true model. Among the three conditional dependence latent class models, match performance is worst for Model 1, which includes only interactions with pseudo-$R^{2} \geq 1\%$, but improves substantially as more interactions are selected. When interactions with pseudo-$R^{2} \geq 0.1\%$ are selected, Model 3 performs better than the FS model and is only negligibly worse than the true model. The performance of the conditional dependence latent class models can be explained by the conditional dependence structure that is identified. Model 1 generally selects two interactions, of which only one is correct. Model 2 selects four to five interactions, of which three to four are correct. Both models therefore identify an incorrect conditional dependence structure. Model 3, on the other hand, selects all six important interactions and therefore identifies the correct conditional dependence structure. When the true match prevalence approaches zero or one ($\leq 0.1$ or $= 0.99$), the FS model performs poorly, with a much larger BS than the true model. The score test cannot identify the correct conditional dependence structure regardless of the threshold used for the pseudo-$R^{2}$. Consequently, all conditional dependence latent class models perform worse than the true model, although Model 3 usually performs comparably to or better than the FS model.

In Scenario II, where conditional dependence exists only in the match class, all models have similar match performance when the true match prevalence is lower than 0.5. As the true match prevalence increases, the effect of the conditional dependence becomes more prominent. This can be seen in the smaller BS of Model 3 relative to the FS model, although the FS model generally performs well. Again, we observe that Model 3 is able to identify the correct conditional dependence structure except when the true match prevalence reaches 0.99. At the extremely large match prevalence of 0.99, all models perform poorly, with the FS model and Model 1 performing worst: their BS is 0.4 points greater than that of the true model on average. Although Model 2 and Model 3 do not perform satisfactorily, they do relatively better, with a smaller BS. The poor performance of the models can be explained by the incorrect conditional dependence structure identified.

Findings in Scenario III, where conditional dependence exists only in the non-match class, are similar to those in Scenario II. All models perform similarly well when the true match prevalence is large. As the true match prevalence goes down to 0.05 or 0.01, with the vast majority of the record pairs being non-matches, ignoring the conditional dependence results in a poorly performing FS model. However, because of the imbalance between the match and non-match classes, identifying the correct conditional dependence structure is challenging. None of the six interactions is correctly identified in the conditional dependence latent class models. Consequently, these models perform poorly.

In Scenario IV, with highly discriminating fields and conditional dependence in both the match and non-match classes, we see results similar to those in Scenario I, except that the BS is consistently smaller. This is expected since the matching fields have greater discriminating power. Using Model 3, the conditional dependence structure is identified correctly even when the match prevalence is as low as 0.05. When the true match prevalence is 0.01, Model 3 selects eight interactions, of which three are identified correctly. By comparison, at the same prevalence in Scenario I, Model 3 selects only one interaction, and it is incorrect. This shows that the proposed score test is more powerful when highly discriminating fields are used for linkage. In addition, all models perform similarly well when the true match prevalence is small. On the other hand, when the true match prevalence is extremely large, at 0.95 or 0.99, the performance of the FS model suffers, while incorporating the conditional dependence identified by the proposed method yields comparable or better performance.

4 NEWBORN SCREENING DATA DEDUPLICATION

We now apply the proposed method to a real-world linkage example to identify conditional dependence and evaluate the performance of the conditional dependence latent class model relative to the FS model. In this application, a total of 765,814 Health Level 7 (HL7) messages sent to the state public health programme for newborn screening (NBS) results of patients less than 2 months of age in 2017 were extracted from a local Health Information Exchange (HIE) and deduplicated. Fields available for linkage include medical record number (MRN); patient's first name (FN), middle initial (MI) and last name (LN); sex; telephone number (TEL); street address (ADR); zip code (ZIP); date of birth (DOB); and next of kin's first name (NKFN) and last name (NKLN). With more than 300 billion pairs of records to form and compare, we applied five blocking schemes to reduce the number of comparisons: MRN, patient's last name and first name (LN-FN), date of birth and zip code (DOB-ZIP), next of kin's last name and first name (NKLN-NKFN), and TEL. These five blocking schemes contained 9.6 million record pairs, from which a random sample of 15,000 record pairs was selected and manually reviewed. A total of 7967 (53.1%) record pairs in the manual review sample were found to be true matches.
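As a rough illustration of blocking, the sketch below forms candidate pairs only among records that agree exactly on a scheme's key fields. The toy records, the function name `block_pairs`, and the rule that records with a missing key value are left out of a block are all assumptions made for this example; the text does not describe how the blocking was implemented.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records; field names follow the abbreviations in the text.
RECORDS = [
    {"MRN": "A1", "LN": "SMITH", "FN": "ANN", "DOB": "2017-01-02", "ZIP": "46202",
     "NKLN": "SMITH", "NKFN": "JOE", "TEL": "5550001"},
    {"MRN": "A1", "LN": "SMYTH", "FN": "ANN", "DOB": "2017-01-02", "ZIP": "46202",
     "NKLN": "SMITH", "NKFN": "JOE", "TEL": ""},
    {"MRN": "B7", "LN": "SMITH", "FN": "ANN", "DOB": "2017-03-09", "ZIP": "46077",
     "NKLN": "SMITH", "NKFN": "JOSEPH", "TEL": "5550001"},
    {"MRN": "C3", "LN": "LEE", "FN": "BO", "DOB": "2017-01-02", "ZIP": "46202",
     "NKLN": "LEE", "NKFN": "KIM", "TEL": "5550999"},
]

# The five blocking schemes named in the text, as tuples of key fields.
SCHEMES = {
    "MRN": ("MRN",),
    "LN-FN": ("LN", "FN"),
    "DOB-ZIP": ("DOB", "ZIP"),
    "NKLN-NKFN": ("NKLN", "NKFN"),
    "TEL": ("TEL",),
}


def block_pairs(records, key_fields):
    """Return index pairs (i, j), i < j, of records sharing the blocking key."""
    groups = defaultdict(list)
    for idx, rec in enumerate(records):
        key = tuple(rec.get(field) for field in key_fields)
        # Assumption: a record with a missing key value joins no block.
        if any(value in (None, "") for value in key):
            continue
        groups[key].append(idx)
    pairs = set()
    for members in groups.values():
        pairs.update(combinations(members, 2))
    return pairs


pairs_by_scheme = {name: block_pairs(RECORDS, fields) for name, fields in SCHEMES.items()}
all_pairs = set().union(*pairs_by_scheme.values())

n = len(RECORDS)
print("pairs without blocking:", n * (n - 1) // 2)
for name, pairs in pairs_by_scheme.items():
    print(name, sorted(pairs))
print("distinct pairs across schemes:", len(all_pairs))
```

Keeping the pairs per scheme, rather than only their union, mirrors the analysis that follows, where each blocking scheme is modelled separately.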

The proposed method is applied to the five blocking schemes separately. As in the simulation study, three conditional dependence latent class models of increasing complexity are considered in addition to the FS model. Thresholds for the pseudo-$R^{2}$ are 1% for Model 1, 0.5% for Model 2, and 0.1% for Model 3. The BS is calculated on the manual review sample to evaluate the match performance. Since the goal of record linkage is to classify record pairs as matches or non-matches, we also provide the F-score for each model. The F-score is the harmonic mean of sensitivity and positive predictive value (PPV) and therefore requires dichotomising the model-predicted probabilities $\hat{p}_{i}$. The threshold for dichotomisation is selected so that the proportion of record pairs with model-predicted probabilities above the threshold is equal to the model-estimated match prevalence.
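The two evaluation measures can be sketched as follows. This is a minimal illustration, not the authors' code: it uses the standard definition of the Brier score (mean squared difference between predicted probability and the 0/1 review label), and it implements the prevalence-based threshold by flagging the top `round(prevalence * n)` pairs, which is one assumed way of handling ties and rounding. The probabilities, labels and prevalence below are made-up toy values.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def dichotomise_by_prevalence(probs, prevalence):
    """Flag as matches the highest-probability pairs, so that the flagged
    proportion equals the estimated match prevalence (up to rounding)."""
    n = len(probs)
    k = round(prevalence * n)
    order = sorted(range(n), key=lambda i: probs[i], reverse=True)
    pred = [0] * n
    for i in order[:k]:
        pred[i] = 1
    return pred


def f_score(pred, labels):
    """Harmonic mean of sensitivity and positive predictive value."""
    tp = sum(1 for p, y in zip(pred, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(pred, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(pred, labels) if p == 0 and y == 1)
    if tp == 0:
        return 0.0
    sensitivity = tp / (tp + fn)
    ppv = tp / (tp + fp)
    return 2 * sensitivity * ppv / (sensitivity + ppv)


# Toy values (hypothetical): predicted match probabilities and review labels.
probs = [0.95, 0.90, 0.80, 0.60, 0.40, 0.20, 0.10, 0.05]
labels = [1, 1, 1, 0, 1, 0, 0, 0]
estimated_prevalence = 0.5  # hypothetical model-estimated match prevalence

bs = brier_score(probs, labels)
pred = dichotomise_by_prevalence(probs, estimated_prevalence)
f = f_score(pred, labels)
print(f"Brier score = {bs:.4f}, F-score = {f:.3f}")
```

Tying the cut-off to the estimated prevalence avoids choosing an arbitrary probability threshold such as 0.5, at the cost of making the classification depend on how well the model estimates the prevalence.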

4.1 The MRN blocking scheme

The MRN blocking scheme captures roughly 4.2 million record pairs. Seven matching fields are used in the latent class models: patient's LN, FN, MI, sex, TEL, ADR and ZIP. Date of birth is not used because of its high agreement rate of 99.7%. For each of the 21 pairs of matching fields, score test statistics are calculated for the three types of conditional dependence structures (Table 3). Conditional dependence between ADR and ZIP is the strongest, giving the largest improvement in model fit, with a pseudo-$R^{2}$ of 2.53% when a constant conditional dependency parameter is allowed. This is followed by the conditional dependence between LN and MI, which has a pseudo-$R^{2}$ of 1.39% when a conditional dependency parameter is incorporated in the match class only. These two interactions are included in Model 1. In Model 2, conditional dependence with a pseudo-$R^{2}$ of 0.5% or above is considered, so two additional parameters are included: the constant conditional dependency parameters for MI and ADR and for MI and ZIP. Model 3 includes conditional dependence with a pseudo-$R^{2}$ of 0.1% or above and is therefore the most complicated model, with 12 conditional dependency parameters.
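The selection rule described above can be reproduced from the values reported in Table 3. The sketch below is illustrative only: it assumes that, for each selected pair, the structure with the largest score statistic is the one retained (as the "Maximum statistic" column suggests), and it applies the thresholds to the pseudo-$R^{2}$ values as printed, that is, rounded to two decimals. The names `TABLE3` and `select_interactions` are introduced for this example.

```python
# Rows copied from Table 3: field 1, field 2, score statistics for the
# constant / non-match only / match only structures, pseudo-R^2 in percent.
TABLE3 = [
    ("ADR", "ZIP", 213890.2, 212906.5, 183954.8, 2.53),
    ("LN", "MI", 98555.9, 78449.0, 117313.5, 1.39),
    ("MI", "ADR", 46212.1, 45591.7, 27248.9, 0.55),
    ("MI", "ZIP", 42652.3, 42060.1, 25456.7, 0.50),
    ("LN", "ZIP", 41103.2, 40399.7, 17641.7, 0.49),
    ("LN", "ADR", 39428.5, 38686.2, 18036.9, 0.47),
    ("LN", "FN", 16546.0, 15678.4, 8932.9, 0.20),
    ("FN", "MI", 12568.8, 12365.7, 5110.7, 0.15),
    ("TEL", "ADR", 10189.3, 10803.7, 57.6, 0.13),
    ("FN", "ZIP", 9778.1, 9685.3, 7392.1, 0.12),
    ("FN", "ADR", 9198.4, 9109.9, 7215.6, 0.11),
    ("TEL", "ZIP", 7399.3, 8419.8, 3811.5, 0.10),
    ("LN", "TEL", 7230.3, 6228.0, 3700.6, 0.09),
    ("FN", "TEL", 1172.5, 988.8, 1471.2, 0.02),
    ("MI", "TEL", 1211.1, 1001.0, 861.5, 0.01),
    ("FN", "SEX", 1198.0, 1154.6, 475.7, 0.01),
    ("SEX", "ZIP", 9.7, 12.6, 98.5, 0.00),
    ("SEX", "ADR", 9.7, 12.6, 96.5, 0.00),
    ("SEX", "MI", 86.6, 89.7, 7.7, 0.00),
    ("LN", "SEX", 28.6, 64.3, 40.4, 0.00),
    ("SEX", "TEL", 0.0, 0.2, 3.1, 0.00),
]

STRUCTURES = ("constant", "non-match only", "match only")


def select_interactions(rows, threshold_pct):
    """Keep pairs whose pseudo-R^2 (in %) meets the threshold and label each
    with the dependence structure that has the largest score statistic."""
    selected = []
    for field1, field2, *stats, pseudo_r2 in rows:
        if pseudo_r2 >= threshold_pct:
            best = max(range(len(STRUCTURES)), key=lambda k: stats[k])
            selected.append((field1, field2, STRUCTURES[best]))
    return selected


model1 = select_interactions(TABLE3, 1.0)
model2 = select_interactions(TABLE3, 0.5)
model3 = select_interactions(TABLE3, 0.1)

for name, model in (("Model 1", model1), ("Model 2", model2), ("Model 3", model3)):
    print(name, len(model), "interactions")
```

Because the thresholds are applied to rounded values, pairs sitting exactly on a threshold (MI and ZIP at 0.50%, TEL and ZIP at 0.10%) are included here, which agrees with the parameter counts stated in the text.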

TABLE 3

Score test statistics for the medical record number (MRN) blocking scheme of the newborn screening data deduplication

Pair number	Field 1	Field 2	Score statistic: constant dependency	Score statistic: non-match only	Score statistic: match only	Maximum statistic	Pseudo-$R^{2}$
1	ADR	ZIP	213,890.2	212,906.5	183,954.8	213,890.2	2.53%
2	LN	MI	98,555.9	78,449.0	117,313.5	117,313.5	1.39%
3	MI	ADR	46,212.1	45,591.7	27,248.9	46,212.1	0.55%
4	MI	ZIP	42,652.3	42,060.1	25,456.7	42,652.3	0.50%
5	LN	ZIP	41,103.2	40,399.7	17,641.7	41,103.2	0.49%
6	LN	ADR	39,428.5	38,686.2	18,036.9	39,428.5	0.47%
7	LN	FN	16,546.0	15,678.4	8932.9	16,546.0	0.20%
8	FN	MI	12,568.8	12,365.7	5110.7	12,568.8	0.15%
9	TEL	ADR	10,189.3	10,803.7	57.6	10,803.7	0.13%
10	FN	ZIP	9778.1	9685.3	7392.1	9778.1	0.12%
11	FN	ADR	9198.4	9109.9	7215.6	9198.4	0.11%
12	TEL	ZIP	7399.3	8419.8	3811.5	8419.8	0.10%
13	LN	TEL	7230.3	6228.0	3700.6	7230.3	0.09%
14	FN	TEL	1172.5	988.8	1471.2	1471.2	0.02%
15	MI	TEL	1211.1	1001.0	861.5	1211.1	0.01%
16	FN	SEX	1198.0	1154.6	475.7	1198.0	0.01%
17	SEX	ZIP	9.7	12.6	98.5	98.5	0.00%
18	SEX	ADR	9.7	12.6	96.5	96.5	0.00%
19	SEX	MI	86.6	89.7	7.7	89.7	0.00%
20	LN	SEX	28.6	64.3	40.4	64.3	0.00%
21	SEX	TEL	0.0	0.2	3.1	3.1	0.00%

The match performance of the models is shown in Table 4. Of the 15,000 randomly selected and reviewed record pairs, 6487 pairs are in the MRN blocking scheme and 6000 (92.5%) of these are true matches. The BS is 0.054 for Model 1, 0.056 for Model 2, and 0.080 for Model 3, all smaller than the BS of 0.086 for the FS model. This shows that the conditional dependence models perform better than the FS model. When the estimated match probabilities are dichotomised using the estimated match prevalence, the F-scores show a similar pattern. All conditional dependence models achieve better F-scores than the FS model, with an improvement of more than 2% for Model 1 and Model 2 and of 1% for Model 3.
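The sensitivity, PPV and F-score for the MRN blocking scheme follow directly from the true positive, false negative and false positive counts reported in Table 4. The short check below recomputes them; the function name `classification_metrics` and the dictionary layout are introduced only for this illustration.

```python
def classification_metrics(tp, fn, fp):
    """Return (sensitivity, PPV, F-score) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    ppv = tp / (tp + fp)
    f_score = 2 * sensitivity * ppv / (sensitivity + ppv)
    return sensitivity, ppv, f_score


# (true positive, false negative, false positive) counts for the MRN
# blocking scheme, copied from Table 4.
MRN_COUNTS = {
    "FS": (5248, 752, 21),
    "Model 1": (5524, 476, 30),
    "Model 2": (5521, 479, 34),
    "Model 3": (5357, 643, 27),
}

mrn_metrics = {name: classification_metrics(*counts) for name, counts in MRN_COUNTS.items()}

for name, (sens, ppv, f) in mrn_metrics.items():
    print(f"{name}: sensitivity={sens:.3f}, PPV={ppv:.3f}, F-score={f:.3f}")
```

Each model's true positives and false negatives sum to the 6000 reviewed true matches in this blocking scheme, so the differences in F-score here are driven mainly by sensitivity, with PPV nearly constant across models.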

TABLE 4

Match performance of the latent class models for the newborn screening data deduplication

Model	Brier score	True negative	True positive	False negative	False positive	Sensitivity	Positive predictive value	F-score
MRN blocking scheme
FS	0.086	466	5248	752	21	0.875	0.996	0.931
Model 1	0.054	457	5524	476	30	0.921	0.995	0.956
Model 2	0.056	453	5521	479	34	0.920	0.994	0.956
Model 3	0.080	460	5357	643	27	0.893	0.995	0.941
LN-FN blocking scheme
FS	0.149	108	3732	860	0	0.813	1.000	0.897
Model 1	0.128	108	3823	769	0	0.833	1.000	0.909
Model 2	0.121	108	3861	731	0	0.841	1.000	0.914
Model 3	0.139	107	3708	884	1	0.807	1.000	0.893
DOB-ZIP blocking scheme
FS	0.071	5711	5889	249	794	0.959	0.881	0.919
Model 1	0.062	5777	5996	142	728	0.977	0.892	0.932
Model 2	0.068	5651	6081	57	854	0.991	0.877	0.930
Model 3	0.072	5587	6090	48	918	0.992	0.869	0.927
TEL blocking scheme
FS	0.138	440	2960	515	223	0.852	0.930	0.889
Model 1	0.134	313	3143	332	350	0.904	0.900	0.902
Model 2	0.134	313	3143	332	350	0.904	0.900	0.902
Model 3	0.145	310	3121	354	353	0.898	0.898	0.898
NKLN-NKFN blocking scheme
FS	0.153	167	1431	26	280	0.982	0.836	0.903
Model 1	0.159	143	1454	3	304	0.998	0.827	0.905
Model 2	0.160	140	1455	2	307	0.999	0.826	0.904
Model 3	0.114	248	1419	38	199	0.974	0.877	0.923
All blocking schemes combined
FS	0.100	6168	7010	957	865	0.880	0.890	0.885
Model 1	0.085	6167	7266	701	866	0.912	0.894	0.903
Model 2	0.089	6062	7343	624	971	0.922	0.883	0.902
Model 3	0.096	6019	7259	708	1014	0.911	0.877	0.894

4.2 The LN-FN blocking scheme

The LN-FN blocking scheme contains roughly 3 million record pairs. Six matching fields are used: MRN, MI, DOB, TEL, ADR and ZIP. Sex is not used as a matching field in the models because of its high agreement rate of 99.5%. Table 5 shows the score test statistics for each of the 15 pairs of matching fields. ADR and ZIP again show the strongest conditional dependence, with a pseudo-$R^{2}$ of 1.16%. Model 1 therefore includes one conditional dependency parameter for the interaction between these two fields in the match class only. Two additional interactions are included in Model 2: one between DOB and ZIP in the non-match class only, and the other between MRN and MI with a constant conditional dependency parameter. Both interactions have a pseudo-$R^{2}$ above 0.5%. Model 3 includes the 10 interactions whose pseudo-$R^{2}$ is at least 0.1%.

TABLE 5

Score test statistics for the last name and first name (LN-FN) blocking scheme of the newborn screening data deduplication

Pair number	Field 1	Field 2	Score test statistic (constant dependency)	Score test statistic (non-match only)	Score test statistic (match only)	Maximum statistic	Pseudo-$R^{2}$
1	ADR	ZIP	6655.6	4937.7	91,699.4	91,699.4	1.16%
2	DOB	ZIP	70,376.4	70,377.4	468.9	70,377.4	0.89%
3	MRN	MI	43,615.1	37,516.7	14,816.6	43,615.1	0.55%
4	DOB	TEL	32,616.9	32,618.2	230.2	32,618.2	0.41%
5	DOB	ADR	20,311.7	20,312.2	1748.9	20,312.2	0.26%
6	MRN	ZIP	16,383.4	18,327.0	156.9	18,327.0	0.23%
7	MI	DOB	17,258.3	17,258.6	90.9	17,258.6	0.22%
8	MRN	DOB	10,463.1	10,464.9	2753.3	10,464.9	0.13%
9	MRN	ADR	2108.3	4003.5	10,006.0	10,006.0	0.13%
10	MRN	TEL	7617.4	7172.4	1002.3	7617.4	0.10%
11	MI	ADR	3604.9	3385.9	3427.4	3604.9	0.05%
12	TEL	ZIP	2988.7	3419.4	670.6	3419.4	0.04%
13	TEL	ADR	1002.2	1108.5	135.1	1108.5	0.01%
14	MI	TEL	836.4	1075.8	1006.7	1075.8	0.01%
15	MI	ZIP	49.7	49.9	0.2	49.9	0.00%
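
The thresholding that defines Models 1 to 3 can be sketched in Python using the rounded pseudo-$R^{2}$ values from Table 5. The function name `select_interactions` and the dictionary layout are illustrative assumptions, not the authors' code. Because the inputs are rounded to two decimals, a pair sitting exactly on a threshold (such as MRN and TEL at 0.10%) is included here by the `>=` comparison.

```python
# Pseudo-R^2 values (in percent) copied from Table 5, keyed by field pair.
PSEUDO_R2_PERCENT = {
    ("ADR", "ZIP"): 1.16,
    ("DOB", "ZIP"): 0.89,
    ("MRN", "MI"): 0.55,
    ("DOB", "TEL"): 0.41,
    ("DOB", "ADR"): 0.26,
    ("MRN", "ZIP"): 0.23,
    ("MI", "DOB"): 0.22,
    ("MRN", "DOB"): 0.13,
    ("MRN", "ADR"): 0.13,
    ("MRN", "TEL"): 0.10,
    ("MI", "ADR"): 0.05,
    ("TEL", "ZIP"): 0.04,
    ("TEL", "ADR"): 0.01,
    ("MI", "TEL"): 0.01,
    ("MI", "ZIP"): 0.00,
}


def select_interactions(pseudo_r2, threshold_percent):
    """Return the field pairs whose pseudo-R^2 meets the threshold (hypothetical helper)."""
    return [pair for pair, r2 in pseudo_r2.items() if r2 >= threshold_percent]


model_1 = select_interactions(PSEUDO_R2_PERCENT, 1.0)  # pseudo-R^2 >= 1%
model_2 = select_interactions(PSEUDO_R2_PERCENT, 0.5)  # pseudo-R^2 >= 0.5%
model_3 = select_interactions(PSEUDO_R2_PERCENT, 0.1)  # pseudo-R^2 >= 0.1%
print(len(model_1), len(model_2), len(model_3))
```

Run as written, the sketch selects 1, 3 and 10 interactions, in line with the model descriptions above. Which dependence type each interaction takes (constant, non-match only or match only) would follow from the largest of the three score statistics in the table.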

A total of 4700 record pairs in the manual review sample are in the LN-FN blocking scheme, of which 4592 (97.7%) are true matches. The BS is 0.128 for Model 1, 0.121 for Model 2 and 0.139 for Model 3. As in the MRN blocking scheme, all conditional dependence models perform better than the FS model, which has a BS of 0.149. When the match probabilities are dichotomised using the estimated match prevalence, Model 2 has the greatest F-score (0.914), an improvement of 1.7 percentage points over the FS model (0.897).
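
As a check on the summary measures, the sensitivity, positive predictive value and F-score in the LN-FN rows of Table 4 can be recomputed from the reported counts. The sketch below assumes that the F-score is the harmonic mean of sensitivity and positive predictive value, which reproduces the tabulated values; the function name `match_metrics` is illustrative.

```python
def match_metrics(tp, fn, fp):
    """Sensitivity, positive predictive value and F-score from confusion counts."""
    sensitivity = tp / (tp + fn)
    ppv = tp / (tp + fp)
    # Assumption: F-score is the harmonic mean of sensitivity and PPV.
    f_score = 2 * sensitivity * ppv / (sensitivity + ppv)
    return sensitivity, ppv, f_score


# (true positive, false negative, false positive) counts from the
# LN-FN blocking scheme rows of Table 4.
LN_FN_COUNTS = {
    "FS": (3732, 860, 0),
    "Model 1": (3823, 769, 0),
    "Model 2": (3861, 731, 0),
    "Model 3": (3708, 884, 1),
}

for model, (tp, fn, fp) in LN_FN_COUNTS.items():
    sens, ppv, f = match_metrics(tp, fn, fp)
    print(f"{model}: sensitivity={sens:.3f}, PPV={ppv:.3f}, F-score={f:.3f}")
```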

4.3 Other blocking schemes

In the other three blocking schemes, there are approximately 8 million (DOB-ZIP), 2.6 million (TEL) and 1.2 million (NKLN-NKFN) record pairs. The score test statistics are shown in the Appendix (Tables A1, A2 and A3). In the DOB-ZIP blocking scheme, the BS is 0.062 for Model 1, 0.068 for Model 2 and 0.072 for Model 3, showing comparable or better performance than the FS model (BS = 0.071). In the TEL blocking scheme, Model 1 and Model 2 include the same interaction terms and achieve a BS of 0.134, which is better than that of the FS model (BS = 0.138). Model 3 has a BS of 0.145, which is slightly larger than that of the FS model. In the NKLN-NKFN blocking scheme, the BS values of Model 1 (0.159) and Model 2 (0.160) are larger than that of the FS model. Model 3, however, produces a BS of 0.114, smaller than the BS of 0.153 for the FS model.

4.4 Overall results

In record linkage, results across blocking schemes are usually combined by classifying a record pair as a match if it is determined to be a match in at least one blocking scheme. We use the same decision rule when evaluating the overall F-score. For the calculation of the overall BS, we use the maximum estimated match probability for any record pair that falls in multiple blocking schemes. The results are shown in Table 4. The BS is 0.085 for Model 1, 0.089 for Model 2, and 0.096 for Model 3. Compared with the FS model, which has a BS of 0.100, accommodating the conditional dependence improves match performance. Similar results are seen in the F-scores.
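
The combination rule described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the pair identifiers and the toy probabilities are hypothetical, and the Brier score is taken in its usual form as the mean squared difference between the estimated match probability and the 0/1 true match status.

```python
def combine_probabilities(probs_by_scheme):
    """Keep, for each record pair, the maximum estimated match probability
    over all blocking schemes in which the pair appears."""
    combined = {}
    for probs in probs_by_scheme.values():
        for pair_id, p in probs.items():
            combined[pair_id] = max(p, combined.get(pair_id, 0.0))
    return combined


def combine_decisions(decisions_by_scheme):
    """Declare a pair a match if it is a match in at least one scheme."""
    overall = {}
    for decisions in decisions_by_scheme.values():
        for pair_id, is_match in decisions.items():
            overall[pair_id] = overall.get(pair_id, False) or is_match
    return overall


def brier_score(probs, truth):
    """Mean squared difference between probabilities and true 0/1 status."""
    return sum((probs[k] - truth[k]) ** 2 for k in truth) / len(truth)


# Toy, made-up inputs: pair "b" falls in two blocking schemes.
toy_probs = {"MRN": {"a": 0.9, "b": 0.2}, "TEL": {"b": 0.7, "c": 0.1}}
toy_decisions = {"MRN": {"a": True, "b": False}, "TEL": {"b": True, "c": False}}
toy_truth = {"a": 1, "b": 1, "c": 0}

overall_probs = combine_probabilities(toy_probs)
overall_match = combine_decisions(toy_decisions)
print(overall_probs, overall_match, brier_score(overall_probs, toy_truth))
```

In this toy input, pair "b" takes the probability 0.7 from the TEL scheme rather than 0.2 from the MRN scheme, and it is classified as a match because one scheme declares it a match.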

5 DISCUSSION

The FS model is widely used in probabilistic record linkage despite its often invalid assumption of conditional independence. Prior research has demonstrated its impaired performance when conditional dependence exists, as well as the potential gain in matching accuracy when conditional dependence latent class models are used (Xu et al., 2019). However, the success of the conditional dependence models depends heavily on using the correct conditional dependence structure (Li et al., 2018). Existing approaches for identifying the conditional dependence structure, including the correlation residual plot, the log-odds ratio check, and the bivariate residual approach, have been shown to perform poorly (Oberski et al., 2013; Subtil et al., 2012). Alternatively, Oberski et al. (2013) proposed the bootstrap bivariate residual approach and the score test approach, both of which perform adequately, with the score test being more appealing because of its computational convenience. In this paper, we extend the score test approach of Oberski et al. (2013) to accommodate more dependence structures. Through the simulation study and the real-world linkage application, the proposed approach is found to be successful. Based on the findings in the simulation study, we recommend using a threshold of 0.1% for the pseudo-$R^{2}$ and including all interactions that meet the threshold in the conditional dependence model. This model is shown to correctly identify the conditional dependence structure in many settings, resulting in comparable or better match performance than the FS model.

The proposed score-based tests handle three types of conditional dependence structure. They can be easily extended to evaluate the conditional dependence in the match and non-match classes simultaneously while allowing the conditional dependency parameters to differ between classes. The two-dimensional score vector is formed by the scores in (8) and (9), where conditional dependence lies in only one class, and the Fisher information matrix is derived similarly. Although this model accommodates a more flexible conditional dependence structure, it does not provide substantially improved matching performance relative to the proposed method in our simulation study, as it identifies a similar number of correct interactions. For example, in Scenario I of the simulation study with 0.01 match prevalence, this score test identifies only one interaction, which is incorrect, in all simulated data sets when using the threshold of 0.1% for the pseudo-$R^{2}$.
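
For orientation, a two-degree-of-freedom score statistic of this kind has the generic form below. The notation is assumed for illustration and is not taken from equations (8) and (9): $U_{M}$ and $U_{U}$ denote the match-only and non-match-only scores evaluated at the FS model estimates, and $\left[\mathcal{I}^{-1}\right]_{\gamma\gamma}$ denotes the $2 \times 2$ block of the inverse Fisher information matrix corresponding to the two conditional dependency parameters.

$$
S = \begin{pmatrix} U_{M} & U_{U} \end{pmatrix} \left[\mathcal{I}^{-1}\right]_{\gamma\gamma} \begin{pmatrix} U_{M} \\ U_{U} \end{pmatrix}
$$

Under the null hypothesis of conditional independence in both classes, the standard large-sample result is that a statistic of this form is approximately $\chi^{2}$ distributed with two degrees of freedom.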

The proposed method performs poorly when the true match prevalence is close to 0% or 100%. This is expected, since the score test is derived from the FS model, which may produce biassed parameter estimates. In record linkage, the FS model is known to have poor performance when there is a lack of overlap between two linked data files (Winkler, 2014), resulting in extremely small match prevalence. Prior research has also demonstrated the poor performance of the FS model when the match prevalence is extremely large (Xu et al., 2019). Although our proposed method generally produces a conditional dependence latent class model that performs similarly to or better than the FS model, its performance is not optimal when the true match prevalence is extremely small or extremely large. We therefore recommend that blocking schemes be selected to produce less extreme and more balanced data. Furthermore, the FS model produces severely biassed parameter estimates when match prevalence is extreme and conditional dependence exists (Xu et al., 2019). The large biases in the parameter estimates produce biassed score test statistics, leading to incorrect identification of the conditional dependence structure. In Scenario I of our simulation, Model 3 correctly identifies the conditional dependence structure in 100% of the simulated data sets with prevalence of 0.01 and 0.05 and in 81% of the simulated data sets with prevalence of 0.99 when the score test statistics are derived using the true values of the model parameters (prevalence, m- and u-probabilities). With the correct conditional dependence structure, the average BS of Model 3 is substantially reduced. This emphasises the importance of conducting a manual review to evaluate the gold standard match status of record pairs, which is helpful in correcting the biases in parameter estimates.

In record linkage, manual review ascertaining the true match status is often performed for a subset of record pairs for various reasons. Manual review results, however, are rarely used in linkage models to improve parameter estimation or match performance. We are currently investigating approaches to incorporating manual review results in the proposed method, and we believe that this hybrid approach may yield a substantial improvement in the correct identification of the conditional dependence structure and, consequently, enhanced match performance.

Another possible strategy for selecting the conditional dependence structure is to include interactions sequentially. Restrictions on the independence of fields are relaxed for one pair of fields at a time, and the score test is computed based on the model assuming the independence of the two fields. If the score test statistic is larger than a certain threshold of the negative log-likelihood of the model, the corresponding interaction is added to the model. Although more computationally burdensome, this sequential selection strategy can be readily implemented. Future research will evaluate its performance relative to the proposed method.
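
One way the sequential strategy might be organised is sketched below. This is an assumption-laden illustration rather than the authors' procedure: the callback `pseudo_r2` is hypothetical and stands in for refitting the current model and computing the score statistic relative to its negative log-likelihood, and adding the strongest remaining candidate first is a design choice made here, not something the text specifies.

```python
from itertools import combinations


def sequential_selection(fields, pseudo_r2, threshold):
    """Add one interaction at a time until no candidate meets the threshold.

    pseudo_r2(selected, pair) is a hypothetical callback: given the tuple of
    interactions already in the model and a candidate pair, it returns the
    pseudo-R^2 of the score test for adding that pair.
    """
    selected = []
    candidates = list(combinations(fields, 2))
    while candidates:
        # Re-evaluate every remaining pair under the current model.
        scores = {pair: pseudo_r2(tuple(selected), pair) for pair in candidates}
        best = max(scores, key=scores.get)
        if scores[best] < threshold:
            break
        selected.append(best)
        candidates.remove(best)
    return selected


def toy_pseudo_r2(selected, pair):
    """Made-up stand-in: evidence halves with each interaction already added."""
    base = {("A", "B"): 0.012, ("A", "C"): 0.004, ("B", "C"): 0.0005}
    return base[pair] / (2 ** len(selected))


chosen = sequential_selection(["A", "B", "C"], toy_pseudo_r2, threshold=0.001)
print(chosen)
```

With the toy callback, the pairs A-B and A-C are added in turn and B-C is rejected. In a real application each call to the callback would involve refitting the latent class model, which is the source of the additional computational burden noted above.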

DATA AVAILABILITY STATEMENT

The programming code and the related data are available at https://huipxu.pages.iu.edu/publications.html.

ACKNOWLEDGEMENTS

This project was supported by grant numbers R01HS023808 from the Agency for Healthcare Research and Quality and ME-2017C1-6425 from the Patient-Centered Outcomes Research Institute. The content is solely the responsibility of the authors and does not necessarily represent the official views of the Agency for Healthcare Research and Quality or the Patient-Centered Outcomes Research Institute.

REFERENCES

Albert, P. & Dodd, L. (2004) A cautionary note on the robustness of latent class models for estimating diagnostic error without a gold standard. Biometrics, 60, 427–435.

Armstrong, J. & Mayda, J. (1992) Estimation of record linkage models using dependent data. In: JSM Proceedings, Survey Research Methods Section. Alexandria, VA: American Statistical Association, pp. 853–858. Available from http://www.asasrms.org/Proceedings/y1992f.html [Accessed 18th August 2022].

Brier, G. (1950) Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78, 1–3.

Byrne, S. (2016) A note on the use of empirical AUC for evaluating probabilistic forecasts. Electronic Journal of Statistics, 10, 380–393.

Clogg, C. (1995) Chapter 6. Latent class models. In: Arminger, G., Clogg, C. & Sobel, M.E. (Eds.) Handbook of statistical modeling for the social and behavioral sciences. New York: Plenum, pp. 311–359.

Enamorado, T., Fifield, B. & Imai, K. (2019) Using a probabilistic model to assist merging of large-scale administrative records. American Political Science Review, 113, 353–371.

Engle, R. (1984) Chapter 13. Wald, likelihood ratio, and Lagrange multiplier tests in econometrics. In: Intriligator, M. & Griliches, Z. (Eds.) Handbook of econometrics, Vol. 2. North-Holland, Amsterdam: Elsevier, pp. 775–826.

Fellegi, I. & Sunter, A. (1969) A theory for record linkage. Journal of the American Statistical Association, 64, 1183–1210.

Garrett, E. & Zeger, S. (2000) Latent class model diagnosis. Biometrics, 56, 1055–1067.

Gneiting, T. & Raftery, A. (2007) Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102, 359–378.

Goldstein, H. & Harron, K. (2015) Chapter 6. Record linkage: a missing data problem. In: Harron, K., Goldstein, H. & Dibben, C. (Eds.) Methodological developments in data linkage. London: Wiley, pp. 109–124.

Goodman, L. (1974) Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika, 61, 215–231.

Hand, D. (2006) Classifier technology and the illusion of progress. Statistical Science, 21, 1–14.

Hand, D. & Yu, K. (2001) Idiot's Bayes - not so stupid after all? International Statistical Review, 69, 385–398.

Jones, G., Johnson, W., Hanson, T. & Christensen, R. (2010) Identifiability of models for multiple diagnostic testing in the absence of a gold standard. Biometrics, 66, 855–863.

Li, X., Xu, H., Shen, C. & Grannis, S. (2018) Automated linkage of patient records from disparate sources. Statistical Methods in Medical Research, 527, 172–184.

Newcombe, H. & Kennedy, J. (1962) Record linkage: making maximum use of the discriminating power of identifying information. Communications of the Associations for Computing Machinery (ACM), 5, 563–566.

Oberski, D. & Vermunt, J. (2018) The expected parameter change (EPC) for local dependence assessment in binary data latent class models. arXiv preprint arXiv:1801.02400.

Oberski, D., van Kollenburg, G.H. & Vermunt, J. (2013) A Monte Carlo evaluation of three methods to detect local dependence in binary data latent class models. Advances in Data Analysis and Classification, 7, 267–279.

Ong, T., Mannino, M., Schilling, L. & Kahn, M. (2014) Improving record linkage performance in the presence of missing linkage data. Journal of Biomedical Informatics, 52, 43–54.

Qu, Y., Tan, M. & Kutner, M. (1996) Random effects models in latent class analysis for evaluating accuracy of diagnostic tests. Biometrics, 52, 797–810.

Rao, C. (1948) Large sample tests of statistical hypotheses concerning several parameters with applications to problems of estimation. Mathematical Proceedings of the Cambridge Philosophical Society, 44(1), 50–57.

Sadinle, M. (2014) Detecting duplicates in a homicide registry using a Bayesian partitioning approach. Annals of Applied Statistics, 8, 2404–2434.

Sadinle, M. (2017) Bayesian estimation of bipartite matchings for record linkage. Journal of the American Statistical Association, 112, 600–612.

Sariyar, M., Borg, A. & Pommerening, K. (2012) Missing values in deduplication of electronic patient data. Journal of the American Medical Informatics Association, 19, e76–e82.

Stanghellini, E. & Vantaggi, B. (2013) Identification of discrete concentration graph models with one hidden binary variable. Bernoulli, 19, 1820–1937.

Subtil, A., de Oliveira, M. & Gonçalves, L. (2012) Conditional dependence diagnostic in the latent class model: a simulation study. Statistics & Probability Letters, 82, 1407–1412.

Thibaudeau, Y. (1993) The discrimination power of dependency structures in record linkage. Survey Methodology, 19, 31–38.

Vermunt, J. & Magidson, J. (2005) Technical guide for Latent GOLD 4.0: basic and advanced. Belmont, MA: Statistical Innovations.

Winkler, W. (1989) Methods for adjusting for lack of independence in an application of the Fellegi-Sunter model of record linkage. Survey Methodology, 15, 101–117.

Winkler, W. (2014) Matching and record linkage. WIREs Computational Statistics, 6, 313–325.

Xu, H., Li, X., Shen, C., Hui, S. & Grannis, S. (2019) Incorporating conditional dependence in latent class models for probabilistic record linkage: Does it matter? Annals of Applied Statistics, 13, 1753–1790.

Xu, H., Li, X. & Grannis, S. (2022) A simple two-step procedure using the Fellegi-Sunter model for frequency-based record linkage. Journal of Applied Statistics, 49, 2789–2804.

APPENDIX

In the Appendix, we provide details about the score test statistics for the three types of conditional dependence structures for each pair of matching fields in the DOB-ZIP, TEL and NKLN-NKFN blocking schemes.

TABLE A1

Score test statistics for the DOB-ZIP blocking scheme of the newborn screening data deduplication

Pair number	Field 1	Field 2	Score test statistic (constant dependency)	Score test statistic (non-match only)	Score test statistic (match only)	Maximum statistic	Pseudo-$R^{2}$
1	MRN	ADR	688,998.8	42.4	688,984.5	688,998.8	2.08%
2	LN	ADR	3814.2	525,476.6	63.5	525,476.6	1.59%
3	MRN	SEX	348,779.5	2066.5	347,316.7	348,779.5	1.06%
4	MRN	MI	293,642.7	128.9	293,789.4	293,789.4	0.89%
5	LN	MI	168,218.8	133.0	174,438.2	174,438.2	0.53%
6	MRN	FN	171,093.4	1.4	171,133.3	171,133.3	0.52%
7	FN	SEX	93,436.7	25,668.9	141,812.9	141,812.9	0.43%
8	LN	FN	36,857.2	128,675.3	12,844.9	128,675.3	0.39%
9	FN	MI	101,985.2	128.4	105,686.3	105,686.3	0.32%
10	MI	ADR	72,615.8	932.9	75,678.0	75,678.0	0.23%
11	TEL	ADR	29,051.3	38,113.2	23,125.9	38,113.2	0.12%
12	LN	SEX	10,709.9	29,746.1	4617.1	29,746.1	0.09%
13	LN	TEL	4368.4	27,598.4	1799.3	27,598.4	0.08%
14	MRN	TEL	24,744.9	32.3	24,758.5	24,758.5	0.07%
15	MI	TEL	9770.6	10.8	10,172.0	10,172.0	0.03%
16	SEX	MI	8192.0	1446.3	9657.7	9657.7	0.03%
17	MRN	LN	7567.7	161.5	7581.9	7581.9	0.02%
18	SEX	ADR	147.1	3015.6	6495.9	6495.9	0.02%
19	SEX	TEL	4857.5	6005.1	104.4	6005.1	0.02%
20	FN	TEL	15.6	5027.9	177.8	5027.9	0.02%
21	FN	ADR	3680.1	0.1	3774.4	3774.4	0.01%

TABLE A2

Score test statistics for the TEL blocking scheme of the newborn screening data deduplication

Pair number	Field 1	Field 2	Score test statistic (constant dependency)	Score test statistic (non-match only)	Score test statistic (match only)	Maximum statistic	Pseudo-$R^{2}$
1	ADR	ZIP	85,621.9	67,099.7	277,634.7	277,634.7	2.83%
2	LN	MI	79,263.3	11,005.5	109,259.2	109,259.2	1.12%
3	FN	DOB	38,059.5	38,059.5	—	38,059.5	0.39%
4	LN	DOB	33,755.8	33,755.8	—	33,755.8	0.34%
5	MRN	SEX	29,630.8	29,630.8	—	29,630.8	0.30%
6	SEX	ADR	26,277.7	26,277.8	—	26,277.8	0.27%
7	FN	ADR	24,875.5	24,265.9	4349.6	24,875.5	0.25%
8	FN	SEX	21,614.7	21,614.6	—	21,614.7	0.22%
9	FN	ZIP	16,972.5	16,444.3	2040.0	16,972.5	0.17%
10	SEX	DOB	15,677.0	15,677.0	—	15,677.0	0.16%
11	MRN	ADR	1075.8	314.5	14,875.4	14,875.4	0.15%
12	MRN	LN	13,379.9	8586.6	7019.2	13,379.9	0.14%
13	DOB	ADR	12,753.5	12,753.5	—	12,753.5	0.13%
14	MRN	ZIP	12,145.6	12,494.7	684.2	12,494.7	0.13%
15	FN	MI	12,186.3	10,328.4	4511.4	12,186.3	0.12%
16	LN	FN	8855.7	5367.4	10,265.0	10,265.0	0.10%
17	LN	SEX	8523.2	8523.2	—	8523.2	0.09%
18	MRN	DOB	6480.6	6480.6	—	6480.6	0.07%
19	MI	ADR	6435.9	5279.0	3656.1	6435.9	0.07%
20	SEX	ZIP	6342.9	6342.9	—	6342.9	0.06%
21	DOB	ZIP	6250.6	6250.6	—	6250.6	0.06%
22	MI	ZIP	5161.7	4465.9	1248.3	5161.7	0.05%
23	MRN	MI	82.8	739.2	3363.9	3363.9	0.03%
24	LN	ADR	2447.9	3285.3	521.3	3285.3	0.03%
25	LN	ZIP	682.7	1282.0	374.8	1282.0	0.01%
26	MRN	FN	616.1	975.8	861.6	975.8	0.01%
27	MI	DOB	157.4	157.4	—	157.4	0.00%
28	SEX	MI	29.4	29.4	—	29.4	0.00%

TABLE A3

Score test statistics for the NKLN-NKFN blocking scheme of the newborn screening data deduplication

Pair number	Field 1	Field 2	Score test statistic (constant dependency)	Score test statistic (non-match only)	Score test statistic (match only)	Maximum statistic	Pseudo-$R^{2}$
1	MRN	SEX	151,154.3	2612.8	164,152.1	164,152.1	2.89%
2	ADR	ZIP	111,713.9	14,916.2	97,176.4	111,713.9	1.97%
3	LN	MI	86,724.0	406.4	87,181.0	87,181.0	1.54%
4	MRN	ADR	72,124.1	123.3	73,601.2	73,601.2	1.30%
5	MRN	FN	48,627.3	24.3	49,221.0	49,221.0	0.87%
6	FN	SEX	30,105.3	1912.4	36,854.7	36,854.7	0.65%
7	FN	MI	28,196.2	422.9	27,786.9	28,196.2	0.50%
8	DOB	ZIP	21,655.2	23,518.2	104.8	23,518.2	0.41%
9	TEL	ADR	23,371.4	3347.8	20,646.6	23,371.4	0.41%
10	MRN	MI	19,263.4	2.0	19,307.7	19,307.7	0.34%
11	MRN	DOB	17,954.3	17,732.6	2673.7	17,954.3	0.32%
12	LN	FN	17,838.2	1033.4	17,097.0	17,838.2	0.31%
13	MRN	TEL	13,916.3	0.1	14,228.3	14,228.3	0.25%
14	SEX	DOB	10,174.1	9727.9	590.0	10,174.1	0.18%
15	DOB	TEL	7758.5	9667.0	75.3	9667.0	0.17%
16	FN	DOB	8794.0	9553.1	129.1	9553.1	0.17%
17	LN	ADR	4081.2	74.1	4140.1	4140.1	0.07%
18	MI	TEL	3989.5	0.2	4044.5	4044.5	0.07%
19	SEX	MI	3193.8	11.3	3507.8	3507.8	0.06%
20	DOB	ADR	1506.0	2634.3	179.8	2634.3	0.05%
21	TEL	ZIP	2351.7	1358.8	1248.9	2351.7	0.04%
22	MI	ADR	2296.6	3.2	2331.5	2331.5	0.04%
23	SEX	ADR	1724.7	416.1	1344.2	1724.7	0.03%
24	LN	TEL	938.7	6.7	1181.2	1181.2	0.02%
25	LN	SEX	456.2	1.6	1073.9	1073.9	0.02%
26	FN	TEL	25.1	670.5	157.0	670.5	0.01%
27	SEX	ZIP	86.5	585.4	490.2	585.4	0.01%
28	LN	ZIP	322.8	7.0	577.5	577.5	0.01%
29	SEX	TEL	70.6	539.7	118.1	539.7	0.01%
30	FN	ZIP	118.3	413.3	481.3	481.3	0.01%
31	FN	ADR	306.7	134.4	239.3	306.7	0.01%
32	MRN	ZIP	120.6	301.5	245.1	301.5	0.01%
33	LN	DOB	211.6	132.1	149.8	211.6	0.00%
34	MI	DOB	128.7	33.8	102.9	128.7	0.00%
35	MRN	LN	80.9	6.3	93.6	93.6	0.00%
36	MI	ZIP	45.0	25.9	62.7	62.7	0.00%

TABLE A3