Abstract

The Fellegi–Sunter model has been widely used in probabilistic record linkage despite its often invalid conditional independence assumption. Prior research has demonstrated that conditional dependence latent class models yield improved match performance when using the correct conditional dependence structure. With a misspecified conditional dependence structure, these models can yield worse performance. It is, therefore, critically important to correctly identify the conditional dependence structure. Existing methods for identifying the conditional dependence structure include the correlation residual plot, the log-odds ratio check, and the bivariate residual, all of which have been shown to perform inadequately. A bootstrap bivariate residual approach and a score test have also been proposed and found to perform better, with the score test having greater power and lower computational burden. In this paper, we extend the score-test-based approach to account for different conditional dependence structures. Through a simulation study, we develop practical recommendations on the utilisation of the score test and assess the match performance with conditional dependence identified by the proposed method. Performance of the proposed method is further evaluated using a real-world record linkage example. Findings show that the proposed method leads to improved matching accuracy relative to the Fellegi–Sunter model.

1 INTRODUCTION

Latent class models have been widely used in a broad range of fields, one of which is record linkage. In record linkage, records from disparate data sources are compared and matches belonging to the same entity are identified. Due to the lack of a unique identifier, the identification of matches typically relies on quasi-identifiers such as name, address, birth date, and phone number. Comparison of these quasi-identifiers, known as matching fields, for a pair of records results in a vector of binary values with one indicating agreement and zero indicating disagreement. Utilising these binary agreement vectors with the underlying true match status as the latent class, latent class models estimate the probabilities that record pairs are true matches given their agreement vectors. These probabilities are then used to inform the decision rule for the identification of true matches. One latent class model that plays a vital role in record linkage is the conditional independence model. In this model, the agreement patterns for multiple matching fields are assumed to be independent conditional on the true match status. This model was first proposed by Newcombe and Kennedy (1962) and later formalised by Fellegi and Sunter (1969). As an unsupervised classification algorithm, the Fellegi–Sunter (FS) model has demonstrated reasonable performance in many applications and has been widely used as a core component of probabilistic linkage algorithms.

The conditional independence assumption of the FS model is a strong assumption that may not be valid in practice (Armstrong & Mayda, 1992; Thibaudeau, 1993; Winkler, 1989). For example, if two records do not belong to the same person but share the same phone number, they likely belong to different members of the same household and hence are more likely to also share the same address. Simulation studies have shown that conditional dependence latent class models produce comparable or better performance than the FS model (Xu et al., 2019). However, this is true only with the correct conditional dependence structure. When the conditional dependence structure is misspecified, the model could yield a worse performance than the FS model (Li et al., 2018). These findings are consistent with those demonstrated by Albert and Dodd (2004) regarding the use of latent class models for the evaluation of sensitivity and specificity of diagnostic tests in the absence of a gold standard, where it was noted that the conditional dependence latent class models lack robustness with regard to the conditional dependence structure and that misspecified structures can seriously bias the diagnostic accuracy estimation.

These findings highlight the importance of correctly identifying the conditional dependence structure in latent class models. Existing approaches include the correlation residual plot (Qu et al., 1996), the log-odds ratio check (Garrett & Zeger, 2000), the bivariate residual (Vermunt & Magidson, 2005) and the score test (Oberski et al., 2013). The correlation residual plot and the log-odds ratio check have been found to have poor performance in identifying the correct conditional dependence structure (Subtil et al., 2012). The bivariate residual approach, on the other hand, has the drawback that its asymptotic distribution is not known. To address this, Oberski et al. (2013) proposed a parametric bootstrap method to derive the empirical distribution of the bivariate residuals. In the same study, the score test was developed and found to perform as well as the bootstrap bivariate residual approach. The score test is more appealing than the bootstrap bivariate residual approach because of its lower computational burden. It is therefore a promising approach to use in record linkage to improve record matching accuracy.

The score test proposed by Oberski et al. (2013) is limited in two ways. First, it is based on the assumption of a constant conditional dependency parameter across latent classes and does not allow conditional dependence to exist in only one latent class. In record linkage, the conditional dependence may exist in only one latent class. For example, telephone and address may be strongly correlated in the non-match class. In the match class, the correlation might be much weaker since the disagreement in each field is likely due to random typographical errors. Second, the score test identifies conditional dependence based on statistical significance. Since record linkage applications usually involve linkage of large databases and often have extremely large sample sizes, score test statistics would almost always be highly statistically significant, even if the conditional dependence is weak or even negligible.

We therefore propose to extend the score test proposed by Oberski et al. (2013) in two ways: develop the score test to evaluate conditional dependence in one class only, and identify important conditional dependence to incorporate in the model based on the improvement in model fit indicated by a pseudo-R2 measure. The proposed method is presented in Section 2. The performance of the proposed method is evaluated using a simulation study in Section 3 and a real-world example in Section 4, where a newborn screening database is deduplicated. A brief concluding remark is presented in Section 5.

2 PROPOSED METHOD

Assume that there are $J$ matching fields for a total of $N$ record pairs. For every record pair, a vector of $J$ binary variables is used to represent whether each field agrees or not. This results in a total of $K = 2^J$ unique agreement patterns. For the $k$th unique pattern $y_k = (y_{k1}, y_{k2}, \ldots, y_{kJ})$, the marginal probability $P(y_k)$ is written as the mixture probability due to the unknown true match status:

\[
P(y_k) = P(M)\,P(y_k \mid M) + P(U)\,P(y_k \mid U), \tag{1}
\]

where $M$ is the set of record pairs that are true matches and $U$ is the set of record pairs that are true non-matches. $P(M)$ is the match prevalence and $P(U) = 1 - P(M)$ represents the prevalence of the true non-matches. $P(y_k \mid M)$ and $P(y_k \mid U)$ are the conditional distributions of the agreement vector $y_k$ given the true match status.

For the characterisation of these conditional distributions, the FS model makes the conditional independence assumption. The agreement or disagreement of a matching field is assumed to be independent of the agreement or disagreement of another matching field, conditional on the true match status. In other words,

\[
P(y_k \mid M) = \prod_{j=1}^{J} m_j^{\,y_{kj}} (1 - m_j)^{1 - y_{kj}}, \qquad
P(y_k \mid U) = \prod_{j=1}^{J} u_j^{\,y_{kj}} (1 - u_j)^{1 - y_{kj}},
\]

where the conditional probabilities P(ykj=1|M)=mj are referred to as the m-probabilities and P(ykj=1|U)=uj are the u-probabilities. This assumption, however, does not always hold in real-world record linkage applications (Thibaudeau, 1993; Winkler, 1989). Conditional dependence among matching fields may exist in one or both latent classes. When it does, the FS model provides an inadequate fit to the data, yields biassed parameter estimates, and produces impaired matching accuracy (Li et al., 2018; Xu et al., 2019).
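To make the role of the m- and u-probabilities concrete, the following Python sketch computes the posterior match probability $P(M \mid y)$ of an agreement vector under conditional independence. The function name `fs_posterior` and the three-field parameter values are illustrative assumptions; they are not taken from the SAS implementation accompanying this paper.

```python
import numpy as np

def fs_posterior(y, m, u, p_match):
    """Posterior P(M | y) under the FS conditional independence model.

    y may be a single agreement vector or a 2-D array with one row per pattern.
    """
    y = np.asarray(y, dtype=float)
    m = np.asarray(m, dtype=float)
    u = np.asarray(u, dtype=float)
    # Class-conditional likelihoods: products of Bernoulli terms over fields.
    lik_m = np.prod(m ** y * (1 - m) ** (1 - y), axis=-1)
    lik_u = np.prod(u ** y * (1 - u) ** (1 - y), axis=-1)
    num = p_match * lik_m
    return num / (num + (1 - p_match) * lik_u)

# Hypothetical m- and u-probabilities for three matching fields.
m = [0.85, 0.90, 0.85]
u = [0.05, 0.10, 0.02]
print(fs_posterior([1, 1, 1], m, u, p_match=0.1))  # all fields agree
print(fs_posterior([0, 0, 0], m, u, p_match=0.1))  # all fields disagree
```

A record pair agreeing on every field receives a posterior close to one even at a low match prevalence, whereas a pair disagreeing on every field receives a posterior close to zero.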

2.1 Conditional dependence latent class model

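One model that naturally extends the FS model and allows conditional dependence among matching fields is the log-linear latent class model. This is because the FS model can be conveniently reformulated using the log-linear model framework as follows (Clogg, 1995):

\[
P(y_k) = P(M)\,\frac{\exp(\eta_{k1})}{\sum_{k'=1}^{K} \exp(\eta_{k'1})} + P(U)\,\frac{\exp(\eta_{k0})}{\sum_{k'=1}^{K} \exp(\eta_{k'0})},
\]

where

\[
\eta_{k1} = \sum_{j=1}^{J} (\tau_j + \lambda_j)\, y_{kj}, \qquad \eta_{k0} = \sum_{j=1}^{J} \tau_j\, y_{kj}.
\]

The parameter vector $\tau = (\tau_1, \tau_2, \ldots, \tau_J)$ is for the main effect of $y$ and the parameter vector $\lambda = (\lambda_1, \lambda_2, \ldots, \lambda_J)$ is for the association between $y$ and the latent match class. Under this formulation, the m- and u-probabilities are $m_j = e^{\tau_j + \lambda_j} / (1 + e^{\tau_j + \lambda_j})$ and $u_j = e^{\tau_j} / (1 + e^{\tau_j})$.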

The conditional dependence between two fields $j_1$ and $j_2$ can be incorporated by adding interaction terms to $\eta_{k1}$ and/or $\eta_{k0}$. In this paper, we consider three types of conditional dependence structure. If fields $j_1$ and $j_2$ are conditionally dependent in both match and non-match classes, we follow Oberski et al. (2013) and Oberski and Vermunt (2018) and incorporate a constant log-linear dependency parameter $\phi$ in both $\eta_{k1}$ and $\eta_{k0}$ as follows:

\[
\eta_{k1} = \sum_{j=1}^{J} (\tau_j + \lambda_j)\, y_{kj} + \phi\, y_{kj_1} y_{kj_2}, \qquad
\eta_{k0} = \sum_{j=1}^{J} \tau_j\, y_{kj} + \phi\, y_{kj_1} y_{kj_2}. \tag{2}
\]

If the conditional dependence exists among true matches only, the dependency parameter ϕ is added to ηk1 only as follows:

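\[
\eta_{k1} = \sum_{j=1}^{J} (\tau_j + \lambda_j)\, y_{kj} + \phi\, y_{kj_1} y_{kj_2}, \qquad
\eta_{k0} = \sum_{j=1}^{J} \tau_j\, y_{kj}. \tag{3}
\]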

If, on the other hand, the conditional dependence exists only among the true non-matches, we add the dependency parameter ϕ to ηk0 only:

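\[
\eta_{k1} = \sum_{j=1}^{J} (\tau_j + \lambda_j)\, y_{kj}, \qquad
\eta_{k0} = \sum_{j=1}^{J} \tau_j\, y_{kj} + \phi\, y_{kj_1} y_{kj_2}. \tag{4}
\]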
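The three structures can be illustrated numerically. The sketch below assumes that each class-conditional distribution is obtained by normalising $\exp(\eta)$ over all $2^J$ agreement patterns; the function names, the `where` argument and the parameter values are hypothetical and chosen only for illustration.

```python
import itertools
import numpy as np

def all_patterns(J):
    """All 2^J binary agreement patterns, one per row."""
    return np.array(list(itertools.product([0, 1], repeat=J)), dtype=float)

def class_distributions(tau, lam, phi=0.0, pair=(0, 1), where="both"):
    """Return (Y, P(y | M), P(y | U)) for the log-linear latent class model.

    where = "both", "match" or "nonmatch" selects which linear predictor
    receives the interaction phi * y_j1 * y_j2.
    """
    tau = np.asarray(tau, dtype=float)
    lam = np.asarray(lam, dtype=float)
    Y = all_patterns(len(tau))
    inter = Y[:, pair[0]] * Y[:, pair[1]]
    eta1 = Y @ (tau + lam) + (phi * inter if where in ("both", "match") else 0.0)
    eta0 = Y @ tau + (phi * inter if where in ("both", "nonmatch") else 0.0)
    p_m = np.exp(eta1) / np.exp(eta1).sum()
    p_u = np.exp(eta0) / np.exp(eta0).sum()
    return Y, p_m, p_u

# Hypothetical parameters for three fields.
tau = [-2.0, -1.5, -2.5]
lam = [4.0, 3.5, 4.5]
Y, p_m, p_u = class_distributions(tau, lam, phi=1.0, pair=(0, 1), where="match")
# Probability that fields 0 and 1 both agree among true matches.
print((p_m * Y[:, 0] * Y[:, 1]).sum())
```

With $\phi = 0$ the marginal agreement probabilities reduce to the m- and u-probabilities of the FS model; with $\phi > 0$ in the match class only, joint agreement on the two fields becomes more likely among matches while the non-match distribution is unchanged.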

2.2 Score test

We will evaluate the null hypothesis of conditional independence, H0:ϕ=0, between ykj1 and ykj2 using the classical score test (Rao, 1948). The likelihood function of the log-linear latent class model is written as

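\[
L(\psi) = \prod_{k=1}^{K} P(y_k)^{n_k}, \tag{5}
\]

where $n_k$ is the number of record pairs that share the agreement pattern $y_k$, $P(M) = e^{\alpha} / (1 + e^{\alpha})$, and $P(U) = 1 / (1 + e^{\alpha})$. The vector of parameters is $\psi = (\theta, \phi)$, where $\theta = (\alpha, \tau, \lambda)$.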

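We define the score as

\[
s(\psi) = \frac{\partial \log L(\psi)}{\partial \psi} = \sum_{k=1}^{K} \frac{n_k}{P(y_k)} \frac{\partial P(y_k)}{\partial \psi}
\]

and the Fisher information matrix as

\[
I(\psi) = -E\left[ \frac{\partial^2 \log L(\psi)}{\partial \psi\, \partial \psi^{\top}} \right] = N \sum_{k=1}^{K} \frac{1}{P(y_k)} \frac{\partial P(y_k)}{\partial \psi} \frac{\partial P(y_k)}{\partial \psi^{\top}}.
\]

Partition the Fisher information matrix conformably with $\psi = (\theta, \phi)$ as follows:

\[
I = \begin{pmatrix} I_{\theta\theta} & I_{\theta\phi} \\ I_{\phi\theta} & I_{\phi\phi} \end{pmatrix};
\]

writing $s_\phi$ for the element of the score corresponding to $\phi$, the score test statistic is then defined as

\[
S = \frac{s_\phi^2}{I_{\phi\phi} - I_{\phi\theta}\, I_{\theta\theta}^{-1}\, I_{\theta\phi}}, \tag{6}
\]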

where the score and the Fisher information matrix are evaluated at $\phi = 0$ and $\theta = \hat{\theta}$, the maximum likelihood estimate under the null hypothesis of conditional independence, that is, the FS model. The score test statistic follows a $\chi^2$ distribution with one degree of freedom asymptotically if the null hypothesis is correct. Although the conditional dependence model under the alternative hypothesis can be fitted and a likelihood ratio test calculated to evaluate the null hypothesis, the advantage of the score test is that its computation is based only on the FS model. With many possibilities of conditional dependence structure, the use of the score test can be computationally convenient. Another advantage of the score test is related to model identifiability, which is discussed in Section 2.3.
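The null-model estimate $\hat{\theta}$ is often obtained with the EM algorithm. The following sketch is one such fit on pattern counts, written in terms of the match prevalence and the m- and u-probabilities rather than $(\alpha, \tau, \lambda)$. It is an assumption about how the FS model might be fitted, not a description of the SAS macro used in this paper, and the starting values, function names and example parameters are hypothetical.

```python
import itertools
import numpy as np

def fs_pattern_terms(Y, p, m, u):
    """Return P(M)P(y_k|M) and P(U)P(y_k|U) for every pattern (row) of Y."""
    a = p * np.prod(m ** Y * (1 - m) ** (1 - Y), axis=1)
    b = (1 - p) * np.prod(u ** Y * (1 - u) ** (1 - Y), axis=1)
    return a, b

def fit_fs_em(Y, counts, n_iter=500, p0=0.5, m0=0.8, u0=0.2):
    """EM estimates (p, m, u) and log-likelihood of the FS model from pattern counts."""
    Y = np.asarray(Y, dtype=float)
    n = np.asarray(counts, dtype=float)
    p, m, u = p0, np.full(Y.shape[1], m0), np.full(Y.shape[1], u0)
    for _ in range(n_iter):
        a, b = fs_pattern_terms(Y, p, m, u)
        g = a / (a + b)                 # E-step: posterior P(M | y_k)
        wm, wu = n * g, n * (1 - g)     # expected match / non-match counts
        p = wm.sum() / n.sum()          # M-step: weighted proportions
        m = wm @ Y / wm.sum()
        u = wu @ Y / wu.sum()
    a, b = fs_pattern_terms(Y, p, m, u)
    return p, m, u, float(np.sum(n * np.log(a + b)))

# Illustrative check on expected counts generated from known (hypothetical) parameters.
Y = np.array(list(itertools.product([0, 1], repeat=4)), dtype=float)
true_m = np.array([0.90, 0.85, 0.80, 0.95])
true_u = np.array([0.10, 0.05, 0.20, 0.10])
a, b = fs_pattern_terms(Y, 0.3, true_m, true_u)
counts = 1_000_000 * (a + b)
p_hat, m_hat, u_hat, loglik = fit_fs_em(Y, counts)
print(round(p_hat, 3), np.round(m_hat, 3), np.round(u_hat, 3))
```

Starting with the m-probabilities above the u-probabilities steers the algorithm towards the labelling in which the first class is the match class.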

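As pointed out by Oberski et al. (2013) and Oberski and Vermunt (2018), with the constant log-linear dependency parameter as in (2), the score is

\[
s_\phi = \sum_{k=1}^{K} n_k\, y_{kj_1} y_{kj_2} - N \sum_{k=1}^{K} P(y_k)\, y_{kj_1} y_{kj_2}, \tag{7}
\]

which is equal to the difference between the observed and expected number of record pairs with agreement in both fields $j_1$ and $j_2$. In addition, writing $P(M \mid y_k) = P(M) P(y_k \mid M) / P(y_k)$ and $P(U \mid y_k) = 1 - P(M \mid y_k)$ for the posterior class probabilities under the FS model, the score is

\[
s_\phi = \sum_{k=1}^{K} n_k\, P(M \mid y_k) \left( y_{kj_1} y_{kj_2} - m_{j_1} m_{j_2} \right) \tag{8}
\]

if the conditional dependence is in the true match class only and is

\[
s_\phi = \sum_{k=1}^{K} n_k\, P(U \mid y_k) \left( y_{kj_1} y_{kj_2} - u_{j_1} u_{j_2} \right) \tag{9}
\]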

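if the conditional dependence is in the true non-match class only. Calculation of the Fisher information matrix $I$ requires the evaluation of $\partial P(y_k) / \partial \psi$, which, at $\phi = 0$, can be derived as follows:

\[
\frac{\partial P(y_k)}{\partial \alpha} = P(M)\, P(U) \left\{ P(y_k \mid M) - P(y_k \mid U) \right\},
\]

\[
\frac{\partial P(y_k)}{\partial \tau_j} = P(M)\, P(y_k \mid M)\, (y_{kj} - m_j) + P(U)\, P(y_k \mid U)\, (y_{kj} - u_j),
\]

\[
\frac{\partial P(y_k)}{\partial \lambda_j} = P(M)\, P(y_k \mid M)\, (y_{kj} - m_j).
\]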

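Similar to the derivation of the score above, $\partial P(y_k) / \partial \phi$ is equal to

\[
\frac{\partial P(y_k)}{\partial \phi} = P(M)\, P(y_k \mid M) \left( y_{kj_1} y_{kj_2} - m_{j_1} m_{j_2} \right) + P(U)\, P(y_k \mid U) \left( y_{kj_1} y_{kj_2} - u_{j_1} u_{j_2} \right) \tag{10}
\]

with a constant log-linear conditional dependency parameter in both latent classes. It is

\[
\frac{\partial P(y_k)}{\partial \phi} = P(M)\, P(y_k \mid M) \left( y_{kj_1} y_{kj_2} - m_{j_1} m_{j_2} \right)
\]

when the conditional dependence is in the true match class only and

\[
\frac{\partial P(y_k)}{\partial \phi} = P(U)\, P(y_k \mid U) \left( y_{kj_1} y_{kj_2} - u_{j_1} u_{j_2} \right)
\]

when the conditional dependence is in the true non-match class only.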
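The pieces of this section can be assembled into a short numerical sketch. The Python function below computes the score test statistic for one field pair and one dependence structure from pattern counts and FS parameters, using the expected information $N \sum_k P(y_k)^{-1} \{\partial P(y_k)/\partial\psi\}\{\partial P(y_k)/\partial\psi\}^{\top}$. It is an illustrative re-implementation under that assumption rather than the SAS macro; the function names and the four-field example are hypothetical, and the example evaluates the statistic at the generating main-effect parameters as a stand-in for a fitted FS model.

```python
import itertools
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def score_test(counts, alpha, tau, lam, pair, where="both"):
    """Score statistic for H0: phi = 0 at the FS parameters (alpha, tau, lam).

    counts follows the pattern order of itertools.product([0, 1], repeat=J);
    where is "both", "match" or "nonmatch".
    """
    tau, lam = np.asarray(tau, dtype=float), np.asarray(lam, dtype=float)
    n = np.asarray(counts, dtype=float)
    Y = np.array(list(itertools.product([0, 1], repeat=len(tau))), dtype=float)
    pm, m, u = expit(alpha), expit(tau + lam), expit(tau)
    a = pm * np.prod(m ** Y * (1 - m) ** (1 - Y), axis=1)        # P(M) P(y_k | M)
    b = (1 - pm) * np.prod(u ** Y * (1 - u) ** (1 - Y), axis=1)  # P(U) P(y_k | U)
    p = a + b                                                    # P(y_k)
    j1, j2 = pair
    yy = Y[:, j1] * Y[:, j2]
    # Derivatives of P(y_k) with respect to phi for the three structures.
    d_m = a * (yy - m[j1] * m[j2])
    d_u = b * (yy - u[j1] * u[j2])
    d_phi = {"both": d_m + d_u, "match": d_m, "nonmatch": d_u}[where]
    # Derivatives with respect to alpha, tau and lambda.
    d_alpha = (1 - pm) * a - pm * b
    d_lam = a[:, None] * (Y - m)
    d_tau = d_lam + b[:, None] * (Y - u)
    D = np.column_stack([d_alpha, d_tau, d_lam, d_phi])          # K x (2J + 2)
    score_phi = np.sum(n / p * d_phi)
    info = n.sum() * (D / p[:, None]).T @ D
    i_tt, i_tp, i_pp = info[:-1, :-1], info[:-1, -1], info[-1, -1]
    return score_phi ** 2 / (i_pp - i_tp @ np.linalg.solve(i_tt, i_tp))

# Hypothetical example: expected counts with dependence between fields 0 and 1
# in the match class only (phi = 1).
alpha, tau, lam = np.log(0.3 / 0.7), [-2.0, -2.5, -1.5, -2.0], [4.0, 4.5, 3.5, 4.0]
Y = np.array(list(itertools.product([0, 1], repeat=4)), dtype=float)
eta1 = Y @ (np.array(tau) + np.array(lam)) + 1.0 * Y[:, 0] * Y[:, 1]
eta0 = Y @ np.array(tau)
probs = 0.3 * np.exp(eta1) / np.exp(eta1).sum() + 0.7 * np.exp(eta0) / np.exp(eta0).sum()
counts = 1_000_000 * probs
for where in ("both", "match", "nonmatch"):
    print(where, score_test(counts, alpha, tau, lam, (0, 1), where))
```

In practice the function would be called once per field pair and structure with the FS maximum likelihood estimates, so that all candidate interactions are screened from a single fitted model.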

The calculation of the score test statistics for the three types of conditional dependence structure is implemented using SAS. The SAS macro for the score test and the related code for fitting the FS model for the real-world linkage example are available at http://huipxu.pages.iu.edu/publications.html.

2.3 Model identifiability

Identifiability of the latent class models is an important issue that affects the interpretation and the validity of the results. It is known that latent class models are not globally identifiable due to the label-switching problem. When considering the issue of local identifiability, that is, whether a unique set of parameter values can generate a given likelihood in an open neighbourhood, the number of parameters in the latent class model needs to be less than or equal to the degrees of freedom $K - 1$. However, this is only a necessary, but not a sufficient, condition for local identifiability. In Goodman (1974), the Jacobian matrix of the partial derivatives of the pattern probabilities with respect to the parameters $\psi = (\alpha, \tau, \lambda, \phi)$,

\[
J(\psi) = \frac{\partial \left( P(y_1), \ldots, P(y_{K-1}) \right)^{\top}}{\partial \psi^{\top}},
\]

was used to evaluate the local identifiability of the latent class model. If $J(\psi)$ has full rank, the latent class model is locally identifiable.

In general, the identifiability of the latent class models is quite complex (Jones et al., 2010; Stanghellini & Vantaggi, 2013). However, with a constant log-linear conditional dependency parameter in (2), the identifiability of the latent class model is greatly simplified. Oberski and Vermunt (2018) showed that the additional parameter for conditional dependence is identifiable in the neighbourhood of the maximum likelihood estimates of the FS model as long as there are positive degrees of freedom. For the situations where conditional dependence exists only in one of the two latent classes, the identifiability of the model parameters can be monitored by evaluating the singularity of the Fisher information matrix. As noted by Jones et al. (2010), the Jacobian matrix has a direct effect on the precision of the maximum likelihood estimates since

\[
I(\psi) = J(\psi)^{\top}\, I_p\, J(\psi),
\]

where $I_p$ is the $(K-1) \times (K-1)$ Fisher information matrix for an unrestricted multinomial model for the data. If the additional parameter for conditional dependence is not identifiable, the Jacobian $J(\psi)$ will be less than full rank, resulting in a singular matrix $I(\psi)$. In other words, the singularity of $I(\psi)$ is indicative that the log-linear latent class model is not identifiable.
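The rank condition can be checked numerically. The sketch below approximates the Jacobian of the pattern probabilities by central finite differences, for the model with the dependency parameter in the match class only, and compares its rank with the number of parameters. The finite-difference step, the rank tolerance, the function names and the parameter values are all illustrative assumptions, and all $K$ pattern probabilities are used rather than $K - 1$, which does not change the rank.

```python
import itertools
import numpy as np

def pattern_probs(psi, J, pair=(0, 1)):
    """P(y_k) for all 2^J patterns; psi = (alpha, tau, lam, phi), phi in the match class only."""
    alpha, tau, lam, phi = psi[0], psi[1:J + 1], psi[J + 1:2 * J + 1], psi[-1]
    Y = np.array(list(itertools.product([0, 1], repeat=J)), dtype=float)
    eta1 = Y @ (tau + lam) + phi * Y[:, pair[0]] * Y[:, pair[1]]
    eta0 = Y @ tau
    p_m = np.exp(eta1) / np.exp(eta1).sum()
    p_u = np.exp(eta0) / np.exp(eta0).sum()
    pm = 1.0 / (1.0 + np.exp(-alpha))
    return pm * p_m + (1 - pm) * p_u

def jacobian(psi, J, h=1e-6):
    """Central finite-difference Jacobian of the pattern probabilities."""
    psi = np.asarray(psi, dtype=float)
    cols = []
    for i in range(len(psi)):
        step = np.zeros_like(psi)
        step[i] = h
        cols.append((pattern_probs(psi + step, J) - pattern_probs(psi - step, J)) / (2 * h))
    return np.column_stack(cols)

def is_locally_identifiable(psi, J, tol=1e-7):
    """True if the Jacobian has full column rank at psi (up to the tolerance)."""
    return np.linalg.matrix_rank(jacobian(psi, J), tol=tol) == len(psi)

# Hypothetical parameter values, evaluated at phi = 0.
psi3 = np.array([-0.8, -2.0, -2.5, -1.5, 4.0, 4.5, 3.5, 0.0])             # J = 3: 8 parameters, K - 1 = 7
psi4 = np.array([-0.8, -2.0, -2.5, -1.5, -2.0, 4.0, 4.5, 3.5, 4.0, 0.0])  # J = 4: 10 parameters, K - 1 = 15
print(is_locally_identifiable(psi3, 3))
print(is_locally_identifiable(psi4, 4))
```

With three fields the model has more parameters than degrees of freedom, so the Jacobian cannot have full column rank, whereas with four fields the necessary condition is met and the rank check can be informative.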

2.4 Evaluation of conditional dependence

It is well recognised that the gain in predictive accuracy from building increasingly more complex models decreases dramatically and simple models can account for over 90% of the achievable predictive power in many situations (Hand, 2006). Simple models, such as the naive Bayes rule for supervised classification, can perform surprisingly well, partly because of the low variance in the probability estimates (Hand & Yu, 2001). Although the simpler models produce biassed probability estimates, the bias may be inconsequential for classification. It is therefore reasonable to accommodate strong conditional dependence among matching fields since ignoring them will lead to large model bias and potentially impair the matching accuracy. Weak conditional dependence will be ignored, which potentially reduces the variance in the probability estimation at the cost of small to negligible bias.

The magnitude of the conditional dependence can be evaluated by the improvement in log-likelihood relative to the FS model as the corresponding log-linear conditional dependence parameter is freed. Since score tests are asymptotically equivalent to likelihood ratio tests, whose test statistics are twice the improvement in log-likelihood (Engle, 1984), this improvement can be approximated by half the score test statistic. We propose to evaluate the strength of the conditional dependence using the approximated McFadden's pseudo-$R^2$ defined as follows:

\[
\text{pseudo-}R^2 = 1 - \frac{l(\hat{\theta}, 0) + S/2}{l(\hat{\theta}, 0)} = -\frac{S}{2\, l(\hat{\theta}, 0)},
\]

where $S$ is the score test statistic in (6) and $l(\hat{\theta}, 0)$ is the log-likelihood of the FS model. Since all three types of conditional dependence structures, including (2), (3) and (4), involve one additional parameter $\phi$, the type with the largest score test statistic is considered. We consider conditional dependence to be sufficiently strong if its pseudo-$R^2$ is greater than a specific threshold. The corresponding interaction term is then included in the log-linear conditional dependence model. Selection of the threshold will be evaluated in the simulation study in Section 3.
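The selection rule can be sketched in a few lines of Python. The code assumes that the pseudo-$R^2$ is half the score test statistic divided by the absolute value of the FS log-likelihood, and that an interaction is retained when this value is at or above the threshold; the function names, the dictionary layout and all numbers are hypothetical.

```python
def pseudo_r2(score_stat, loglik_fs):
    """Approximate McFadden pseudo-R2: half the score statistic over |log-likelihood|."""
    return -0.5 * score_stat / loglik_fs  # loglik_fs is negative

def select_interactions(score_stats, loglik_fs, threshold):
    """Keep, for each field pair, the structure with the largest score statistic
    if its pseudo-R2 reaches the threshold.

    score_stats maps a field pair to {"both": S, "match": S, "nonmatch": S}.
    """
    selected = {}
    for pair, by_type in score_stats.items():
        best_type = max(by_type, key=by_type.get)
        r2 = pseudo_r2(by_type[best_type], loglik_fs)
        if r2 >= threshold:
            selected[pair] = (best_type, r2)
    return selected

# Hypothetical score statistics for two field pairs and a hypothetical FS log-likelihood.
stats = {
    (0, 1): {"both": 21000.0, "match": 30000.0, "nonmatch": 900.0},
    (2, 3): {"both": 4000.0, "match": 1500.0, "nonmatch": 3800.0},
}
print(select_interactions(stats, loglik_fs=-1_000_000.0, threshold=0.005))
```

Lowering the threshold admits weaker interactions, which is the trade-off between model bias and variance discussed above.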

2.5 Missing data

Missing data in record linkage is common and arises for many reasons, resulting in uncertain agreement status for record pairs. Record linkage practitioners often treat missing data as disagreement (Goldstein & Harron, 2015; Ong et al., 2014; Sariyar et al., 2012). This is problematic because it fails to recognise that missing data occur for both disagreeing and agreeing record pairs. This strategy also artificially inflates conditional dependence among fields with missing data, which is concerning due to the common practice in record linkage to parse one field into separate pieces of information. For example, addresses are often parsed into street, city, state and zip code. When address is missing, all parsed fields of the address are missing simultaneously. Treating these missing values as disagreement will result in inflated conditional dependence among these fields. We therefore follow Sadinle (2014, 2017), Enamorado et al. (2019) and Xu et al. (2022) to consider data as missing at random and handle the missing values using the full-information likelihood approach.
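The difference between the two treatments of missing values can be seen in a small example. The sketch below assumes conditional independence and data missing at random, so that a missing comparison simply drops out of the class-conditional likelihood; missing values are coded as `np.nan`, and the function names and parameter values are hypothetical.

```python
import numpy as np

def class_likelihood(y, prob):
    """Likelihood of the observed fields only; np.nan marks a missing comparison."""
    y = np.asarray(y, dtype=float)
    prob = np.asarray(prob, dtype=float)
    obs = ~np.isnan(y)
    return np.prod(prob[obs] ** y[obs] * (1 - prob[obs]) ** (1 - y[obs]))

def posterior_match(y, m, u, p_match):
    """Posterior P(M | observed part of y) under conditional independence."""
    num = p_match * class_likelihood(y, m)
    return num / (num + (1 - p_match) * class_likelihood(y, u))

# Hypothetical m- and u-probabilities for three matching fields.
m, u = [0.85, 0.90, 0.85], [0.05, 0.10, 0.02]
print(posterior_match([1, np.nan, 1], m, u, 0.1))  # second field missing
print(posterior_match([1, 0, 1], m, u, 0.1))       # same field coded as disagreement
```

Coding the missing field as a disagreement lowers the posterior match probability, whereas the full-information treatment leaves it equal to the posterior based on the observed fields alone.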

3 SIMULATION STUDY

In this section, we examine the performance of the proposed method using a simulation study. We use the same simulation setup as in Xu et al. (2019). Four scenarios are considered in the simulation study, all with six matching fields. In Scenarios I, II and III, fields have the same moderate discriminating power with m-probabilities 0.45, 0.25, 0.85, 0.95, 0.98, 0.99 and u-probabilities 0.2, 0.05, 0.2, 0.3, 0.1, 0.05. In Scenario IV, fields have high discriminating power with m-probabilities 0.85, 0.9, 0.85, 0.95, 0.98, 0.99 and u-probabilities 0.05, 0.1, 0.02, 0.05, 0.02, 0.01. The discriminating power of a field is quantified by the ratio of its m- and u-probabilities. Fields with a ratio of 1 have no contribution to the classification of the true match status and fields are better at discriminating matches from non-matches as the ratio increases. In Scenarios I, II and III, the ratios range from 2.25 to 19.8, while in Scenario IV, ratios range from 9 to 99. Scenario IV therefore has better discriminating fields. Conditional dependence exists among the first four fields in both match and non-match classes in Scenarios I and IV, only in match class in Scenario II, and only in non-match class in Scenario III. Specifically, the first four fields share a Gaussian random effect with coefficients 1, 2, 0.5, and 1.5 in the match class and coefficients 0.5, 1, 2, 0.5 in the non-match class. As a result, the conditional dependence in the match class is relatively stronger than in the non-match class for Scenarios I and IV.
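The discriminating power quoted above can be reproduced directly from the m- and u-probabilities. The following Python lines compute the ratios for the two sets of fields; the variable names are chosen for illustration only.

```python
# m- and u-probabilities from the simulation setup.
m_moderate = [0.45, 0.25, 0.85, 0.95, 0.98, 0.99]
u_moderate = [0.2, 0.05, 0.2, 0.3, 0.1, 0.05]
m_high = [0.85, 0.9, 0.85, 0.95, 0.98, 0.99]
u_high = [0.05, 0.1, 0.02, 0.05, 0.02, 0.01]

# Discriminating power of each field: ratio of its m- and u-probabilities.
ratios_moderate = [mj / uj for mj, uj in zip(m_moderate, u_moderate)]
ratios_high = [mj / uj for mj, uj in zip(m_high, u_high)]

print([round(r, 2) for r in ratios_moderate])
print([round(r, 2) for r in ratios_high])
```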

For each scenario, we consider a sample size of one million record pairs. This large sample size is typical in record linkage applications. We also consider nine different values for the true match prevalence: 0.01, 0.05, 0.1, 0.3, 0.5, 0.7, 0.9, 0.95 and 0.99. For each scenario and true match prevalence value, we simulate 100 data sets that include the field agreement patterns of the one million record pairs based on the Gaussian random effects latent class model. For each simulated data set, we fit the FS model, the true model, and three conditional dependence latent class models with interactions selected by the proposed method using different thresholds for pseudo-R2. The true model is the log-linear latent class model with class-varying conditional dependency parameters among the first four fields. Among the three conditional dependence latent class models, Model 1 includes all interactions with a pseudo-R2 at or above 1%, Model 2 includes all interactions with a pseudo-R2 at or above 0.5%, and Model 3 includes all interactions with a pseudo-R2 at or above 0.1%. Selection of these thresholds is informed by the real-world example in Section 4.

We use the Brier score (BS) to evaluate the match performance of different models. In record linkage, measures such as the area under the Receiver Operating Characteristic curve and misclassification errors are often used to evaluate the accuracy of probabilistic linkage algorithms. Such measures, however, are not appropriate because they are not proper scoring rules (Byrne, 2016). The BS, on the other hand, is a strictly proper scoring rule to measure the accuracy of probabilistic predictions in the sense that it is minimised when the predicted probabilities are correct (Gneiting & Raftery, 2007). The BS is the mean squared error of the model predicted probabilities (Brier, 1950) defined as follows:

\[
\mathrm{BS} = \frac{1}{N} \sum_{i=1}^{N} (p_i - o_i)^2,
\]

where $p_i = P(M \mid y_i)$ is the probability that the $i$th record pair is a true match given its agreement vector and $o_i$ is the indicator for the true match status. Models with lower BS have better linkage performance.
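As a small illustration, the Brier score can be computed as follows; the function name and the example values are hypothetical.

```python
import numpy as np

def brier_score(p, o):
    """Mean squared difference between predicted match probabilities p
    and true match indicators o (1 = match, 0 = non-match)."""
    p = np.asarray(p, dtype=float)
    o = np.asarray(o, dtype=float)
    return float(np.mean((p - o) ** 2))

# Perfect predictions score 0; uninformative predictions of 0.5 score 0.25.
print(brier_score([1.0, 0.0, 1.0], [1, 0, 1]))
print(brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]))
```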

Results of the simulation study are summarised in Table 1 for Scenarios I, II and III with moderately discriminating fields and Table 2 for Scenario IV with highly discriminating fields. The difference in BS between each model relative to the true model is calculated for each simulated data set and the mean and SD across the 100 simulated data sets are presented. For the three conditional dependence latent class models, we also present the median number of interactions selected across the simulated data sets using the proposed method. Since the true model contains six pairwise interactions of the first four matching fields, we also present the median number of interactions selected correctly.

TABLE 1

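Mean and SD of the difference in Brier score (BS) for the Fellegi–Sunter (FS) model and conditional dependence latent class models (Model 1: pseudo-R2 ≥ 1%; Model 2: pseudo-R2 ≥ 0.5%; Model 3: pseudo-R2 ≥ 0.1%) relative to the true model, and median number of interactions selected, in Scenarios I, II and III

| True match prevalence | Model | Scenario I BS mean | Scenario I BS SD | Scenario I interactions total | Scenario I interactions correct | Scenario II BS mean | Scenario II BS SD | Scenario II interactions total | Scenario II interactions correct | Scenario III BS mean | Scenario III BS SD | Scenario III interactions total | Scenario III interactions correct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.01 | FS | 0.1269 | 0.00100 | | | 0.0002 | 0.00002 | | | 0.1298 | 0.00100 | | |
| 0.01 | Model 1 | 0.1269 | 0.00100 | 0 | 0 | 0.0002 | 0.00002 | 0 | 0 | 0.1298 | 0.00100 | 0 | 0 |
| 0.01 | Model 2 | 0.1261 | 0.00142 | 0 | 0 | 0.0002 | 0.00002 | 0 | 0 | 0.1276 | 0.00095 | 1 | 0 |
| 0.01 | Model 3 | 0.1253 | 0.00097 | 1 | 0 | 0.0002 | 0.00002 | 0 | 0 | 0.1276 | 0.00095 | 1 | 0 |
| 0.05 | FS | 0.0686 | 0.00076 | | | 0.0006 | 0.00002 | | | 0.0713 | 0.00075 | | |
| 0.05 | Model 1 | 0.0939 | 0.00068 | 1 | 0 | 0.0006 | 0.00002 | 0 | 0 | 0.0967 | 0.00066 | 1 | 0 |
| 0.05 | Model 2 | 0.0939 | 0.00068 | 1 | 0 | 0.0006 | 0.00002 | 0 | 0 | 0.0967 | 0.00066 | 1 | 0 |
| 0.05 | Model 3 | 0.0677 | 0.00661 | 5 | 2 | 0.0004 | 0.00002 | 1 | 1 | 0.0719 | 0.00342 | 5 | 2 |
| 0.1 | FS | 0.0118 | 0.00345 | | | 0.0009 | 0.00003 | | | 0.0085 | 0.00252 | | |
| 0.1 | Model 1 | 0.0573 | 0.00725 | 2 | 1 | 0.0009 | 0.00003 | 0 | 0 | 0.0577 | 0.00388 | 2 | 1 |
| 0.1 | Model 2 | 0.0334 | 0.00082 | 4 | 3 | 0.0009 | 0.00003 | 0 | 0 | 0.0476 | 0.00122 | 4 | 3 |
| 0.1 | Model 3 | 0.0351 | 0.03873 | 7 | 5 | 0.0007 | 0.00003 | 1 | 1 | 0.0110 | 0.00975 | 7 | 6 |
| 0.3 | FS | 0.0043 | 0.00007 | | | 0.0023 | 0.00005 | | | 0.0025 | 0.00007 | | |
| 0.3 | Model 1 | 0.0427 | 0.00089 | 2 | 1 | 0.0023 | 0.00005 | 0 | 0 | 0.0427 | 0.00282 | 2 | 1 |
| 0.3 | Model 2 | 0.0137 | 0.00056 | 5 | 4 | 0.0018 | 0.00005 | 1 | 1 | 0.0357 | 0.00095 | 4 | 3 |
| 0.3 | Model 3 | 0.0003 | 0.00002 | 7 | 6 | 0.0031 | 0.00019 | 4 | 3 | 0.0038 | 0.00288 | 7 | 6 |
| 0.5 | FS | 0.0033 | 0.00006 | | | 0.0025 | 0.00005 | | | 0.0013 | 0.00004 | | |
| 0.5 | Model 1 | 0.0328 | 0.00023 | 2 | 1 | 0.0017 | 0.00004 | 1 | 1 | 0.0334 | 0.00024 | 1 | 0 |
| 0.5 | Model 2 | 0.0181 | 0.01162 | 4 | 3 | 0.0017 | 0.00004 | 1 | 1 | 0.0288 | 0.00023 | 2 | 1 |
| 0.5 | Model 3 | 0.0002 | 0.00002 | 7 | 6 | 0.0030 | 0.00010 | 6 | 5 | 0.0194 | 0.00116 | 6 | 5 |
| 0.7 | FS | 0.0021 | 0.00005 | | | 0.0019 | 0.00006 | | | 0.0005 | 0.00002 | | |
| 0.7 | Model 1 | 0.0197 | 0.00019 | 2 | 1 | 0.0011 | 0.00004 | 1 | 1 | 0.0193 | 0.00020 | 1 | 0 |
| 0.7 | Model 2 | 0.0158 | 0.00038 | 4 | 3 | 0.0160 | 0.00024 | 2 | 1 | 0.0193 | 0.00020 | 1 | 0 |
| 0.7 | Model 3 | 0.0008 | 0.00011 | 7 | 6 | 0.0000 | 0.00000 | 7 | 6 | 0.0146 | 0.00062 | 4 | 3 |
| 0.9 | FS | 0.0014 | 0.00007 | | | 0.0012 | 0.00007 | | | 0.0002 | 0.00001 | | |
| 0.9 | Model 1 | 0.0245 | 0.00040 | 2 | 1 | 0.0458 | 0.00114 | 2 | 1 | 0.0002 | 0.00001 | 0 | 0 |
| 0.9 | Model 2 | 0.0129 | 0.00018 | 5 | 4 | 0.0158 | 0.00022 | 3 | 2 | 0.0015 | 0.00253 | 0 | 0 |
| 0.9 | Model 3 | 0.0002 | 0.00006 | 7 | 6 | 0.0000 | 0.00000 | 7 | 6 | 0.0055 | 0.00040 | 2 | 1 |
| 0.95 | FS | 0.0077 | 0.00050 | | | 0.0067 | 0.00061 | | | 0.0002 | 0.00001 | | |
| 0.95 | Model 1 | 0.0593 | 0.00060 | 2 | 1 | 0.0616 | 0.00069 | 2 | 1 | 0.0002 | 0.00001 | 0 | 0 |
| 0.95 | Model 2 | 0.0130 | 0.00081 | 4 | 3 | 0.0202 | 0.00039 | 3 | 2 | 0.0002 | 0.00001 | 0 | 0 |
| 0.95 | Model 3 | 0.0001 | 0.00003 | 7 | 6 | 0.0032 | 0.00331 | 7 | 6 | 0.0030 | 0.00007 | 1 | 0 |
| 0.99 | FS | 0.3975 | 0.00142 | | | 0.4009 | 0.00140 | | | 0.0001 | 0.00001 | | |
| 0.99 | Model 1 | 0.3989 | 0.00139 | 1 | 0 | 0.4012 | 0.00137 | 1 | 0 | 0.0001 | 0.00001 | 0 | 0 |
| 0.99 | Model 2 | 0.1289 | 0.00503 | 3 | 1 | 0.1232 | 0.00950 | 3 | 1 | 0.0001 | 0.00001 | 0 | 0 |
| 0.99 | Model 3 | 0.1401 | 0.00283 | 8 | 2 | 0.1330 | 0.00481 | 10 | 1 | 0.0001 | 0.00001 | 0 | 0 |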
TABLE 2

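Mean and SD of the difference in Brier score (BS) for the Fellegi–Sunter (FS) model and conditional dependence latent class models (Model 1: pseudo-R2 ≥ 1%; Model 2: pseudo-R2 ≥ 0.5%; Model 3: pseudo-R2 ≥ 0.1%) relative to the true model, and median number of interactions selected, in Scenario IV

| Match prevalence | Model | BS mean | BS SD | Interactions selected: total | Interactions selected: correct |
|---|---|---|---|---|---|
| 0.01 | FS | 0.0062 | 0.00017 | | |
| 0.01 | Model 1 | 0.0008 | 0.00005 | 1 | 1 |
| 0.01 | Model 2 | 0.0048 | 0.00014 | 2 | 1 |
| 0.01 | Model 3 | 0.0091 | 0.00078 | 8 | 3 |
| 0.05 | FS | 0.0012 | 0.00004 | | |
| 0.05 | Model 1 | 0.0011 | 0.00097 | 1 | 1 |
| 0.05 | Model 2 | 0.0013 | 0.00004 | 4 | 3 |
| 0.05 | Model 3 | 0.0003 | 0.00002 | 7 | 6 |
| 0.10 | FS | 0.0016 | 0.00004 | | |
| 0.10 | Model 1 | 0.0011 | 0.00003 | 1 | 1 |
| 0.10 | Model 2 | 0.0013 | 0.00006 | 4 | 3 |
| 0.10 | Model 3 | 0.0003 | 0.00012 | 7 | 6 |
| 0.30 | FS | 0.0029 | 0.00006 | | |
| 0.30 | Model 1 | 0.0070 | 0.00011 | 2 | 1 |
| 0.30 | Model 2 | 0.0040 | 0.00008 | 4 | 3 |
| 0.30 | Model 3 | 0.0016 | 0.00007 | 7 | 6 |
| 0.50 | FS | 0.0042 | 0.00007 | | |
| 0.50 | Model 1 | 0.0060 | 0.00051 | 3 | 2 |
| 0.50 | Model 2 | 0.0046 | 0.00019 | 5 | 4 |
| 0.50 | Model 3 | 0.0002 | 0.00012 | 7 | 6 |
| 0.70 | FS | 0.0058 | 0.00009 | | |
| 0.70 | Model 1 | 0.0071 | 0.00010 | 3 | 2 |
| 0.70 | Model 2 | 0.0027 | 0.00169 | 5 | 4 |
| 0.70 | Model 3 | 0.0003 | 0.00011 | 6 | 5 |
| 0.90 | FS | 0.0089 | 0.00012 | | |
| 0.90 | Model 1 | 0.0091 | 0.00013 | 3 | 2 |
| 0.90 | Model 2 | 0.0082 | 0.00018 | 4 | 3 |
| 0.90 | Model 3 | 0.0008 | 0.00010 | 6 | 5 |
| 0.95 | FS | 0.0255 | 0.00035 | | |
| 0.95 | Model 1 | 0.0278 | 0.00029 | 2 | 1 |
| 0.95 | Model 2 | 0.0182 | 0.00026 | 5 | 2 |
| 0.95 | Model 3 | 0.0105 | 0.00985 | 10 | 5 |
| 0.99 | FS | 0.0656 | 0.00043 | | |
| 0.99 | Model 1 | 0.0683 | 0.00041 | 1 | 0 |
| 0.99 | Model 2 | 0.0683 | 0.00041 | 1 | 0 |
| 0.99 | Model 3 | 0.0622 | 0.00085 | 7 | 2 |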

In Scenario I, with conditional dependence in both the match and non-match classes, the FS model performs reasonably well when the true match prevalence is between 0.3 and 0.95: on average, its BS is only slightly higher than that of the true model. Among the three conditional dependence latent class models, match performance is worst for Model 1, which includes only interactions with pseudo-R2 > 1%, but improves substantially as more interactions are selected. When interactions with pseudo-R2 > 0.1% are selected, Model 3 performs better than the FS model and is only negligibly worse than the true model. The performance of the conditional dependence latent class models can be explained by the conditional dependence structure that is identified. Model 1 generally selects two interactions, only one of which is correct. Model 2 identifies four to five interactions, of which three to four are important. Both models therefore identify an incorrect conditional dependence structure. Model 3, on the other hand, selects all six important interactions and therefore identifies the correct conditional dependence structure. When the true match prevalence approaches zero or one (≤0.1 or 0.99), the FS model performs poorly, with a much larger BS than the true model. The score test cannot identify the correct conditional dependence structure regardless of the threshold used for the pseudo-R2. Consequently, all conditional dependence latent class models perform worse than the true model, although Model 3 usually performs comparably to or better than the FS model.

In Scenario II, where the conditional dependence exists only in the match class, all models have similar match performance when the true match prevalence is lower than 0.5. As the true match prevalence increases, the effect of the conditional dependence becomes more prominent, as shown by the smaller BS of Model 3 relative to the FS model, although the FS model generally performs well. Again, Model 3 identifies the correct conditional dependence structure except when the true match prevalence reaches 0.99. At this extremely large match prevalence, all models perform poorly, with the FS model and Model 1 performing worst: on average, their BS is about 0.4 greater than that of the true model. Although Model 2 and Model 3 do not perform satisfactorily, they do better, with a smaller BS. The poor performance of the models can be explained by the incorrect conditional dependence structure identified.

Findings in Scenario III, where conditional dependence exists only in the non-match class, are similar to those in Scenario II. All models perform similarly well when the true match prevalence is large. As the true match prevalence falls to 0.05 or 0.01, with the vast majority of record pairs being non-matches, ignoring the conditional dependence results in a poorly performing FS model. However, because of the imbalance between the match and non-match classes, identifying the correct conditional dependence structure is challenging. None of the six interactions is correctly identified in the conditional dependence latent class models. Consequently, these models also perform poorly.

In Scenario IV, with highly discriminating fields and conditional dependence in both the match and non-match classes, the results are similar to those in Scenario I, except that the BS is consistently smaller. This is expected because the matching fields have greater discriminating power. Using Model 3, the conditional dependence structure is identified correctly even when the match prevalence is as low as 0.05. When the true match prevalence is 0.01, Model 3 selects eight interactions, of which three are correct. This shows that the proposed score test is more powerful when highly discriminating fields are used for linkage, since only one incorrect interaction is identified in Scenario I. In addition, all models perform similarly well when the true match prevalence is small. On the other hand, when the true match prevalence is extremely large (0.95 or 0.99), the performance of the FS model suffers, while incorporating the conditional dependence identified by the proposed method yields comparable or better performance.

4 NEWBORN SCREENING DATA DEDUPLICATION

We now apply the proposed method to a real-world linkage example to identify conditional dependence and evaluate the performance of the conditional dependence latent class model relative to the FS model. In this application, a total of 765,814 Health Level 7 (HL7) messages sent to the state public health programme for newborn screening (NBS) results of patients less than 2 months of age in 2017 were extracted from a local Health Information Exchange (HIE) and deduplicated. Fields available for linkage include medical record number (MRN); patient's first name (FN), middle initial (MI) and last name (LN); sex; telephone number (TEL); street address (ADR); zip code (ZIP); date of birth (DOB); and next of kin's first name (NKFN) and last name (NKLN). With more than 300 billion pairs of records to form and compare, we applied five blocking schemes to reduce the number of comparisons: MRN, LN and FN (LN-FN), DOB and ZIP (DOB-ZIP), NKLN and NKFN (NKLN-NKFN), and TEL. These five blocking schemes contained 9.6 million record pairs, from which a random sample of 15,000 record pairs was selected and manually reviewed. A total of 7967 (53.1%) record pairs in the manual review sample were found to be true matches.
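
The blocking step can be illustrated with a minimal sketch. It is not the authors' code: the record layout (dictionaries keyed by the field abbreviations above), the helper `candidate_pairs`, the rule of skipping records with a missing blocking key, and the toy records are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

# The five blocking schemes named in the text: fields that must agree exactly.
BLOCKING_SCHEMES = {
    "MRN": ("MRN",),
    "LN-FN": ("LN", "FN"),
    "DOB-ZIP": ("DOB", "ZIP"),
    "NKLN-NKFN": ("NKLN", "NKFN"),
    "TEL": ("TEL",),
}

def candidate_pairs(records, fields):
    """Return index pairs (i, j), i < j, whose records agree on all blocking fields."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        key = tuple(rec.get(f) for f in fields)
        # Assumption: records with a missing blocking value are not blocked on it.
        if all(v not in (None, "") for v in key):
            blocks[key].append(idx)
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(members, 2))
    return pairs

# Hypothetical toy records (not from the newborn screening data).
records = [
    {"MRN": "A1", "LN": "SMITH", "FN": "ANN", "DOB": "2017-01-02",
     "ZIP": "46202", "TEL": "5550001", "NKLN": "SMITH", "NKFN": "JO"},
    {"MRN": "A1", "LN": "SMYTH", "FN": "ANN", "DOB": "2017-01-02",
     "ZIP": "46202", "TEL": "", "NKLN": "SMYTH", "NKFN": "JO"},
    {"MRN": "B7", "LN": "SMITH", "FN": "ANN", "DOB": "2017-03-09",
     "ZIP": "46077", "TEL": "5550001", "NKLN": "SMITH", "NKFN": "JO"},
]

pairs_by_scheme = {name: candidate_pairs(records, fields)
                   for name, fields in BLOCKING_SCHEMES.items()}
# A pair is compared if it falls in at least one blocking scheme.
all_pairs = set().union(*pairs_by_scheme.values())
```

In this toy example the pair (1, 2) agrees on no blocking key, so it is never formed, which is how blocking reduces the number of comparisons.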

The proposed method is applied to the five blocking schemes separately. As in the simulation study, three conditional dependence latent class models with increasing complexity are considered in addition to the FS model. Thresholds for the pseudo-R2 are 1% for Model 1, 0.5% for Model 2, and 0.1% for Model 3. The BS is calculated based on the manual review sample to evaluate the match performance. Since the goal of record linkage is to classify record pairs as matches or non-matches, we also provide the F-score for each model. The F-score is the harmonic mean of sensitivity and positive predictive value (PPV) and therefore requires dichotomising the model-predicted probabilities $\hat{p}_i$. The threshold for dichotomisation is selected so that the proportion of record pairs with model-predicted probabilities above the threshold is equal to the model-estimated match prevalence.
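
The two evaluation steps, the BS and the prevalence-based dichotomisation, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the BS is taken as the usual mean squared difference between predicted probability and 0/1 match status, ties in predicted probability are broken by input order, and the five probabilities are made up.

```python
def brier_score(probs, truth):
    """Mean squared difference between predicted probability and 0/1 match status."""
    return sum((p - y) ** 2 for p, y in zip(probs, truth)) / len(probs)

def classify_by_prevalence(probs, prevalence):
    """Flag the highest-probability pairs as matches so that the flagged
    share equals the (model-estimated) match prevalence."""
    n_match = round(prevalence * len(probs))
    # Sort indices by decreasing probability; ties keep input order (an assumption).
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    flagged = set(order[:n_match])
    return [1 if i in flagged else 0 for i in range(len(probs))]

# Hypothetical predicted probabilities and manually reviewed match status.
probs = [0.95, 0.80, 0.60, 0.30, 0.10]
truth = [1, 1, 0, 1, 0]
estimated_prevalence = 0.6  # assumed model estimate for this toy example

bs = brier_score(probs, truth)
pred = classify_by_prevalence(probs, estimated_prevalence)
```

With an estimated prevalence of 0.6 and five pairs, the three highest-probability pairs are classified as matches; the resulting 0/1 classifications are what the sensitivity, PPV and F-score are computed from.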

4.1 The MRN blocking scheme

The MRN blocking scheme captures roughly 4.2 million record pairs. Seven matching fields are used in the latent class models: patient's LN, FN, MI, sex, TEL, ADR and ZIP. Date of birth is not used due to its high agreement rate of 99.7%. For each of the 21 pairs of matching fields, score test statistics are calculated for the three types of conditional dependence structure (Table 3). Conditional dependence between ADR and ZIP is the strongest, giving the largest improvement in model fit, with a pseudo-R2 of 2.53% when allowing a constant conditional dependency parameter. This is followed by the conditional dependence between LN and MI, which has a pseudo-R2 of 1.39% when incorporating a conditional dependency parameter in the match class only. These two interactions are included in Model 1. In Model 2, conditional dependence with a pseudo-R2 of 0.5% or above is considered, so two additional parameters are included: the constant conditional dependency parameters for MI and ADR and for MI and ZIP. Model 3 includes conditional dependence with a pseudo-R2 of 0.1% or above and is therefore the most complicated model, with 12 conditional dependency parameters.
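
The selection rule described in this paragraph can be sketched with the first 13 rows of Table 3. The sketch is illustrative only: the function `select_interactions` is a hypothetical name, the structure with the largest score statistic is chosen for each pair, and the thresholds are applied to the rounded pseudo-R2 values as published, so borderline pairs could differ from a selection based on unrounded values.

```python
# (field 1, field 2, constant, non-match only, match only, pseudo-R2 in %),
# copied from the first 13 rows of Table 3.
SCORE_TESTS = [
    ("ADR", "ZIP", 213890.2, 212906.5, 183954.8, 2.53),
    ("LN", "MI", 98555.9, 78449.0, 117313.5, 1.39),
    ("MI", "ADR", 46212.1, 45591.7, 27248.9, 0.55),
    ("MI", "ZIP", 42652.3, 42060.1, 25456.7, 0.50),
    ("LN", "ZIP", 41103.2, 40399.7, 17641.7, 0.49),
    ("LN", "ADR", 39428.5, 38686.2, 18036.9, 0.47),
    ("LN", "FN", 16546.0, 15678.4, 8932.9, 0.20),
    ("FN", "MI", 12568.8, 12365.7, 5110.7, 0.15),
    ("TEL", "ADR", 10189.3, 10803.7, 57.6, 0.13),
    ("FN", "ZIP", 9778.1, 9685.3, 7392.1, 0.12),
    ("FN", "ADR", 9198.4, 9109.9, 7215.6, 0.11),
    ("TEL", "ZIP", 7399.3, 8419.8, 3811.5, 0.10),
    ("LN", "TEL", 7230.3, 6228.0, 3700.6, 0.09),
]
STRUCTURES = ("constant", "non-match only", "match only")

def select_interactions(rows, threshold_pct):
    """Keep pairs whose pseudo-R2 meets the threshold, each with the
    dependence structure that has the largest score test statistic."""
    selected = []
    for f1, f2, const, nonmatch, match, pseudo_r2 in rows:
        if pseudo_r2 >= threshold_pct:
            stats = (const, nonmatch, match)
            selected.append((f1, f2, STRUCTURES[stats.index(max(stats))]))
    return selected

model_1 = select_interactions(SCORE_TESTS, 1.0)
model_2 = select_interactions(SCORE_TESTS, 0.5)
model_3 = select_interactions(SCORE_TESTS, 0.1)
```

Applied to these rows, the rule gives two interactions for Model 1 (ADR-ZIP with a constant parameter and LN-MI in the match class only), four for Model 2 and twelve for Model 3, in line with the counts stated above.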

TABLE 3

Score test statistics for the medical record number (MRN) blocking scheme of the newborn screening data deduplication

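| Pair number | Field 1 | Field 2 | Score test statistic: constant dependency | Score test statistic: non-match only | Score test statistic: match only | Maximum statistic | Pseudo-R2 |
|---|---|---|---|---|---|---|---|
| 1 | ADR | ZIP | 213,890.2 | 212,906.5 | 183,954.8 | 213,890.2 | 2.53% |
| 2 | LN | MI | 98,555.9 | 78,449.0 | 117,313.5 | 117,313.5 | 1.39% |
| 3 | MI | ADR | 46,212.1 | 45,591.7 | 27,248.9 | 46,212.1 | 0.55% |
| 4 | MI | ZIP | 42,652.3 | 42,060.1 | 25,456.7 | 42,652.3 | 0.50% |
| 5 | LN | ZIP | 41,103.2 | 40,399.7 | 17,641.7 | 41,103.2 | 0.49% |
| 6 | LN | ADR | 39,428.5 | 38,686.2 | 18,036.9 | 39,428.5 | 0.47% |
| 7 | LN | FN | 16,546.0 | 15,678.4 | 8932.9 | 16,546.0 | 0.20% |
| 8 | FN | MI | 12,568.8 | 12,365.7 | 5110.7 | 12,568.8 | 0.15% |
| 9 | TEL | ADR | 10,189.3 | 10,803.7 | 57.6 | 10,803.7 | 0.13% |
| 10 | FN | ZIP | 9778.1 | 9685.3 | 7392.1 | 9778.1 | 0.12% |
| 11 | FN | ADR | 9198.4 | 9109.9 | 7215.6 | 9198.4 | 0.11% |
| 12 | TEL | ZIP | 7399.3 | 8419.8 | 3811.5 | 8419.8 | 0.10% |
| 13 | LN | TEL | 7230.3 | 6228.0 | 3700.6 | 7230.3 | 0.09% |
| 14 | FN | TEL | 1172.5 | 988.8 | 1471.2 | 1471.2 | 0.02% |
| 15 | MI | TEL | 1211.1 | 1001.0 | 861.5 | 1211.1 | 0.01% |
| 16 | FN | SEX | 1198.0 | 1154.6 | 475.7 | 1198.0 | 0.01% |
| 17 | SEX | ZIP | 9.7 | 12.6 | 98.5 | 98.5 | 0.00% |
| 18 | SEX | ADR | 9.7 | 12.6 | 96.5 | 96.5 | 0.00% |
| 19 | SEX | MI | 86.6 | 89.7 | 7.7 | 89.7 | 0.00% |
| 20 | LN | SEX | 28.6 | 64.3 | 40.4 | 64.3 | 0.00% |
| 21 | SEX | TEL | 0.0 | 0.2 | 3.1 | 3.1 | 0.00% |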

The match performance of the models is shown in Table 4. Of the 15,000 randomly selected and reviewed record pairs, 6487 pairs are in the MRN blocking scheme and 6000 (92.5%) are true matches. The BS is 0.054 for Model 1, 0.056 for Model 2, and 0.080 for Model 3, all smaller than the BS of 0.086 for the FS model. This shows that the conditional dependence models perform better than the FS model. Dichotomising the estimated match probabilities using the estimated match prevalence, the F-scores show a similar pattern. All conditional dependence models achieve better F-scores than the FS model, with greater than 2% improvement for Model 1 and Model 2 and 1% improvement for Model 3.
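
As a check on how the summary measures relate to the confusion counts, the sensitivity, PPV and F-score for the MRN blocking scheme can be recomputed from the true positive, false negative and false positive counts reported in Table 4. The helper name `match_metrics` is hypothetical; the formulas are the standard definitions, with the F-score as the harmonic mean of sensitivity and PPV.

```python
def match_metrics(tp, fn, fp):
    """Sensitivity, positive predictive value and F-score from confusion counts."""
    sensitivity = tp / (tp + fn)
    ppv = tp / (tp + fp)
    f_score = 2 * sensitivity * ppv / (sensitivity + ppv)
    return sensitivity, ppv, f_score

# (true positive, false negative, false positive) for the MRN blocking scheme, Table 4.
mrn_counts = {
    "FS": (5248, 752, 21),
    "Model 1": (5524, 476, 30),
    "Model 2": (5521, 479, 34),
    "Model 3": (5357, 643, 27),
}

mrn_metrics = {name: match_metrics(*counts) for name, counts in mrn_counts.items()}
for name, (sens, ppv, f) in mrn_metrics.items():
    print(f"{name}: sensitivity={sens:.3f}, PPV={ppv:.3f}, F-score={f:.3f}")
```

Rounded to three decimals, these reproduce the sensitivity, PPV and F-score values reported for the MRN blocking scheme, for example an F-score of 0.931 for the FS model and 0.956 for Model 1.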

TABLE 4

Match performance of the latent class models for the newborn screening data deduplication

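| Blocking scheme | Model | Brier score | True negative | True positive | False negative | False positive | Sensitivity | Positive predictive value | F-score |
|---|---|---|---|---|---|---|---|---|---|
| MRN | FS | 0.086 | 466 | 5248 | 752 | 21 | 0.875 | 0.996 | 0.931 |
| MRN | Model 1 | 0.054 | 457 | 5524 | 476 | 30 | 0.921 | 0.995 | 0.956 |
| MRN | Model 2 | 0.056 | 453 | 5521 | 479 | 34 | 0.920 | 0.994 | 0.956 |
| MRN | Model 3 | 0.080 | 460 | 5357 | 643 | 27 | 0.893 | 0.995 | 0.941 |
| LN-FN | FS | 0.149 | 108 | 3732 | 860 | 0 | 0.813 | 1.000 | 0.897 |
| LN-FN | Model 1 | 0.128 | 108 | 3823 | 769 | 0 | 0.833 | 1.000 | 0.909 |
| LN-FN | Model 2 | 0.121 | 108 | 3861 | 731 | 0 | 0.841 | 1.000 | 0.914 |
| LN-FN | Model 3 | 0.139 | 107 | 3708 | 884 | 1 | 0.807 | 1.000 | 0.893 |
| DOB-ZIP | FS | 0.071 | 5711 | 5889 | 249 | 794 | 0.959 | 0.881 | 0.919 |
| DOB-ZIP | Model 1 | 0.062 | 5777 | 5996 | 142 | 728 | 0.977 | 0.892 | 0.932 |
| DOB-ZIP | Model 2 | 0.068 | 5651 | 6081 | 57 | 854 | 0.991 | 0.877 | 0.930 |
| DOB-ZIP | Model 3 | 0.072 | 5587 | 6090 | 48 | 918 | 0.992 | 0.869 | 0.927 |
| TEL | FS | 0.138 | 440 | 2960 | 515 | 223 | 0.852 | 0.930 | 0.889 |
| TEL | Model 1 | 0.134 | 313 | 3143 | 332 | 350 | 0.904 | 0.900 | 0.902 |
| TEL | Model 2 | 0.134 | 313 | 3143 | 332 | 350 | 0.904 | 0.900 | 0.902 |
| TEL | Model 3 | 0.145 | 310 | 3121 | 354 | 353 | 0.898 | 0.898 | 0.898 |
| NKLN-NKFN | FS | 0.153 | 167 | 1431 | 26 | 280 | 0.982 | 0.836 | 0.903 |
| NKLN-NKFN | Model 1 | 0.159 | 143 | 1454 | 3 | 304 | 0.998 | 0.827 | 0.905 |
| NKLN-NKFN | Model 2 | 0.160 | 140 | 1455 | 2 | 307 | 0.999 | 0.826 | 0.904 |
| NKLN-NKFN | Model 3 | 0.114 | 248 | 1419 | 38 | 199 | 0.974 | 0.877 | 0.923 |
| All combined | FS | 0.100 | 6168 | 7010 | 957 | 865 | 0.880 | 0.890 | 0.885 |
| All combined | Model 1 | 0.085 | 6167 | 7266 | 701 | 866 | 0.912 | 0.894 | 0.903 |
| All combined | Model 2 | 0.089 | 6062 | 7343 | 624 | 971 | 0.922 | 0.883 | 0.902 |
| All combined | Model 3 | 0.096 | 6019 | 7259 | 708 | 1014 | 0.911 | 0.877 | 0.894 |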

4.2 The LN-FN blocking scheme

The LN-FN blocking scheme contains roughly 3 million record pairs. Six matching fields are used: MRN, MI, DOB, TEL, ADR and ZIP. Sex is not used as a matching field in the models due to its high agreement rate of 99.5%. Table 5 shows the score test statistics for each of the 15 pairs of matching fields. ADR and ZIP again show the strongest conditional dependence, with a pseudo-R2 of 1.16%. Model 1 therefore includes one conditional dependency parameter, for the interaction between these two fields in the match class only. Two additional interactions are included in Model 2: one between DOB and ZIP in the non-match class only and the other between MRN and MI with a constant conditional dependency parameter. Both interactions have a pseudo-R2 above 0.5%. Model 3 includes the 10 interactions whose pseudo-R2 is 0.1% or above.

TABLE 5

Score test statistics for the last name and first name (LN-FN) blocking scheme of the newborn screening data deduplication

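| Pair number | Field 1 | Field 2 | Score test statistic: constant dependency | Score test statistic: non-match only | Score test statistic: match only | Maximum statistic | Pseudo-R2 |
|---|---|---|---|---|---|---|---|
| 1 | ADR | ZIP | 6655.6 | 4937.7 | 91,699.4 | 91,699.4 | 1.16% |
| 2 | DOB | ZIP | 70,376.4 | 70,377.4 | 468.9 | 70,377.4 | 0.89% |
| 3 | MRN | MI | 43,615.1 | 37,516.7 | 14,816.6 | 43,615.1 | 0.55% |
| 4 | DOB | TEL | 32,616.9 | 32,618.2 | 230.2 | 32,618.2 | 0.41% |
| 5 | DOB | ADR | 20,311.7 | 20,312.2 | 1748.9 | 20,312.2 | 0.26% |
| 6 | MRN | ZIP | 16,383.4 | 18,327.0 | 156.9 | 18,327.0 | 0.23% |
| 7 | MI | DOB | 17,258.3 | 17,258.6 | 90.9 | 17,258.6 | 0.22% |
| 8 | MRN | DOB | 10,463.1 | 10,464.9 | 2753.3 | 10,464.9 | 0.13% |
| 9 | MRN | ADR | 2108.3 | 4003.5 | 10,006.0 | 10,006.0 | 0.13% |
| 10 | MRN | TEL | 7617.4 | 7172.4 | 1002.3 | 7617.4 | 0.10% |
| 11 | MI | ADR | 3604.9 | 3385.9 | 3427.4 | 3604.9 | 0.05% |
| 12 | TEL | ZIP | 2988.7 | 3419.4 | 670.6 | 3419.4 | 0.04% |
| 13 | TEL | ADR | 1002.2 | 1108.5 | 135.1 | 1108.5 | 0.01% |
| 14 | MI | TEL | 836.4 | 1075.8 | 1006.7 | 1075.8 | 0.01% |
| 15 | MI | ZIP | 49.7 | 49.9 | 0.2 | 49.9 | 0.00% |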

A total of 4700 record pairs in the manual review sample are in the LN-FN blocking scheme, of which 4592 (97.7%) are true matches. The BS is 0.128 for Model 1, 0.121 for Model 2 and 0.139 for Model 3. As in the MRN blocking scheme, all conditional dependence models perform better than the FS model, which has a BS of 0.149. Dichotomising the match probabilities using the estimated match prevalence, Model 2 has the greatest F-score with a 1.7% improvement compared to the FS model.

4.3 Other blocking schemes

In the other three blocking schemes, there are approximately 8 million (DOB-ZIP), 2.6 million (TEL) and 1.2 million (NKLN-NKFN) record pairs. The score test statistics are shown in the Appendix (Tables A1, A2 and A3). In the DOB-ZIP blocking scheme, the BS is 0.062 for Model 1, 0.068 for Model 2 and 0.072 for Model 3, showing comparable or better performance than the FS model (BS = 0.071). In the TEL blocking scheme, Model 1 and Model 2 include the same interaction terms and achieve a BS of 0.134, which is better than that of the FS model. Model 3 has a BS of 0.145, which is slightly larger than that of the FS model. In the NKLN-NKFN blocking scheme, the BS of Model 1 and Model 2 is larger than that of the FS model. Model 3, however, produces a BS of 0.114, smaller than the BS of 0.153 for the FS model.

4.4 Overall results

In record linkage, results across blocking schemes are usually combined by classifying a record pair as a match if it is determined to be a match in at least one blocking scheme. We use the same decision rule when evaluating the overall F-score. For the calculation of the overall BS, we use the maximum estimated match probability for any record pair that falls in multiple blocking schemes. The results are shown in Table 4. The BS is 0.085 for Model 1, 0.089 for Model 2, and 0.096 for Model 3. Compared with the FS model, which has a BS of 0.100, accommodating the conditional dependence improves match performance. Similar results can be seen in the F-scores.
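
The combination rule can be sketched as follows. This is a minimal illustration, assuming each blocking scheme yields a dictionary from record pair to estimated match probability and another from record pair to 0/1 classification; the function name, data layout and numbers are hypothetical.

```python
def combine_schemes(probs_by_scheme, decisions_by_scheme):
    """Combine per-scheme results: maximum probability per pair for the overall BS,
    and 'match in at least one scheme' for the overall classification."""
    overall_prob, overall_match = {}, {}
    for probs in probs_by_scheme.values():
        for pair, p in probs.items():
            overall_prob[pair] = max(p, overall_prob.get(pair, 0.0))
    for decisions in decisions_by_scheme.values():
        for pair, d in decisions.items():
            overall_match[pair] = max(d, overall_match.get(pair, 0))
    return overall_prob, overall_match

# Hypothetical per-scheme outputs for three record pairs.
probs_by_scheme = {
    "MRN": {(0, 1): 0.97, (2, 3): 0.40},
    "TEL": {(0, 1): 0.85, (4, 5): 0.10},
    "DOB-ZIP": {(2, 3): 0.65},
}
decisions_by_scheme = {
    "MRN": {(0, 1): 1, (2, 3): 0},
    "TEL": {(0, 1): 1, (4, 5): 0},
    "DOB-ZIP": {(2, 3): 1},
}

overall_prob, overall_match = combine_schemes(probs_by_scheme, decisions_by_scheme)
```

Here the pair (2, 3) is a non-match under MRN blocking but a match under DOB-ZIP blocking, so it is classified as a match overall and carries the larger of its two probabilities.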

5 DISCUSSION

The FS model is widely used in probabilistic record linkage despite its often invalid assumption of conditional independence. Prior research has demonstrated its impaired performance when conditional dependence exists, as well as the potential gain in matching accuracy when conditional dependence latent class models are used (Xu et al., 2019). However, the success of the conditional dependence models depends heavily on using the correct conditional dependence structure (Li et al., 2018). Existing approaches for identifying the conditional dependence structure, including the correlation residual plot, the log-odds ratio check, and the bivariate residual approach, have been shown to perform poorly (Oberski et al., 2013; Subtil et al., 2012). Alternatively, Oberski et al. (2013) proposed the bootstrap bivariate residual approach and the score test approach, both of which perform adequately, with the score test being more appealing because of its computational convenience. In this paper, we extend the score test approach of Oberski et al. (2013) to accommodate more dependence structures. In the simulation study and the real-world linkage application, the proposed approach is found to be successful. Based on the findings of the simulation study, we recommend using a threshold of 0.1% for the pseudo-R2 and including all interactions that meet the threshold in the conditional dependence model. This model is shown to correctly identify the conditional dependence structure in many settings, resulting in comparable or better match performance than the FS model.

The proposed score tests handle three types of conditional dependence structure. The test can be easily extended to evaluate the conditional dependence in the match and non-match classes simultaneously while allowing the conditional dependency parameters to differ between classes. The two-dimensional score vector is formed from the scores in (8) and (9), where conditional dependence lies in only one class, and the Fisher information matrix is derived similarly. Although this extended model accommodates a more flexible conditional dependence structure, it does not substantially improve matching performance relative to the proposed method in our simulation study, as it identifies a similar number of correct interactions. For example, in Scenario I of the simulation study with 0.01 match prevalence, this score test identifies only one interaction, which is incorrect, in all simulated data sets when using the threshold of 0.1% for the pseudo-R2.

The proposed method performs poorly when the true match prevalence is close to 0% or 100%. This is expected since the score test is derived from the FS model, which may produce biassed parameter estimates. In record linkage, the FS model is known to perform poorly when there is a lack of overlap between two linked data files (Winkler, 2014), resulting in extremely small match prevalence. Prior research has also demonstrated the poor performance of the FS model when the match prevalence is extremely large (Xu et al., 2019). Although our proposed method generally produces a conditional dependence latent class model that performs similarly to or better than the FS model, its performance is not optimal when the true match prevalence is extremely small or extremely large. We therefore recommend that blocking schemes be selected to produce less extreme and more balanced data. Furthermore, the FS model produces severely biassed parameter estimates when match prevalence is extreme and conditional dependence exists (Xu et al., 2019). The large biases in the parameter estimates produce biassed score test statistics, leading to incorrect identification of the conditional dependence structure. In Scenario I of our simulation, when the score test statistics are derived using the true values of the model parameters (prevalence, m- and u-probabilities), Model 3 correctly identifies the conditional dependence structure in 100% of the simulated data sets with prevalence of 0.01 and 0.05 and in 81% of the simulated data sets with prevalence of 0.99. With the correct conditional dependence structure, the average BS of Model 3 is substantially reduced. This emphasises the importance of conducting a manual review to evaluate the gold standard match status of record pairs, which is helpful in correcting the biases in parameter estimates. In record linkage, manual review ascertaining the true match status is often performed for a subset of record pairs for various reasons. Manual review results, however, are rarely used in linkage models to improve parameter estimation or match performance. We are currently investigating approaches to incorporating manual review results in the proposed method, and we believe that this hybrid approach may yield a substantial improvement in the correct identification of the conditional dependence structure and, consequently, enhanced match performance.

Another possible strategy to select the conditional dependence is to include interactions sequentially. Restrictions on the independence of fields are relaxed for one pair of fields at a time and the score test is computed based on the model assuming the independence of the two fields. If the score test statistic is larger than a certain threshold of the negative log-likelihood of the model, the corresponding interaction is added to the model. Although more computationally burdensome, this sequential selection strategy can be readily implemented. Future research will be performed to evaluate its performance relative to the proposed method.
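
One way such a sequential strategy might be organised is sketched below. It is a hedged illustration rather than the procedure the authors intend to evaluate: the callables `fit_model` and `score_statistic` are hypothetical placeholders for fitting the latent class model and computing the score test, adding the pair with the largest statistic at each step is an assumption, and the stub functions and numbers exist only to make the sketch runnable.

```python
def sequential_selection(candidate_pairs, fit_model, score_statistic, threshold):
    """Forward selection of interactions.

    fit_model(interactions) -> (model, negative log-likelihood)   [hypothetical]
    score_statistic(model, pair) -> score test statistic          [hypothetical]
    threshold: fraction of the negative log-likelihood a statistic must exceed.
    """
    selected = []
    remaining = list(candidate_pairs)
    model, neg_loglik = fit_model(selected)
    while remaining:
        # Score tests are computed under the current model, which still
        # assumes independence for every pair in `remaining`.
        stats = {pair: score_statistic(model, pair) for pair in remaining}
        best = max(stats, key=stats.get)  # assumption: add the strongest pair first
        if stats[best] <= threshold * neg_loglik:
            break
        selected.append(best)
        remaining.remove(best)
        model, neg_loglik = fit_model(selected)  # refit with the new interaction
    return selected

# Stub functions and made-up statistics, for illustration only.
toy_stats = {("ADR", "ZIP"): 30.0, ("LN", "MI"): 15.0, ("FN", "SEX"): 1.0}

def toy_fit(interactions):
    return list(interactions), 1000.0  # constant negative log-likelihood

def toy_score(model, pair):
    return toy_stats[pair]

chosen = sequential_selection(toy_stats, toy_fit, toy_score, threshold=0.01)
```

In a real application the score statistics would change after each refit, which is the source of the additional computational burden noted above.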

DATA AVAILABILITY STATEMENT

The programming code and the related data are available at https://huipxu.pages.iu.edu/publications.html.

ACKNOWLEDGEMENTS

This project was supported by grant numbers R01HS023808 from the Agency for Healthcare Research and Quality and ME-2017C1-6425 from the Patient-Centered Outcomes Research Institute. The content is solely the responsibility of the authors and does not necessarily represent the official views of the Agency for Healthcare Research and Quality or the Patient-Centered Outcomes Research Institute.

REFERENCES

Albert, P. & Dodd, L. (2004) A cautionary note on the robustness of latent class models for estimating diagnostic error without a gold standard. Biometrics, 60, 427–435.

Armstrong, J. & Mayda, J. (1992) Estimation of record linkage models using dependent data. In: JSM Proceedings, Survey Research Methods Section. Alexandria, VA: American Statistical Association, pp. 853–858. Available from http://www.asasrms.org/Proceedings/y1992f.html [Accessed 18th August 2022].

Brier, G. (1950) Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78, 1–3.

Byrne, S. (2016) A note on the use of empirical AUC for evaluating probabilistic forecasts. Electronic Journal of Statistics, 10, 380–393.

Clogg, C. (1995) Chapter 6. Latent class models. In: Arminger, G., Clogg, C. & Sobel, M.E. (Eds.) Handbook of statistical modeling for the social and behavioral sciences. New York: Plenum, pp. 311–359.

Enamorado, T., Fifield, B. & Imai, K. (2019) Using a probabilistic model to assist merging of large-scale administrative records. American Political Science Review, 113, 353–371.

Engle, R. (1984) Chapter 13. Wald, likelihood ratio, and Lagrange multiplier tests in econometrics. In: Intriligator, M. & Griliches, Z. (Eds.) Handbook of econometrics, Vol. 2. North-Holland, Amsterdam: Elsevier, pp. 775–826.

Fellegi, I. & Sunter, A. (1969) A theory for record linkage. Journal of the American Statistical Association, 64, 1183–1210.

Garrett, E. & Zeger, S. (2000) Latent class model diagnosis. Biometrics, 56, 1055–1067.

Gneiting, T. & Raftery, A. (2007) Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102, 359–378.

Goldstein, H. & Harron, K. (2015) Chapter 6. Record linkage: a missing data problem. In: Harron, K., Goldstein, H. & Dibben, C. (Eds.) Methodological developments in data linkage. London: Wiley, pp. 109–124.

Goodman, L. (1974) Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika, 61, 215–231.

Hand, D. (2006) Classifier technology and the illusion of progress. Statistical Science, 21, 1–14.

Hand, D. & Yu, K. (2001) Idiot's Bayes - not so stupid after all? International Statistical Review, 69, 385–398.

Jones, G., Johnson, W., Hanson, T. & Christensen, R. (2010) Identifiability of models for multiple diagnostic testing in the absence of a gold standard. Biometrics, 66, 855–863.

Li, X., Xu, H., Shen, C. & Grannis, S. (2018) Automated linkage of patient records from disparate sources. Statistical Methods in Medical Research, 527, 172–184.

Newcombe, H. & Kennedy, J. (1962) Record linkage: making maximum use of the discriminating power of identifying information. Communications of the Association for Computing Machinery (ACM), 5, 563–566.

Oberski, D. & Vermunt, J. (2018) The expected parameter change (EPC) for local dependence assessment in binary data latent class models. arXiv preprint arXiv:1801.02400.

Oberski, D., van Kollenburg, G.H. & Vermunt, J. (2013) A Monte Carlo evaluation of three methods to detect local dependence in binary data latent class models. Advances in Data Analysis and Classification, 7, 267–279.

Ong, T., Mannino, M., Schilling, L. & Kahn, M. (2014) Improving record linkage performance in the presence of missing linkage data. Journal of Biomedical Informatics, 52, 43–54.

Qu, Y., Tan, M. & Kutner, M. (1996) Random effects models in latent class analysis for evaluating accuracy of diagnostic tests. Biometrics, 52, 797–810.

Rao, C. (1948) Large sample tests of statistical hypotheses concerning several parameters with applications to problems of estimation. Mathematical Proceedings of the Cambridge Philosophical Society, 44(1), 50–57.

Sadinle, M. (2014) Detecting duplicates in a homicide registry using a Bayesian partitioning approach. Annals of Applied Statistics, 8, 2404–2434.

Sadinle, M. (2017) Bayesian estimation of bipartite matchings for record linkage. Journal of the American Statistical Association, 112, 600–612.

Sariyar, M., Borg, A. & Pommerening, K. (2012) Missing values in deduplication of electronic patient data. Journal of the American Medical Informatics Association, 19, e76–e82.

Stanghellini, E. & Vantaggi, B. (2013) Identification of discrete concentration graph models with one hidden binary variable. Bernoulli, 19, 1820–1937.

Subtil, A., de Oliveira, M. & Gonçalves, L. (2012) Conditional dependence diagnostic in the latent class model: a simulation study. Statistics & Probability Letters, 82, 1407–1412.

Thibaudeau, Y. (1993) The discrimination power of dependency structures in record linkage. Survey Methodology, 19, 31–38.

Vermunt, J. & Magidson, J. (2005) Technical guide for Latent GOLD 4.0: basic and advanced. Belmont, MA: Statistical Innovations.

Winkler, W. (1989) Methods for adjusting for lack of independence in an application of the Fellegi-Sunter model of record linkage. Survey Methodology, 15, 101–117.

Winkler, W. (2014) Matching and record linkage. WIREs Computational Statistics, 6, 313–325.

Xu, H., Li, X., Shen, C., Hui, S. & Grannis, S. (2019) Incorporating conditional dependence in latent class models for probabilistic record linkage: Does it matter? Annals of Applied Statistics, 13, 1753–1790.

Xu, H., Li, X. & Grannis, S. (2022) A simple two-step procedure using the Fellegi-Sunter model for frequency-based record linkage. Journal of Applied Statistics, 49, 2789–2804.

APPENDIX

In the Appendix, we provide details about the score test statistics for the three types of conditional dependence structures for each pair of matching fields in the DOB-ZIP, TEL and NKLN-NKFN blocking schemes.
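
As a reading aid for these tables, the short sketch below shows how the entry in the "Maximum statistic" column relates to the three structure-specific score test statistics: in every tabulated row it equals the largest of the three. The function name `max_structure` and the dictionary layout are hypothetical and introduced only for illustration; the numerical values are taken from three rows of Table A1. Treating a statistic that is not reported as `None` is likewise an assumption of this sketch.

```python
from typing import Dict, Optional, Tuple


def max_structure(stats: Dict[str, Optional[float]]) -> Tuple[str, float]:
    """Return the dependence structure with the largest score statistic."""
    # Ignore structures for which no statistic is reported.
    available = {name: value for name, value in stats.items() if value is not None}
    best = max(available, key=lambda name: available[name])
    return best, available[best]


# Three rows of Table A1 (DOB-ZIP blocking scheme).
table_a1 = {
    ("MRN", "ADR"): {"constant dependency": 688998.8,
                     "non-match only": 42.4,
                     "match only": 688984.5},
    ("LN", "ADR"): {"constant dependency": 3814.2,
                    "non-match only": 525476.6,
                    "match only": 63.5},
    ("FN", "SEX"): {"constant dependency": 93436.7,
                    "non-match only": 25668.9,
                    "match only": 141812.9},
}

for pair, stats in table_a1.items():
    structure, value = max_structure(stats)
    print(pair, structure, value)
```

For these three pairs the largest statistic corresponds to the constant dependency, non-match only and match only structures, respectively, matching the "Maximum statistic" column.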

TABLE A1

Score test statistics for the DOB-ZIP blocking scheme of the newborn screening data deduplication

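| Pair number | Field 1 | Field 2 | Score test statistic (constant dependency) | Score test statistic (non-match only) | Score test statistic (match only) | Maximum statistic | Pseudo-R² |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | MRN | ADR | 688,998.8 | 42.4 | 688,984.5 | 688,998.8 | 2.08% |
| 2 | LN | ADR | 3814.2 | 525,476.6 | 63.5 | 525,476.6 | 1.59% |
| 3 | MRN | SEX | 348,779.5 | 2066.5 | 347,316.7 | 348,779.5 | 1.06% |
| 4 | MRN | MI | 293,642.7 | 128.9 | 293,789.4 | 293,789.4 | 0.89% |
| 5 | LN | MI | 168,218.8 | 133.0 | 174,438.2 | 174,438.2 | 0.53% |
| 6 | MRN | FN | 171,093.4 | 1.4 | 171,133.3 | 171,133.3 | 0.52% |
| 7 | FN | SEX | 93,436.7 | 25,668.9 | 141,812.9 | 141,812.9 | 0.43% |
| 8 | LN | FN | 36,857.2 | 128,675.3 | 12,844.9 | 128,675.3 | 0.39% |
| 9 | FN | MI | 101,985.2 | 128.4 | 105,686.3 | 105,686.3 | 0.32% |
| 10 | MI | ADR | 72,615.8 | 932.9 | 75,678.0 | 75,678.0 | 0.23% |
| 11 | TEL | ADR | 29,051.3 | 38,113.2 | 23,125.9 | 38,113.2 | 0.12% |
| 12 | LN | SEX | 10,709.9 | 29,746.1 | 4617.1 | 29,746.1 | 0.09% |
| 13 | LN | TEL | 4368.4 | 27,598.4 | 1799.3 | 27,598.4 | 0.08% |
| 14 | MRN | TEL | 24,744.9 | 32.3 | 24,758.5 | 24,758.5 | 0.07% |
| 15 | MI | TEL | 9770.6 | 10.8 | 10,172.0 | 10,172.0 | 0.03% |
| 16 | SEX | MI | 8192.0 | 1446.3 | 9657.7 | 9657.7 | 0.03% |
| 17 | MRN | LN | 7567.7 | 161.5 | 7581.9 | 7581.9 | 0.02% |
| 18 | SEX | ADR | 147.1 | 3015.6 | 6495.9 | 6495.9 | 0.02% |
| 19 | SEX | TEL | 4857.5 | 6005.1 | 104.4 | 6005.1 | 0.02% |
| 20 | FN | TEL | 15.6 | 5027.9 | 177.8 | 5027.9 | 0.02% |
| 21 | FN | ADR | 3680.1 | 0.1 | 3774.4 | 3774.4 | 0.01% |
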
TABLE A2

Score test statistics for the TEL blocking scheme of the newborn screening data deduplication

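| Pair number | Field 1 | Field 2 | Score test statistic (constant dependency) | Score test statistic (non-match only) | Score test statistic (match only) | Maximum statistic | Pseudo-R² |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | ADR | ZIP | 85,621.9 | 67,099.7 | 277,634.7 | 277,634.7 | 2.83% |
| 2 | LN | MI | 79,263.3 | 11,005.5 | 109,259.2 | 109,259.2 | 1.12% |
| 3 | FN | DOB | 38,059.5 | 38,059.5 |  | 38,059.5 | 0.39% |
| 4 | LN | DOB | 33,755.8 | 33,755.8 |  | 33,755.8 | 0.34% |
| 5 | MRN | SEX | 29,630.8 | 29,630.8 |  | 29,630.8 | 0.30% |
| 6 | SEX | ADR | 26,277.7 | 26,277.8 |  | 26,277.8 | 0.27% |
| 7 | FN | ADR | 24,875.5 | 24,265.9 | 4349.6 | 24,875.5 | 0.25% |
| 8 | FN | SEX | 21,614.7 | 21,614.6 |  | 21,614.7 | 0.22% |
| 9 | FN | ZIP | 16,972.5 | 16,444.3 | 2040.0 | 16,972.5 | 0.17% |
| 10 | SEX | DOB | 15,677.0 | 15,677.0 |  | 15,677.0 | 0.16% |
| 11 | MRN | ADR | 1075.8 | 314.5 | 14,875.4 | 14,875.4 | 0.15% |
| 12 | MRN | LN | 13,379.9 | 8586.6 | 7019.2 | 13,379.9 | 0.14% |
| 13 | DOB | ADR | 12,753.5 | 12,753.5 |  | 12,753.5 | 0.13% |
| 14 | MRN | ZIP | 12,145.6 | 12,494.7 | 684.2 | 12,494.7 | 0.13% |
| 15 | FN | MI | 12,186.3 | 10,328.4 | 4511.4 | 12,186.3 | 0.12% |
| 16 | LN | FN | 8855.7 | 5367.4 | 10,265.0 | 10,265.0 | 0.10% |
| 17 | LN | SEX | 8523.2 | 8523.2 |  | 8523.2 | 0.09% |
| 18 | MRN | DOB | 6480.6 | 6480.6 |  | 6480.6 | 0.07% |
| 19 | MI | ADR | 6435.9 | 5279.0 | 3656.1 | 6435.9 | 0.07% |
| 20 | SEX | ZIP | 6342.9 | 6342.9 |  | 6342.9 | 0.06% |
| 21 | DOB | ZIP | 6250.6 | 6250.6 |  | 6250.6 | 0.06% |
| 22 | MI | ZIP | 5161.7 | 4465.9 | 1248.3 | 5161.7 | 0.05% |
| 23 | MRN | MI | 82.8 | 739.2 | 3363.9 | 3363.9 | 0.03% |
| 24 | LN | ADR | 2447.9 | 3285.3 | 521.3 | 3285.3 | 0.03% |
| 25 | LN | ZIP | 682.7 | 1282.0 | 374.8 | 1282.0 | 0.01% |
| 26 | MRN | FN | 616.1 | 975.8 | 861.6 | 975.8 | 0.01% |
| 27 | MI | DOB | 157.4 | 157.4 |  | 157.4 | 0.00% |
| 28 | SEX | MI | 29.4 | 29.4 |  | 29.4 | 0.00% |
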
TABLE A3

Score test statistics for the NKLN-NKFN blocking scheme of the newborn screening data deduplication

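| Pair number | Field 1 | Field 2 | Score test statistic (constant dependency) | Score test statistic (non-match only) | Score test statistic (match only) | Maximum statistic | Pseudo-R² |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | MRN | SEX | 151,154.3 | 2612.8 | 164,152.1 | 164,152.1 | 2.89% |
| 2 | ADR | ZIP | 111,713.9 | 14,916.2 | 97,176.4 | 111,713.9 | 1.97% |
| 3 | LN | MI | 86,724.0 | 406.4 | 87,181.0 | 87,181.0 | 1.54% |
| 4 | MRN | ADR | 72,124.1 | 123.3 | 73,601.2 | 73,601.2 | 1.30% |
| 5 | MRN | FN | 48,627.3 | 24.3 | 49,221.0 | 49,221.0 | 0.87% |
| 6 | FN | SEX | 30,105.3 | 1912.4 | 36,854.7 | 36,854.7 | 0.65% |
| 7 | FN | MI | 28,196.2 | 422.9 | 27,786.9 | 28,196.2 | 0.50% |
| 8 | DOB | ZIP | 21,655.2 | 23,518.2 | 104.8 | 23,518.2 | 0.41% |
| 9 | TEL | ADR | 23,371.4 | 3347.8 | 20,646.6 | 23,371.4 | 0.41% |
| 10 | MRN | MI | 19,263.4 | 2.0 | 19,307.7 | 19,307.7 | 0.34% |
| 11 | MRN | DOB | 17,954.3 | 17,732.6 | 2673.7 | 17,954.3 | 0.32% |
| 12 | LN | FN | 17,838.2 | 1033.4 | 17,097.0 | 17,838.2 | 0.31% |
| 13 | MRN | TEL | 13,916.3 | 0.1 | 14,228.3 | 14,228.3 | 0.25% |
| 14 | SEX | DOB | 10,174.1 | 9727.9 | 590.0 | 10,174.1 | 0.18% |
| 15 | DOB | TEL | 7758.5 | 9667.0 | 75.3 | 9667.0 | 0.17% |
| 16 | FN | DOB | 8794.0 | 9553.1 | 129.1 | 9553.1 | 0.17% |
| 17 | LN | ADR | 4081.2 | 74.1 | 4140.1 | 4140.1 | 0.07% |
| 18 | MI | TEL | 3989.5 | 0.2 | 4044.5 | 4044.5 | 0.07% |
| 19 | SEX | MI | 3193.8 | 11.3 | 3507.8 | 3507.8 | 0.06% |
| 20 | DOB | ADR | 1506.0 | 2634.3 | 179.8 | 2634.3 | 0.05% |
| 21 | TEL | ZIP | 2351.7 | 1358.8 | 1248.9 | 2351.7 | 0.04% |
| 22 | MI | ADR | 2296.6 | 3.2 | 2331.5 | 2331.5 | 0.04% |
| 23 | SEX | ADR | 1724.7 | 416.1 | 1344.2 | 1724.7 | 0.03% |
| 24 | LN | TEL | 938.7 | 6.7 | 1181.2 | 1181.2 | 0.02% |
| 25 | LN | SEX | 456.2 | 1.6 | 1073.9 | 1073.9 | 0.02% |
| 26 | FN | TEL | 25.1 | 670.5 | 157.0 | 670.5 | 0.01% |
| 27 | SEX | ZIP | 86.5 | 585.4 | 490.2 | 585.4 | 0.01% |
| 28 | LN | ZIP | 322.8 | 7.0 | 577.5 | 577.5 | 0.01% |
| 29 | SEX | TEL | 70.6 | 539.7 | 118.1 | 539.7 | 0.01% |
| 30 | FN | ZIP | 118.3 | 413.3 | 481.3 | 481.3 | 0.01% |
| 31 | FN | ADR | 306.7 | 134.4 | 239.3 | 306.7 | 0.01% |
| 32 | MRN | ZIP | 120.6 | 301.5 | 245.1 | 301.5 | 0.01% |
| 33 | LN | DOB | 211.6 | 132.1 | 149.8 | 211.6 | 0.00% |
| 34 | MI | DOB | 128.7 | 33.8 | 102.9 | 128.7 | 0.00% |
| 35 | MRN | LN | 80.9 | 6.3 | 93.6 | 93.6 | 0.00% |
| 36 | MI | ZIP | 45.0 | 25.9 | 62.7 | 62.7 | 0.00% |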

Author notes

Funding information: Agency for Healthcare Research and Quality, Grant/Award Number: R01HS023808; Patient-Centered Outcomes Research Institute, Grant/Award Number: ME-2017C1-6425

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)