Daggy Joanne, Xu Huiping, Hui Siu, Grannis Shaun
Department of Biostatistics, Indiana University School of Medicine, Indianapolis, IN, 46202, U.S.A.
Stat Med. 2014 Oct 30;33(24):4250-65. doi: 10.1002/sim.6230. Epub 2014 Jun 17.
Record linkage methods commonly use a traditional latent class model to classify record pairs from different sources as true matches or non-matches. This approach was first formally described by Fellegi and Sunter and assumes that the agreement in fields is independent conditional on the latent class. Consequences of violating the conditional independence assumption include bias in parameter estimates from the model. We sought to further characterize the impact of conditional dependence on the overall misclassification rate, sensitivity, and positive predictive value in the record linkage problem when the conditional independence assumption is violated. Additionally, we evaluate various methods to account for the conditional dependence. These methods include loglinear models with appropriate interaction terms identified through the correlation residual plot as well as Gaussian random effects models. The proposed models are used to link newborn screening data obtained from a health information exchange. On the basis of simulations, loglinear models with interaction terms demonstrated the best misclassification rate, although this type of model cannot accommodate other data features such as continuous measures for agreement. Results indicate that Gaussian random effects models, which can handle additional data features, perform better than assuming conditional independence and in some situations perform as well as the loglinear model with interaction terms.
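To make the baseline concrete, the following is a minimal sketch of the traditional two-class latent class model under conditional independence, fitted by expectation-maximization. It is not the authors' implementation, and it does not cover the loglinear models with interaction terms or the Gaussian random effects models evaluated in the paper. The function names, starting values, simulated parameter values, and the 0.5 classification threshold are all assumptions chosen for illustration.

```python
import numpy as np


def posterior_match(gamma, p, m, u):
    """Posterior probability that each pair is a match, assuming the
    field agreements are independent given the latent class."""
    like_m = np.exp(gamma @ np.log(m) + (1 - gamma) @ np.log(1 - m))
    like_u = np.exp(gamma @ np.log(u) + (1 - gamma) @ np.log(1 - u))
    return p * like_m / (p * like_m + (1 - p) * like_u)


def fit_latent_class(gamma, n_iter=200, p_init=0.1):
    """EM for a two-class latent class model on binary agreement vectors.

    gamma: (n_pairs, n_fields) array of 0/1 agreement indicators.
    Returns the estimated match prevalence p and the per-field agreement
    probabilities among matches (m) and non-matches (u).
    """
    gamma = np.asarray(gamma, dtype=float)
    n_fields = gamma.shape[1]
    p = p_init
    m = np.full(n_fields, 0.9)  # assumed start: matches mostly agree
    u = np.full(n_fields, 0.1)  # assumed start: non-matches mostly disagree
    eps = 1e-9                  # keeps probabilities away from 0 and 1
    for _ in range(n_iter):
        # E-step: posterior match probability for every pair
        post = posterior_match(gamma, p, m, u)
        # M-step: posterior-weighted proportions
        p = post.mean()
        m = np.clip(post @ gamma / post.sum(), eps, 1 - eps)
        u = np.clip((1 - post) @ gamma / (1 - post).sum(), eps, 1 - eps)
    return p, m, u


# Simulated record pairs that satisfy conditional independence.
# All values below are hypothetical, not taken from the paper.
rng = np.random.default_rng(0)
n_pairs = 20000
m_true = np.array([0.95, 0.90, 0.85, 0.90])
u_true = np.array([0.05, 0.10, 0.20, 0.10])
is_match = rng.random(n_pairs) < 0.10
agree_prob = np.where(is_match[:, None], m_true, u_true)
gamma = (rng.random((n_pairs, 4)) < agree_prob).astype(float)

p_hat, m_hat, u_hat = fit_latent_class(gamma)
predicted = posterior_match(gamma, p_hat, m_hat, u_hat) > 0.5

# The three performance measures named in the abstract
misclassification = np.mean(predicted != is_match)
sensitivity = np.mean(predicted[is_match])
ppv = np.mean(is_match[predicted])
print(f"p={p_hat:.3f} misclassification={misclassification:.4f} "
      f"sensitivity={sensitivity:.3f} PPV={ppv:.3f}")
```

Because the simulated data obey conditional independence, this model is correctly specified here. Generating correlated agreement fields within a class would be one way to explore the kind of violation the paper studies.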