The Data-Adaptive Fellegi-Sunter Model for Probabilistic Record Linkage: Algorithm Development and Validation for Incorporating Missing Data and Field Selection.

Author information

Department of Biostatistics and Health Data Science, Indiana University School of Medicine, The Richard M. Fairbanks School of Public Health, Indianapolis, IN, United States.

Data and Analytics, Regenstrief Institute Inc., Indiana University School of Medicine, Indianapolis, IN, United States.

Publication information

J Med Internet Res. 2022 Sep 29;24(9):e33775. doi: 10.2196/33775.

Abstract

BACKGROUND

Quality patient care requires comprehensive health care data from a broad set of sources. However, missing data in medical records and matching field selection are 2 real-world challenges in patient-record linkage.

OBJECTIVE

In this study, we aimed to evaluate the extent to which incorporating the missing at random (MAR) assumption in the Fellegi-Sunter model and using fields selected by data-driven methods improve patient-matching accuracy in real-world use cases.

METHODS

We adapted the Fellegi-Sunter model to accommodate missing data using the MAR assumption and compared the adaptation with the common strategy of treating missing values as disagreement. Both approaches were run with matching fields either specified by experts or selected by data-driven methods. We used 4 use cases, each containing a random sample of record pairs whose match statuses were ascertained by manual review. The use cases, which represent real-world clinical and public health scenarios, were health information exchange (HIE) record deduplication, linkage of public health registry records to HIE, linkage of Social Security Death Master File records to HIE, and deduplication of newborn screening records. Matching performance was evaluated using sensitivity, specificity, positive predictive value, negative predictive value, and F1-score.
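The two ways of handling missing values can be illustrated with a minimal sketch of Fellegi-Sunter scoring. It is not the authors' implementation. It assumes conditionally independent fields, the standard log2 agreement and disagreement weights, and that under MAR a field with a missing comparison simply contributes nothing to the score. The field names, the m- and u-probabilities, and the function names are hypothetical and chosen only for illustration.

```python
import math
from typing import Dict, Optional

# Hypothetical m-probabilities P(agree | match) and u-probabilities
# P(agree | non-match) for three illustrative fields. In practice these
# would be estimated from data (for example with an EM algorithm).
M_U: Dict[str, tuple] = {
    "last_name": (0.95, 0.01),
    "birth_year": (0.98, 0.05),
    "zip": (0.90, 0.02),
}


def field_weight(m: float, u: float, agrees: bool) -> float:
    """Standard Fellegi-Sunter log2 weight for one field comparison."""
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def match_score(comparison: Dict[str, Optional[bool]],
                missing_as_disagreement: bool) -> float:
    """Sum field weights for one record pair.

    comparison maps a field to True (agree), False (disagree),
    or None (value missing in at least one record).
    """
    total = 0.0
    for field, (m, u) in M_U.items():
        agrees = comparison.get(field)
        if agrees is None:
            if missing_as_disagreement:
                # Common strategy: penalize the pair as if the field disagreed.
                total += field_weight(m, u, False)
            # Assumed MAR handling: the field is skipped and adds 0.
            continue
        total += field_weight(m, u, agrees)
    return total


pair = {"last_name": True, "birth_year": True, "zip": None}
score_mar = match_score(pair, missing_as_disagreement=False)
score_disagree = match_score(pair, missing_as_disagreement=True)
print(f"MAR score: {score_mar:.2f}")
print(f"Missing-as-disagreement score: {score_disagree:.2f}")
```

In this sketch the pair with a missing ZIP code scores lower when missingness is treated as disagreement, which could push a true match below the decision threshold.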

RESULTS

Incorporating the MAR assumption in the Fellegi-Sunter model maintained or improved F1-scores, regardless of whether matching fields were expert-specified or selected by data-driven methods. Combining the MAR assumption and data-driven fields optimized the F1-scores in the 4 use cases.
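For readers less familiar with the five metrics named in the Methods, the following sketch computes them from the confusion counts of manually reviewed record pairs. The counts are made up for illustration and are not results from the study. The function name is hypothetical, and the code assumes every denominator is nonzero.

```python
from typing import Dict


def linkage_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute matching metrics from confusion counts.

    tp: true matches linked, fp: non-matches linked,
    tn: non-matches not linked, fn: true matches missed.
    Assumes all denominators are nonzero.
    """
    sensitivity = tp / (tp + fn)   # share of true matches found
    specificity = tn / (tn + fp)   # share of non-matches rejected
    ppv = tp / (tp + fp)           # share of links that are correct
    npv = tn / (tn + fn)           # share of non-links that are correct
    # F1 is the harmonic mean of PPV and sensitivity.
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
    }


# Illustrative counts only.
metrics = linkage_metrics(tp=90, fp=10, tn=880, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because F1 ignores true negatives, it is less affected by the large number of non-matching pairs that is typical in record linkage samples.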

CONCLUSIONS

MAR is a reasonable assumption in real-world record linkage applications: it maintains or improves F1-scores regardless of whether matching fields are expert-specified or data-driven. Data-driven selection of fields coupled with MAR achieves the best overall performance, which can be especially useful in privacy-preserving record linkage.
