Department of Biostatistics and Health Data Science, Indiana University School of Medicine, The Richard M. Fairbanks School of Public Health, Indianapolis, IN, United States.
Data and Analytics, Regenstrief Institute Inc., Indiana University School of Medicine, Indianapolis, IN, United States.
J Med Internet Res. 2022 Sep 29;24(9):e33775. doi: 10.2196/33775.
Quality patient care requires comprehensive health care data from a broad set of sources. However, missing data in medical records and matching field selection are 2 real-world challenges in patient-record linkage.
In this study, we aimed to evaluate the extent to which incorporating the missing at random (MAR) assumption in the Fellegi-Sunter model and using data-driven field selection improve patient-matching accuracy in real-world use cases.
We adapted the Fellegi-Sunter model to accommodate missing data using the MAR assumption and compared the adaptation to the common strategy of treating missing values as disagreement with matching fields specified by experts or selected by data-driven methods. We used 4 use cases, each containing a random sample of record pairs with match statuses ascertained by manual reviews. Use cases included health information exchange (HIE) record deduplication, linkage of public health registry records to HIE, linkage of Social Security Death Master File records to HIE, and deduplication of newborn screening records, which represent real-world clinical and public health scenarios. Matching performance was evaluated using the sensitivity, specificity, positive predictive value, negative predictive value, and F1-score.
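The contrast between the two missing-data strategies can be illustrated with a minimal sketch. It assumes the standard Fellegi-Sunter log-likelihood-ratio weights with conditionally independent fields. The field names, the m- and u-probabilities, and the function names are hypothetical and chosen for illustration only; they are not values or code from the study. Under the MAR assumption, a field with a missing value drops out of the likelihood ratio and contributes a weight of 0, whereas the common strategy assigns it the disagreement weight.

```python
import math

# Hypothetical m-probabilities P(agree | true match) and u-probabilities
# P(agree | true nonmatch). Illustrative values only, not from the study.
M_PROBS = {"last_name": 0.95, "birth_date": 0.98, "ssn": 0.90}
U_PROBS = {"last_name": 0.01, "birth_date": 0.005, "ssn": 0.001}


def field_weight(field, agreement, missing_as_disagreement):
    """Return the log2 likelihood-ratio weight for one field.

    agreement is True (values agree), False (values disagree), or None
    (the value is missing in at least one record of the pair).
    """
    m, u = M_PROBS[field], U_PROBS[field]
    if agreement is None:
        if not missing_as_disagreement:
            # MAR: the missing field drops out of the likelihood ratio.
            return 0.0
        # Common strategy: score the missing value as a disagreement.
        agreement = False
    if agreement:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def match_score(pattern, missing_as_disagreement=False):
    """Sum the field weights over one record pair's agreement pattern."""
    return sum(
        field_weight(field, agreement, missing_as_disagreement)
        for field, agreement in pattern.items()
    )


# A record pair that agrees on name and birth date but lacks an SSN.
pattern = {"last_name": True, "birth_date": True, "ssn": None}
score_mar = match_score(pattern)
score_disagree = match_score(pattern, missing_as_disagreement=True)
```

In this sketch, the pair with a missing SSN scores lower when the missing value is treated as a disagreement than when it is handled under MAR, so the same classification threshold can label the pair differently under the two strategies.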
Incorporating the MAR assumption in the Fellegi-Sunter model maintained or improved F1-scores, regardless of whether matching fields were expert-specified or selected by data-driven methods. Combining the MAR assumption and data-driven fields optimized the F1-scores in the 4 use cases.
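For reference, the 5 performance measures used in the evaluation can all be derived from the confusion matrix of predicted versus manually reviewed match statuses. The following is a minimal sketch using the standard definitions; the function name and the counts are hypothetical, not results from the study, and the sketch assumes that every denominator is nonzero.

```python
def matching_metrics(tp, fp, tn, fn):
    """Compute matching performance from confusion-matrix counts.

    tp: true matches classified as matches
    fp: true nonmatches classified as matches
    tn: true nonmatches classified as nonmatches
    fn: true matches classified as nonmatches
    """
    sensitivity = tp / (tp + fn)  # also called recall
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)  # positive predictive value, also called precision
    npv = tn / (tn + fn)  # negative predictive value
    # F1 is the harmonic mean of PPV and sensitivity.
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
    }


# Hypothetical counts for illustration only.
metrics = matching_metrics(tp=90, fp=10, tn=880, fn=20)
```

Because the F1-score ignores true nonmatches, it is not inflated by the large number of nonmatching pairs that is typical of record linkage samples.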
MAR is a reasonable assumption in real-world record linkage applications: it maintains or improves F1-scores regardless of whether matching fields are expert-specified or data-driven. Data-driven selection of fields coupled with MAR achieves the best overall performance, which can be especially useful in privacy-preserving record linkage.