Exploring the Complexity of Real-World Health Data Record Linkage-An Exemplary Study Linking Cancer Registry and Claims Data.

Authors

Nadja Lendle, Bianca Kollhorst, Timm Intemann

Affiliation

Department of Biometry and Data Management, Leibniz-Institute for Prevention Research and Epidemiology - BIPS, Bremen, Germany.

Publication

Pharmacoepidemiol Drug Saf. 2025 Apr;34(4):e70120. doi: 10.1002/pds.70120.

PMID:40130753

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11934838/

Abstract

PURPOSE

Record linkage based on quasi-identifiers remains an important approach as not every data source provides a comprehensive unique identifier. In this study, the reasons for the failure of a linkage based on quasi-identifiers were examined. Furthermore, informed algorithms using information on gold standard links were developed to investigate the potentially achievable linkage quality based on quasi-identifiers.

METHODS

The study population includes patients from an antidiabetic cohort from German claims and colorectal cancer patients from two German cancer registries. Linkage algorithms were applied using information on gold standard links. Informed linkage algorithms based on deterministic linkage, logistic regression, random forests, gradient boosting, and neural networks were derived and compared. Descriptive analyses were performed to identify reasons for the failure of linkage, such as discrepancies between data sources.
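As a point of reference for the deterministic approach named above, the following is a minimal sketch of exact-match linkage on quasi-identifiers. It is not the authors' algorithm: the field names (`birth_year`, `sex`, `region`, `dx_year`, `dx_quarter`), the `id` field, and the rule of accepting a link only when the key is unique in both sources are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical field names; the abstract lists the variables but not a schema.
QUASI_IDENTIFIERS = ("birth_year", "sex", "region", "dx_year", "dx_quarter")


def key_of(record):
    """Build the quasi-identifier key of one record."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def index_by_key(records):
    """Map each quasi-identifier key to the ids of the records carrying it."""
    index = defaultdict(list)
    for record in records:
        index[key_of(record)].append(record["id"])
    return index


def deterministic_link(claims, registry):
    """Return (claims_id, registry_id) pairs whose key is unique in both sources.

    Assumed rule: a key shared by several records on either side is treated
    as ambiguous and produces no link.
    """
    claims_index = index_by_key(claims)
    registry_index = index_by_key(registry)
    links = []
    for key, claim_ids in claims_index.items():
        registry_ids = registry_index.get(key, [])
        if len(claim_ids) == 1 and len(registry_ids) == 1:
            links.append((claim_ids[0], registry_ids[0]))
    return links
```

A rule like this trades recall for precision: every ambiguous key is left unlinked, which is one way non-unique quasi-identifiers can limit linkage quality.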

RESULTS

A gradient boosting-based linkage approach performed best, achieving a precision (positive predictive value) of 77%, a recall (sensitivity) of 81%, and an F*-measure (combining precision and recall) of 64%. Of 641 patients in GePaRD (the German Pharmacoepidemiological Research Database, the claims data source), 8% were not uniquely identifiable using birth year, sex, area of residence, and year and quarter of diagnosis, whereas 33% of 42 817 cancer registry patients were not uniquely identifiable with these quasi-identifiers.
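The three quality measures can be computed from counts of true links found (TP), false links (FP), and missed links (FN). The sketch below assumes that F* denotes the transformation F/(2 - F) of the usual F-measure, which equals TP / (TP + FP + FN); the abstract does not give the formula, so this reading is an assumption, and the function name `linkage_metrics` is hypothetical.

```python
def linkage_metrics(tp, fp, fn):
    """Compute precision, recall, F-measure and F* from link counts.

    tp: predicted links that are true links
    fp: predicted links that are not true links
    fn: true links that were not predicted
    Assumes F* = F / (2 - F) = tp / (tp + fp + fn).
    """
    if tp <= 0:
        raise ValueError("at least one true positive is needed for these ratios")
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    f_star = tp / (tp + fp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "f_star": f_star,
    }
```

Under this definition F* can never exceed the smaller of precision and recall, which is why an F* well below both reported values is not a contradiction.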

CONCLUSIONS

Linkage of German claims and cancer registry data based on quasi-identifiers results in insufficient linkage quality because subjects cannot be uniquely identified. If unique identifiers are available for a subsample, it is advisable to use them to derive informed linkage algorithms for the entire sample. In this setting, the machine learning technique gradient boosting outperformed the other methods.
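The share of subjects who cannot be uniquely identified, as referred to above, can be estimated within a single data source before any linkage is attempted. The following is a minimal sketch under assumed record and field names; `share_not_unique` is a hypothetical helper, not a function from the study.

```python
from collections import Counter


def share_not_unique(records, fields):
    """Return the fraction of records whose combination of `fields`
    is shared with at least one other record in the same source."""
    if not records:
        return 0.0
    keys = [tuple(record[field] for field in fields) for record in records]
    counts = Counter(keys)
    ambiguous = sum(1 for key in keys if counts[key] > 1)
    return ambiguous / len(keys)
```

Running such a check on each source separately shows how much ambiguity the chosen quasi-identifiers leave, independent of any linkage algorithm.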
