[Assessment of a method for automatic match classification in probabilistic data linkage].

Author information

Duarte Daniela de Almeida Pereira, Corrêa Camila Soares Lima, Fayer Vívian Assis, Nogueira Mário Círio, Bustamante-Teixeira Maria Teresa

Affiliations

Universidade Federal de Juiz de Fora, Juiz de Fora, Brasil.

Divisão de Saúde, Universidade Federal de Viçosa, Viçosa, Brasil.

Publication information

Cad Saude Publica. 2019 Nov 11;35(11):e00066419. doi: 10.1590/0102-311X00066419. eCollection 2019.

Abstract

The objective was to test and assess the accuracy of a scoring method for probabilistic data linkage that enables automatic identification of true matches, dispensing with the manual inspection stage. This accuracy study used data from the Breast Cancer Information System (SISMAMA) database in Minas Gerais State, Brazil, for 2009 and 2010. After cleaning and standardization, a 16-step probabilistic linkage of the 2009 and 2010 databases was performed, and each step was inspected manually to obtain a gold standard. Samples were then selected, inspected, and assessed to calculate the method's accuracy in selecting true matches. All the steps and the samples with 200 and 300 matches showed high sensitivity (recall > 0.97), positive predictive value (precision > 0.95), accuracy (> 0.97), F-measure (> 0.96), and area under the precision-recall curve (> 0.98). The sample with 100 matches also showed high values for these measures, but with low scores. Of the 16 steps assessed, the combined use of only three was sufficient to identify 99.24% of the true matches in the total database. The proposed method allows databases to be linked automatically while maintaining accuracy. It facilitates the use of probabilistic linkage in health services, especially for health surveillance and management.
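As a rough illustration of the measures named in the abstract, the sketch below computes sensitivity (recall), positive predictive value (precision), accuracy, and the F-measure for candidate pairs classified by a score cutoff and compared against a manually inspected gold standard. The names `Pair`, `classification_metrics`, and `example_pairs`, the cutoff value, and the data are all hypothetical and invented for illustration; they are not the study's data, and the cutoff rule is only an assumption about how a linkage score might be turned into an automatic match decision.

```python
from dataclasses import dataclass


@dataclass
class Pair:
    score: float    # linkage score of the candidate pair (hypothetical values)
    is_match: bool  # gold-standard label from manual inspection


def classification_metrics(pairs, threshold):
    """Classify pairs as matches when score >= threshold and compare with the gold standard."""
    tp = sum(1 for p in pairs if p.score >= threshold and p.is_match)
    fp = sum(1 for p in pairs if p.score >= threshold and not p.is_match)
    fn = sum(1 for p in pairs if p.score < threshold and p.is_match)
    tn = sum(1 for p in pairs if p.score < threshold and not p.is_match)

    # Guard each ratio against an empty denominator.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "recall": recall,
        "precision": precision,
        "accuracy": accuracy,
        "f_measure": f_measure,
    }


# Invented toy data: six candidate pairs with made-up scores and labels.
example_pairs = [
    Pair(12.0, True),
    Pair(10.5, True),
    Pair(9.0, False),
    Pair(8.0, True),
    Pair(3.0, False),
    Pair(1.0, False),
]

if __name__ == "__main__":
    print(classification_metrics(example_pairs, threshold=8.5))
```

Repeating the calculation over a range of cutoffs yields the precision-recall pairs from which an area under the precision-recall curve could be estimated.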

