Duarte Daniela de Almeida Pereira, Corrêa Camila Soares Lima, Fayer Vívian Assis, Nogueira Mário Círio, Bustamante-Teixeira Maria Teresa
Universidade Federal de Juiz de Fora, Juiz de Fora, Brasil.
Divisão de Saúde, Universidade Federal de Viçosa, Viçosa, Brasil.
Cad Saude Publica. 2019 Nov 11;35(11):e00066419. doi: 10.1590/0102-311X00066419. eCollection 2019.
The objective was to test and assess the accuracy of a scoring method for probabilistic data linkage that enables automatic identification of true matches, dispensing with the manual inspection stage. This accuracy study used data from the Breast Cancer Information System (SISMAMA) database in Minas Gerais State, Brazil, for 2009 and 2010. After cleaning and standardization, a 16-step probabilistic linkage of the 2009 and 2010 databases was performed, and each step was inspected manually to obtain a gold standard. Samples were then selected, inspected, and assessed to calculate the method's accuracy in selecting true matches. All the steps and samples with 200 and 300 matches showed high sensitivity (recall, > 0.97), high positive predictive value (precision, > 0.95), high accuracy (> 0.97), high F-measure (> 0.96), and high area under the precision-recall curve (> 0.98). The sample with 100 matches showed high values for these measures, but with low scores. Of the 16 steps assessed, the combined use of only three was sufficient to identify 99.24% of the true matches in the total database. The proposed method allows databases to be linked automatically while maintaining accuracy. It facilitates the use of probabilistic linkage in health services, especially for health surveillance and management.
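As an illustration of the measures reported above, the following minimal sketch computes sensitivity (recall), positive predictive value (precision), accuracy, and F-measure for candidate pairs classified by a score cutoff and compared against a manually inspected gold standard. It is not the authors' implementation: the function name `evaluate_threshold`, the cutoff value, and the sample scores are hypothetical and chosen only for demonstration.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    sensitivity: float  # recall: true matches found / all true matches
    precision: float    # positive predictive value
    accuracy: float
    f_measure: float    # harmonic mean of precision and recall


def evaluate_threshold(scores, is_true_match, threshold):
    """Classify pairs with score >= threshold as matches and compare
    the result with the gold standard (hypothetical helper)."""
    tp = fp = tn = fn = 0
    for score, truth in zip(scores, is_true_match):
        predicted = score >= threshold
        if predicted and truth:
            tp += 1
        elif predicted and not truth:
            fp += 1
        elif not predicted and truth:
            fn += 1
        else:
            tn += 1

    total = tp + fp + tn + fn
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    denom = precision + sensitivity
    f_measure = 2 * precision * sensitivity / denom if denom else 0.0
    return Metrics(sensitivity, precision, accuracy, f_measure)


# Made-up linkage scores and manual-inspection labels, for illustration only.
example_scores = [12.5, 9.1, 8.7, 3.2, 1.0, -2.4]
example_truth = [True, True, False, True, False, False]
example_metrics = evaluate_threshold(example_scores, example_truth, threshold=5.0)
print(example_metrics)
```

Repeating the calculation over a range of cutoffs gives the precision-recall pairs from which an area under the precision-recall curve can be estimated.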