
Accuracy of probabilistic and deterministic record linkage: the case of tuberculosis.

Authors

Oliveira Gisele Pinto de, Bierrenbach Ana Luiza de Souza, Camargo Kenneth Rochel de, Coeli Cláudia Medina, Pinheiro Rejane Sobrino

Affiliations

Programa de Pós-Graduação em Saúde Coletiva. Instituto de Estudos em Saúde Coletiva. Universidade Federal do Rio de Janeiro. Rio de Janeiro, RJ, Brasil.

Instituto de Ensino e Pesquisa. Hospital Sírio-Libanês. São Paulo, SP, Brasil.

Publication information

Rev Saude Publica. 2016 Aug 22;50:49. doi: 10.1590/S1518-8787.2016050006327.

DOI:10.1590/S1518-8787.2016050006327
PMID:27556963
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4988803/
Abstract

OBJECTIVE

To analyze the accuracy of deterministic and probabilistic record linkage in identifying duplicate tuberculosis (TB) records, as well as the characteristics of discordant pairs.

METHODS

The study analyzed all TB records from 2009 to 2011 in the state of Rio de Janeiro. A deterministic record linkage algorithm was developed using a set of 70 rules, each combining three or more fragments of the key variables, used either unmodified or transformed (Soundex or substring). The probabilistic approach required a score cutoff above which links would be automatically classified as belonging to the same individual. The cutoff was obtained by linking the Notifiable Diseases Information System - Tuberculosis database with itself, followed by manual review and analysis of ROC and precision-recall curves. Sensitivity and specificity were calculated to assess accuracy.
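The rule-based deterministic approach described above can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not list the 70 rules or name the key variables, so the field names (`name`, `birth_date`, `mother_name`), the fragment definitions, and the two example rules below are all hypothetical. The sketch only shows the general idea of a rule as three or more fragments (raw, Soundex-coded, or substring) that must all agree for two records to be flagged as duplicates.

```python
import unicodedata
from typing import Callable, Dict, Tuple

Record = Dict[str, str]

# American Soundex digit groups.
_CODES = {ch: str(d)
          for d, group in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1)
          for ch in group}


def soundex(name: str) -> str:
    """Four-character American Soundex code; empty string for empty input."""
    # Decompose accents (e.g. "João" -> "Joao") and keep ASCII letters only.
    decomposed = unicodedata.normalize("NFKD", name).upper()
    letters = [c for c in decomposed if "A" <= c <= "Z"]
    if not letters:
        return ""
    out = [letters[0]]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            out.append(code)
        prev = code  # a vowel resets prev, so a repeated code is kept
    return ("".join(out) + "000")[:4]


def _token(text: str, index: int) -> str:
    parts = text.split()
    return parts[index] if parts else ""


# Hypothetical fragments: each maps a record to a comparable string.
FRAGMENTS: Dict[str, Callable[[Record], str]] = {
    "first_name_sdx": lambda r: soundex(_token(r.get("name", ""), 0)),
    "last_name_sdx": lambda r: soundex(_token(r.get("name", ""), -1)),
    "name_prefix4": lambda r: r.get("name", "").upper()[:4],
    "birth_date": lambda r: r.get("birth_date", ""),
    "birth_year": lambda r: r.get("birth_date", "")[:4],
    "mother_first_sdx": lambda r: soundex(_token(r.get("mother_name", ""), 0)),
}

# Hypothetical rules: each is a combination of three or more fragments.
RULES: Tuple[Tuple[str, ...], ...] = (
    ("first_name_sdx", "last_name_sdx", "birth_date"),
    ("name_prefix4", "birth_year", "mother_first_sdx"),
)


def rule_matches(rule: Tuple[str, ...], a: Record, b: Record) -> bool:
    """A rule matches only if every fragment is present and equal in both records."""
    for frag in rule:
        va, vb = FRAGMENTS[frag](a), FRAGMENTS[frag](b)
        if not va or va != vb:
            return False  # missing values can never satisfy a fragment
    return True


def is_duplicate(a: Record, b: Record) -> bool:
    """Two records are flagged as duplicates if any rule matches."""
    return any(rule_matches(rule, a, b) for rule in RULES)


# Made-up records for illustration only.
rec_a = {"name": "Maria Souza", "birth_date": "1980-05-17", "mother_name": "Ana Souza"}
rec_b = {"name": "Maria Sousa", "birth_date": "1980-05-17", "mother_name": ""}
rec_c = {"name": "Mario Silva", "birth_date": "1975-01-02", "mother_name": "Rita Silva"}
```

Here `rec_a` and `rec_b` differ in the spelling of the surname but share Soundex codes and date of birth, so the first rule flags them, while `rec_c` satisfies no rule. Treating an empty fragment as a non-match reflects one plausible design choice; it also shows why missing key variables cause true duplicates to be missed.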

RESULTS

Sensitivity was 87.2% for probabilistic and 95.2% for deterministic record linkage; specificity was 99.8% and 99.9%, respectively. Missing values in the key variables and low similarity scores for name and date of birth were the main reasons the techniques failed to identify records of the same individual.
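As an illustration of how such figures are derived, the following Python sketch computes sensitivity and specificity for a probabilistic linkage score at a given cutoff, using manually reviewed pairs as the gold standard, and scans candidate cutoffs. The scores and labels are made up, and selecting the cutoff by Youden's J (sensitivity + specificity - 1) is an assumption made here for simplicity; the abstract says only that ROC and precision-recall curves were used, not which criterion was applied.

```python
from typing import List, Tuple


def sensitivity_specificity(scores: List[float], truth: List[bool],
                            cutoff: float) -> Tuple[float, float]:
    """Classify pairs with score >= cutoff as links and compare to the gold standard."""
    tp = fp = tn = fn = 0
    for score, is_true_pair in zip(scores, truth):
        predicted_link = score >= cutoff
        if predicted_link and is_true_pair:
            tp += 1
        elif predicted_link and not is_true_pair:
            fp += 1
        elif not predicted_link and is_true_pair:
            fn += 1
        else:
            tn += 1
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


def best_cutoff(scores: List[float], truth: List[bool]) -> float:
    """Pick the observed score that maximizes Youden's J (assumed criterion).

    Assumes the gold standard contains both true pairs and non-pairs.
    """
    def youden_j(cutoff: float) -> float:
        sens, spec = sensitivity_specificity(scores, truth, cutoff)
        return sens + spec - 1.0

    return max(sorted(set(scores)), key=youden_j)


# Made-up linkage scores and manual-review labels (True = same individual).
scores = [12.5, 9.1, 8.7, 6.0, 4.2, 3.9, -1.0, -2.3, -5.0]
truth = [True, True, True, True, False, True, False, False, False]

cutoff = best_cutoff(scores, truth)
sens, spec = sensitivity_specificity(scores, truth, cutoff)
```

In this toy data one true pair scores below a non-pair, so no cutoff separates the classes perfectly; raising the cutoff trades sensitivity for specificity, which is the trade-off the ROC curve visualizes.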

CONCLUSIONS

The two techniques showed a high level of agreement in classifying pairs. Although deterministic linkage identified more duplicate records than probabilistic linkage, the latter retrieved records not identified by the former. User needs and experience should be considered when choosing which technique to use.
