Efficient Record Linkage Algorithms Using Complete Linkage Clustering.

Authors

Mamun Abdullah-Al, Aseltine Robert, Rajasekaran Sanguthevar

Affiliations

Department of Computer Science and Engineering, University of Connecticut, Storrs, Connecticut, United States of America.

Institute for Public Health Research, University of Connecticut, East Hartford, Connecticut, United States of America.

Publication

PLoS One. 2016 Apr 28;11(4):e0154446. doi: 10.1371/journal.pone.0154446. eCollection 2016.

DOI:10.1371/journal.pone.0154446
PMID:27124604
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4849582/
Abstract

Datasets held by different agencies often contain records of the same individuals. Linking these datasets to identify all the records belonging to the same individual is a crucial and challenging problem, especially given the large volumes of data. Many of the available record linkage algorithms are either slow or inaccurate at finding matches and non-matches among the records. In this paper we propose efficient and reliable sequential and parallel algorithms for the record linkage problem based on complete linkage hierarchical clustering. In addition to hierarchical clustering, we use two other techniques: elimination of duplicate records and blocking. Our algorithms use sorting as a subroutine to identify identical copies of records. We have tested our algorithms on datasets with millions of synthetic records. Experimental results show that our algorithms achieve nearly 100% accuracy, and the parallel implementations achieve almost linear speedups. The time complexities of these algorithms do not exceed those of the previous best-known algorithms. Our proposed algorithms outperform the previous best-known algorithms in accuracy while consuming reasonable run times.
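The three techniques named in the abstract (sorting to drop identical copies, blocking, and complete linkage clustering) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The record layout (tuples of strings), the prefix blocking key, the summed edit distance, the `threshold` parameter, and all function names are assumptions chosen for illustration only.

```python
from itertools import groupby


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def record_distance(r, s):
    """Assumed record distance: sum of field-wise edit distances."""
    return sum(edit_distance(x, y) for x, y in zip(r, s))


def deduplicate(records):
    """Sort so identical copies become adjacent, then keep one of each."""
    return [rec for rec, _ in groupby(sorted(records))]


def block(records, key_len=2):
    """Group records by a blocking key (assumed: prefix of the first field)."""
    blocks = {}
    for rec in records:
        blocks.setdefault(rec[0][:key_len].lower(), []).append(rec)
    return blocks


def complete_linkage(records, threshold):
    """Agglomerative clustering where the distance between two clusters is
    the maximum pairwise record distance; stop once it exceeds threshold."""
    clusters = [[rec] for rec in records]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = max(record_distance(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break
        clusters[i].extend(clusters.pop(j))
    return clusters


def link(records, threshold=2):
    """Deduplicate, block, then cluster each block independently."""
    linked = []
    for blk in block(deduplicate(records)).values():
        linked.extend(complete_linkage(blk, threshold))
    return linked


records = [("john", "smith"), ("john", "smith"), ("jon", "smith"),
           ("jane", "doe"), ("john", "smyth"), ("mary", "jones")]
clusters = link(records, threshold=2)
```

With `threshold=1` the records `("john", "smyth")` and `("jon", "smith")` stay in separate clusters, because complete linkage requires every pair in a merged cluster to be within the threshold. Single linkage would chain them together through `("john", "smith")`. The naive clustering loop here is far slower than anything suited to millions of records, and a single prefix key misses records whose first characters contain an error.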

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c538/4849582/dadcca217720/pone.0154446.g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c538/4849582/0f1c902c3aac/pone.0154446.g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c538/4849582/bc99b85d2fed/pone.0154446.g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c538/4849582/83862c79b2ad/pone.0154446.g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c538/4849582/9f9e37504c2c/pone.0154446.g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c538/4849582/d175c1f196a5/pone.0154446.g006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c538/4849582/d1d51169d16c/pone.0154446.g007.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c538/4849582/11612c048bdc/pone.0154446.g008.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c538/4849582/302e9c62d383/pone.0154446.g009.jpg

Similar articles

1
Efficient Record Linkage Algorithms Using Complete Linkage Clustering.
PLoS One. 2016 Apr 28;11(4):e0154446. doi: 10.1371/journal.pone.0154446. eCollection 2016.
2
Efficient sequential and parallel algorithms for record linkage.
J Am Med Inform Assoc. 2014 Mar-Apr;21(2):252-62. doi: 10.1136/amiajnl-2013-002034. Epub 2013 Oct 23.
3
FIRLA: a Fast Incremental Record Linkage Algorithm.
J Biomed Inform. 2022 Jun;130:104094. doi: 10.1016/j.jbi.2022.104094. Epub 2022 May 10.
4
RLT-S: A Web System for Record Linkage.
PLoS One. 2015 May 5;10(5):e0124449. doi: 10.1371/journal.pone.0124449. eCollection 2015.
5
A new computationally efficient algorithm for record linkage with field dependency and missing data imputation.
Int J Med Inform. 2018 Jan;109:70-75. doi: 10.1016/j.ijmedinf.2017.10.021. Epub 2017 Nov 6.
6
Comparing record linkage software programs and algorithms using real-world data.
PLoS One. 2019 Sep 24;14(9):e0221459. doi: 10.1371/journal.pone.0221459. eCollection 2019.
7
A comparison of accuracy and computational feasibility of two record linkage algorithms in retrieving vital status information from HIV/AIDS patients registered in Brazilian public databases.
Int J Med Inform. 2018 Jun;114:45-51. doi: 10.1016/j.ijmedinf.2018.03.005. Epub 2018 Mar 20.
8
An efficient record linkage scheme using graphical analysis for identifier error detection.
BMC Med Inform Decis Mak. 2011 Feb 1;11:7. doi: 10.1186/1472-6947-11-7.
9
Variable selection for latent class analysis in the presence of missing data with application to record linkage.
Stat Methods Med Res. 2024 Jun;33(6):966-980. doi: 10.1177/09622802241242317. Epub 2024 Apr 9.
10
An open-source probabilistic record linkage process for records with family-level information: Simulation study and applied analysis.
PLoS One. 2023 Oct 20;18(10):e0291581. doi: 10.1371/journal.pone.0291581. eCollection 2023.

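Cited by

1
A Machine Learning-Based Clustering Analysis to Explore Bisphenol A and Phthalate Exposure from Medical Devices in Infants with Congenital Heart Defects.
Environ Health Perspect. 2025 Jun;133(6):67016. doi: 10.1289/EHP15034. Epub 2025 Jun 18.
2
Clinical Characteristics of Psoriasis for Initiation of Biologic Therapy: A Cluster Analysis.
Ann Dermatol. 2023 Apr;35(2):132-139. doi: 10.5021/ad.22.148.
3
A fast privacy-preserving patient record linkage of time series data.
Sci Rep. 2023 Feb 25;13(1):3292. doi: 10.1038/s41598-023-29132-8.
4
The emerging landscape of health research based on biobanks linked to electronic health records: Existing resources, statistical challenges, and potential opportunities.
Stat Med. 2020 Mar 15;39(6):773-800. doi: 10.1002/sim.8445. Epub 2019 Dec 20.
5
Preparing next-generation scientists for biomedical big data: artificial intelligence approaches.
Per Med. 2019 May 1;16(3):247-257. doi: 10.2217/pme-2018-0145. Epub 2019 Feb 14.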
