


An efficient learning based approach for automatic record deduplication with benchmark datasets.

Authors

Ravikanth M, Korra Sampath, Mamidisetti Gowtham, Goutham Maganti, Bhaskar T

Affiliations

Department of CSE, Malla Reddy University, Maisammaguda, Kompally, Hyderabad, India.

Department of CSE, Sri Indu College of Engineering and Technology (A), Sheriguda, Ibrahimpatnam, Hyderabad, T.S, 501510, India.

Publication

Sci Rep. 2024 Jul 15;14(1):16254. doi: 10.1038/s41598-024-63242-1.

DOI: 10.1038/s41598-024-63242-1
PMID: 39009682
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11251143/
Abstract

With ongoing technological innovation, real-world enterprises manage every piece of data they hold, since it can be mined to derive business intelligence (BI). However, when data comes from multiple sources, duplicate records may result. Because data is given paramount importance, eliminating duplicate entities is also significant for data integration, performance, and resource optimization. For building reliable record-deduplication systems, deep learning has of late offered promising learning-based approaches. Deep ER is one such recent deep learning method for eliminating duplicates in structured data. Using it as a reference model, in this paper we propose a framework called Enhanced Deep Learning-based Record Deduplication (EDL-RD) to improve performance further. To this end, we exploit a variant of Long Short-Term Memory (LSTM) together with various attribute compositions, similarity metrics, and numerical and null-value resolution. We propose an algorithm called Efficient Learning based Record Deduplication (ELbRD), which extends the reference model with the aforementioned enhancements. An empirical study reveals that the proposed framework with these extensions outperforms existing methods.
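The abstract describes comparing record pairs through attribute-level similarity metrics with explicit handling of numeric and null values before a learned matcher decides on duplicates. Below is a minimal sketch of that pair-comparison stage only, not the authors' LSTM-based ELbRD algorithm; the field names, the neutral 0.5 score for nulls, and the 0.85 threshold are illustrative assumptions.

```python
from difflib import SequenceMatcher


def attribute_similarity(a, b):
    """Score two attribute values in [0, 1]."""
    # Null-value resolution: a missing value is treated as neutral evidence
    # (assumed convention, not the paper's exact rule).
    if a is None or b is None:
        return 0.5
    # Numeric resolution: scaled absolute difference instead of string matching.
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        denom = max(abs(a), abs(b), 1)
        return 1.0 - min(abs(a - b) / denom, 1.0)
    # String attributes: a simple edit-based similarity ratio.
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()


def record_similarity(r1, r2, fields):
    """Average the per-attribute similarities (one possible composition)."""
    scores = [attribute_similarity(r1.get(f), r2.get(f)) for f in fields]
    return sum(scores) / len(scores)


def deduplicate(records, fields, threshold=0.85):
    """Return index pairs of likely duplicates via pairwise comparison.

    A trained model (e.g. the paper's LSTM variant) would replace this
    fixed threshold with a learned match/non-match decision.
    """
    duplicates = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if record_similarity(records[i], records[j], fields) >= threshold:
                duplicates.append((i, j))
    return duplicates


# Example: three records, two of which are near-duplicates.
records = [
    {"name": "John Smith", "city": "Hyderabad", "age": 30},
    {"name": "Jon Smith", "city": "Hyderabad", "age": 30},
    {"name": "Mary Jones", "city": "Delhi", "age": 45},
]
print(deduplicate(records, ["name", "city", "age"]))  # → [(0, 1)]
```

The quadratic pairwise loop is only workable for small collections; production systems (including Deep ER-style pipelines) first apply blocking to restrict which pairs are compared.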


Figures (PMC11251143):
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fc8d/11251143/8238a266bd45/41598_2024_63242_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fc8d/11251143/dacfe064b8cc/41598_2024_63242_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fc8d/11251143/b6680e75b6db/41598_2024_63242_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fc8d/11251143/4915bb009405/41598_2024_63242_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fc8d/11251143/7a5a617c439d/41598_2024_63242_Figa_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fc8d/11251143/e6716d4713c5/41598_2024_63242_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fc8d/11251143/1966ef9789fb/41598_2024_63242_Fig6_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fc8d/11251143/6ea66e216d42/41598_2024_63242_Fig7_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fc8d/11251143/187ad24c82bd/41598_2024_63242_Fig8_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fc8d/11251143/9cb6bd106bdd/41598_2024_63242_Fig9_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fc8d/11251143/292ba89ae77b/41598_2024_63242_Fig10_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fc8d/11251143/eea4c50b194c/41598_2024_63242_Fig11_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fc8d/11251143/92e90006a852/41598_2024_63242_Fig12_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fc8d/11251143/0ef865aceb0f/41598_2024_63242_Fig13_HTML.jpg

Similar articles

1. An efficient learning based approach for automatic record deduplication with benchmark datasets.
   Sci Rep. 2024 Jul 15;14(1):16254. doi: 10.1038/s41598-024-63242-1.
2. FDup: a framework for general-purpose and efficient entity deduplication of record collections.
   PeerJ Comput Sci. 2022 Sep 6;8:e1058. doi: 10.7717/peerj-cs.1058. eCollection 2022.
3. Time series forecasting of new cases and new deaths rate for COVID-19 using deep learning methods.
   Results Phys. 2021 Aug;27:104495. doi: 10.1016/j.rinp.2021.104495. Epub 2021 Jun 26.
4. Better duplicate detection for systematic reviewers: evaluation of Systematic Review Assistant-Deduplication Module.
   Syst Rev. 2015 Jan 14;4(1):6. doi: 10.1186/2046-4053-4-6.
5. Reducing systematic review burden using Deduklick: a novel, automated, reliable, and explainable deduplication algorithm to foster medical research.
   Syst Rev. 2022 Aug 17;11(1):172. doi: 10.1186/s13643-022-02045-9.
6. A deep learning approach for Named Entity Recognition in Urdu language.
   PLoS One. 2024 Mar 28;19(3):e0300725. doi: 10.1371/journal.pone.0300725. eCollection 2024.
7. An End-to-End Multi-Channel Convolutional Bi-LSTM Network for Automatic Sleep Stage Detection.
   Sensors (Basel). 2023 May 21;23(10):4950. doi: 10.3390/s23104950.
8. Automation of duplicate record detection for systematic reviews: Deduplicator.
   Syst Rev. 2024 Aug 2;13(1):206. doi: 10.1186/s13643-024-02619-9.
9. LSTM Model for Prediction of Heart Failure in Big Data.
   J Med Syst. 2019 Mar 19;43(5):111. doi: 10.1007/s10916-019-1243-3.
10. Water quality assessment of a river using deep learning Bi-LSTM methodology: forecasting and validation.
   Environ Sci Pollut Res Int. 2022 Feb;29(9):12875-12889. doi: 10.1007/s11356-021-13875-w. Epub 2021 May 14.

References cited in this article

1. Reducing systematic review burden using Deduklick: a novel, automated, reliable, and explainable deduplication algorithm to foster medical research.
   Syst Rev. 2022 Aug 17;11(1):172. doi: 10.1186/s13643-022-02045-9.
2. Artificial intelligence approaches and mechanisms for big data analytics: a systematic study.
   PeerJ Comput Sci. 2021 Apr 14;7:e488. doi: 10.7717/peerj-cs.488. eCollection 2021.
3. Identifying and characterizing highly similar notes in big clinical note datasets.
   J Biomed Inform. 2018 Jun;82:63-69. doi: 10.1016/j.jbi.2018.04.009. Epub 2018 Apr 19.
4. On the Accuracy and Scalability of Probabilistic Data Linkage Over the Brazilian 114 Million Cohort.
   IEEE J Biomed Health Inform. 2018 Mar;22(2):346-353. doi: 10.1109/JBHI.2018.2796941.
5. Long short-term memory.
   Neural Comput. 1997 Nov 15;9(8):1735-80. doi: 10.1162/neco.1997.9.8.1735.