


An efficient learning based approach for automatic record deduplication with benchmark datasets.

Authors

Ravikanth M, Korra Sampath, Mamidisetti Gowtham, Goutham Maganti, Bhaskar T

Affiliations

Department of CSE, Malla Reddy University, Maisammaguda, Kompally, Hyderabad, India.

Department of CSE, Sri Indu College of Engineering and Technology (A), Sheriguda, Ibrahimpatnam, Hyderabad, T.S, 501510, India.

Publication

Sci Rep. 2024 Jul 15;14(1):16254. doi: 10.1038/s41598-024-63242-1.

DOI: 10.1038/s41598-024-63242-1
PMID: 39009682
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11251143/
Abstract

With ongoing technological innovation, real-world enterprises manage every piece of data they hold, since it can be mined to derive business intelligence (BI). However, when data comes from multiple sources, duplicate records may result. Because data is given paramount importance, eliminating duplicate entities is also significant for data integration, performance, and resource optimization. For building reliable record-deduplication systems, deep learning has of late offered promising learning-based approaches. Deep ER is one such recent deep learning method for eliminating duplicates in structured data. Using it as a reference model, in this paper we propose a framework called Enhanced Deep Learning-based Record Deduplication (EDL-RD) to improve performance further. To this end, we exploit a variant of Long Short-Term Memory (LSTM) together with various attribute compositions, similarity metrics, and numerical and null-value resolution. We propose an algorithm called Efficient Learning based Record Deduplication (ELbRD), which extends the reference model with the aforementioned enhancements. An empirical study reveals that the proposed framework with these extensions outperforms existing methods.
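The abstract describes comparing record pairs through attribute-level similarity metrics with explicit handling of numeric and null values before a learned matcher decides on duplicates. Below is a minimal sketch of that pair-comparison stage only, not the authors' LSTM-based ELbRD algorithm; the field names, the neutral 0.5 score for nulls, and the 0.85 threshold are illustrative assumptions.

```python
from difflib import SequenceMatcher


def attribute_similarity(a, b):
    """Score two attribute values in [0, 1]."""
    # Null-value resolution: a missing value is treated as neutral evidence
    # (assumed convention, not the paper's exact rule).
    if a is None or b is None:
        return 0.5
    # Numeric resolution: scaled absolute difference instead of string matching.
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        denom = max(abs(a), abs(b), 1)
        return 1.0 - min(abs(a - b) / denom, 1.0)
    # String attributes: a simple edit-based similarity ratio.
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()


def record_similarity(r1, r2, fields):
    """Average the per-attribute similarities (one possible composition)."""
    scores = [attribute_similarity(r1.get(f), r2.get(f)) for f in fields]
    return sum(scores) / len(scores)


def deduplicate(records, fields, threshold=0.85):
    """Return index pairs of likely duplicates via pairwise comparison.

    A trained model (e.g. the paper's LSTM variant) would replace this
    fixed threshold with a learned match/non-match decision.
    """
    duplicates = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if record_similarity(records[i], records[j], fields) >= threshold:
                duplicates.append((i, j))
    return duplicates


# Example: three records, two of which are near-duplicates.
records = [
    {"name": "John Smith", "city": "Hyderabad", "age": 30},
    {"name": "Jon Smith", "city": "Hyderabad", "age": 30},
    {"name": "Mary Jones", "city": "Delhi", "age": 45},
]
print(deduplicate(records, ["name", "city", "age"]))  # → [(0, 1)]
```

The quadratic pairwise loop is only workable for small collections; production systems (including Deep ER-style pipelines) first apply blocking to restrict which pairs are compared.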


Figures (PMC11251143):
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fc8d/11251143/8238a266bd45/41598_2024_63242_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fc8d/11251143/dacfe064b8cc/41598_2024_63242_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fc8d/11251143/b6680e75b6db/41598_2024_63242_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fc8d/11251143/4915bb009405/41598_2024_63242_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fc8d/11251143/7a5a617c439d/41598_2024_63242_Figa_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fc8d/11251143/e6716d4713c5/41598_2024_63242_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fc8d/11251143/1966ef9789fb/41598_2024_63242_Fig6_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fc8d/11251143/6ea66e216d42/41598_2024_63242_Fig7_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fc8d/11251143/187ad24c82bd/41598_2024_63242_Fig8_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fc8d/11251143/9cb6bd106bdd/41598_2024_63242_Fig9_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fc8d/11251143/292ba89ae77b/41598_2024_63242_Fig10_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fc8d/11251143/eea4c50b194c/41598_2024_63242_Fig11_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fc8d/11251143/92e90006a852/41598_2024_63242_Fig12_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fc8d/11251143/0ef865aceb0f/41598_2024_63242_Fig13_HTML.jpg

Similar articles

1. An efficient learning based approach for automatic record deduplication with benchmark datasets.
   Sci Rep. 2024 Jul 15;14(1):16254. doi: 10.1038/s41598-024-63242-1.
2. FDup: a framework for general-purpose and efficient entity deduplication of record collections.
   PeerJ Comput Sci. 2022 Sep 6;8:e1058. doi: 10.7717/peerj-cs.1058. eCollection 2022.
3. Time series forecasting of new cases and new deaths rate for COVID-19 using deep learning methods.
   Results Phys. 2021 Aug;27:104495. doi: 10.1016/j.rinp.2021.104495. Epub 2021 Jun 26.
4. Better duplicate detection for systematic reviewers: evaluation of Systematic Review Assistant-Deduplication Module.
   Syst Rev. 2015 Jan 14;4(1):6. doi: 10.1186/2046-4053-4-6.
5. Reducing systematic review burden using Deduklick: a novel, automated, reliable, and explainable deduplication algorithm to foster medical research.
   Syst Rev. 2022 Aug 17;11(1):172. doi: 10.1186/s13643-022-02045-9.
6. A deep learning approach for Named Entity Recognition in Urdu language.
   PLoS One. 2024 Mar 28;19(3):e0300725. doi: 10.1371/journal.pone.0300725. eCollection 2024.
7. An End-to-End Multi-Channel Convolutional Bi-LSTM Network for Automatic Sleep Stage Detection.
   Sensors (Basel). 2023 May 21;23(10):4950. doi: 10.3390/s23104950.
8. Automation of duplicate record detection for systematic reviews: Deduplicator.
   Syst Rev. 2024 Aug 2;13(1):206. doi: 10.1186/s13643-024-02619-9.
9. LSTM Model for Prediction of Heart Failure in Big Data.
   J Med Syst. 2019 Mar 19;43(5):111. doi: 10.1007/s10916-019-1243-3.
10. Water quality assessment of a river using deep learning Bi-LSTM methodology: forecasting and validation.
   Environ Sci Pollut Res Int. 2022 Feb;29(9):12875-12889. doi: 10.1007/s11356-021-13875-w. Epub 2021 May 14.

References cited in this article

1. Reducing systematic review burden using Deduklick: a novel, automated, reliable, and explainable deduplication algorithm to foster medical research.
   Syst Rev. 2022 Aug 17;11(1):172. doi: 10.1186/s13643-022-02045-9.
2. Artificial intelligence approaches and mechanisms for big data analytics: a systematic study.
   PeerJ Comput Sci. 2021 Apr 14;7:e488. doi: 10.7717/peerj-cs.488. eCollection 2021.
3. Identifying and characterizing highly similar notes in big clinical note datasets.
   J Biomed Inform. 2018 Jun;82:63-69. doi: 10.1016/j.jbi.2018.04.009. Epub 2018 Apr 19.
4. On the Accuracy and Scalability of Probabilistic Data Linkage Over the Brazilian 114 Million Cohort.
   IEEE J Biomed Health Inform. 2018 Mar;22(2):346-353. doi: 10.1109/JBHI.2018.2796941.
5. Long short-term memory.
   Neural Comput. 1997 Nov 15;9(8):1735-80. doi: 10.1162/neco.1997.9.8.1735.