FDup: a framework for general-purpose and efficient entity deduplication of record collections.

Authors

De Bonis Michele, Manghi Paolo, Atzori Claudio

Affiliation

Istituto di Scienza e Tecnologie dell'Informazione "A. Faedo" (ISTI), Consiglio Nazionale delle Ricerche (CNR), Pisa, Italy.

Publication information

PeerJ Comput Sci. 2022 Sep 6;8:e1058. doi: 10.7717/peerj-cs.1058. eCollection 2022.

DOI: 10.7717/peerj-cs.1058
PMID: 36262137
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9575841/

Abstract

Deduplication is a technique for identifying and resolving duplicate metadata records in a collection. This article describes FDup (Flat Collections Deduper), a general-purpose software framework that supports a complete deduplication workflow for big data record collections: definition of the metadata record data model, identification of candidate duplicates, and identification of duplicates. FDup brings two main innovations. First, it delivers a full deduplication framework in a single easy-to-use software package based on the Apache Spark Hadoop framework, in which developers can customize the optimal, parallel workflow steps of blocking, sliding windows, and similarity matching via an intuitive configuration file. Second, it improves performance beyond the known techniques of "blocking" and "sliding window" by introducing a smart similarity matching function, T-match. T-match is engineered as a decision tree that drives the comparison of the fields of two records as branches of predicates and allows early exit on either a successful or an unsuccessful match. The efficacy of the approach is demonstrated by experiments on big data collections of metadata records in the OpenAIRE Research Graph, a well-known open access knowledge base in scholarly communication.
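
The three ideas named in the abstract (blocking, a sliding window, and a decision-tree matcher with early exits) can be illustrated with a small single-machine sketch. This is not FDup's code or configuration format: FDup runs on Apache Spark, whereas the example below is plain Python, and the record fields, the blocking key, the window size, the 0.9 title threshold, and every function name are assumptions made only for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical records; the field names are illustrative, not FDup's data model.
RECORDS = [
    {"id": "r1", "doi": "10.1000/abc", "title": "Entity deduplication at scale", "year": 2021},
    {"id": "r2", "doi": "10.1000/abc", "title": "Entity de-duplication at scale", "year": 2021},
    {"id": "r3", "doi": None, "title": "Entity deduplication at scale.", "year": 2021},
    {"id": "r4", "doi": None, "title": "Entity resolution in practice", "year": 2019},
    {"id": "r5", "doi": "10.1000/xyz", "title": "Graph databases explained", "year": 2021},
]


def blocking_key(record):
    """Assumed blocking key: first six alphanumeric characters of the title."""
    letters = [c for c in record["title"].lower() if c.isalnum()]
    return "".join(letters)[:6]


def candidate_pairs(records, window=3):
    """Blocking plus sliding window: only records that share a block and sit
    within `window` positions of each other (after sorting) are paired."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)
    for block in blocks.values():
        block.sort(key=lambda r: r["title"].lower())
        for i, left in enumerate(block):
            # Compare each record with the next (window - 1) records only.
            for right in block[i + 1 : i + window]:
                yield left, right


def title_similarity(a, b):
    """Comparatively expensive string comparison, evaluated last."""
    return SequenceMatcher(None, a["title"].lower(), b["title"].lower()).ratio()


def t_match(a, b):
    """Decision-tree style matcher in the spirit of T-match.
    Returns (is_duplicate, reason) and exits as early as possible."""
    # Node 1: if both records carry a DOI, that alone decides the outcome.
    if a["doi"] and b["doi"]:
        if a["doi"] == b["doi"]:
            return True, "same doi"        # successful early exit
        return False, "different doi"      # unsuccessful early exit
    # Node 2: cheap check that can rule the pair out before any string work.
    if a["year"] != b["year"]:
        return False, "different year"     # unsuccessful early exit
    # Node 3: the title comparison runs only when the nodes above were inconclusive.
    if title_similarity(a, b) >= 0.9:
        return True, "similar title"
    return False, "dissimilar title"


def find_duplicates(records, window=3):
    """Run the matcher over the candidate pairs and return sorted id pairs."""
    duplicates = []
    for a, b in candidate_pairs(records, window):
        is_duplicate, _reason = t_match(a, b)
        if is_duplicate:
            duplicates.append(tuple(sorted((a["id"], b["id"]))))
    return sorted(duplicates)


if __name__ == "__main__":
    for pair in find_duplicates(RECORDS):
        print(pair)
```

In this toy data set, blocking keeps r5 away from the other four records, the window drops one of the six possible pairs inside the remaining block, and the year check rejects both surviving pairs that involve r4 before any title comparison is made. Ordering the tree from cheap, decisive predicates to expensive ones is the design choice that makes early exits pay off.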
