
FDup: a framework for general-purpose and efficient entity deduplication of record collections.

Author information

De Bonis Michele, Manghi Paolo, Atzori Claudio

Affiliation

Istituto di Scienza e Tecnologie dell'Informazione "A. Faedo" (ISTI), Consiglio Nazionale delle Ricerche (CNR), Pisa, Italy.

Publication information

PeerJ Comput Sci. 2022 Sep 6;8:e1058. doi: 10.7717/peerj-cs.1058. eCollection 2022.

Abstract

Deduplication is a technique for identifying and resolving duplicate metadata records in a collection. This article describes FDup (Flat Collections Deduper), a general-purpose software framework that supports a complete deduplication workflow for big data record collections: definition of the metadata record data model, identification of candidate duplicates, and identification of duplicates. FDup brings two main innovations. First, it delivers a full deduplication framework in a single easy-to-use software package based on the Apache Spark Hadoop framework, where developers can customize the optimal and parallel workflow steps of blocking, sliding windows, and similarity matching via an intuitive configuration file. Second, it improves performance beyond the known techniques of "blocking" and "sliding window" by introducing a smart similarity matching function, T-match. T-match is engineered as a decision tree that drives the comparisons of the fields of two records as branches of predicates and allows early exits on both successful and unsuccessful matches. The efficacy of the approach is demonstrated by experiments performed over big data collections of metadata records in the OpenAIRE Research Graph, a well-known open access knowledge base in scholarly communication.
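
The three steps named in the abstract (blocking, sliding window, and a decision-tree match with early exits) can be illustrated with a minimal sketch. This is plain Python rather than Spark, and it is not FDup's code or configuration format: the record fields, the blocking key, the window size, the similarity threshold, and every function name below are assumptions chosen only for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical records; the field names are illustrative, not FDup's data model.
RECORDS = [
    {"id": "r1", "doi": "10.1/abc", "title": "Entity deduplication at scale"},
    {"id": "r2", "doi": "10.1/abc", "title": "Entity de-duplication at scale"},
    {"id": "r3", "doi": "", "title": "Entity deduplication at scale."},
    {"id": "r4", "doi": "10.9/xyz", "title": "Graph databases in practice"},
]


def blocking_key(record):
    """Blocking: a cheap key so that only records sharing it are compared.

    Assumed key: first six alphanumeric characters of the lowercased title.
    """
    normalized = "".join(c for c in record["title"].lower() if c.isalnum())
    return normalized[:6]


def candidate_pairs(records, window=3):
    """Sliding window: inside each sorted block, pair each record only with
    the next (window - 1) records instead of with every other record."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)
    for block in blocks.values():
        block.sort(key=lambda r: r["title"].lower())
        for i, left in enumerate(block):
            for right in block[i + 1 : i + window]:
                yield left, right


def title_similarity(a, b):
    """Comparatively expensive string comparison, reached only if needed."""
    return SequenceMatcher(None, a["title"].lower(), b["title"].lower()).ratio()


def tree_match(a, b, threshold=0.9):
    """Decision tree over field comparisons with early exits.

    Returns (is_duplicate, reason). The DOI node decides the outcome on its
    own whenever both records carry a DOI, so the title node is skipped.
    """
    if a["doi"] and b["doi"]:
        if a["doi"] == b["doi"]:
            return True, "doi_equal"   # successful early exit
        return False, "doi_differ"     # unsuccessful early exit
    if title_similarity(a, b) >= threshold:
        return True, "title_similar"
    return False, "title_differ"


def find_duplicates(records):
    """Run the match only on candidate pairs and keep the positive ones."""
    duplicates = []
    for a, b in candidate_pairs(records):
        is_duplicate, reason = tree_match(a, b)
        if is_duplicate:
            duplicates.append((a["id"], b["id"], reason))
    return duplicates


if __name__ == "__main__":
    for left_id, right_id, reason in find_duplicates(RECORDS):
        print(left_id, right_id, reason)
```

In this sketch the ordering of the tree nodes carries the performance idea: the cheap, decisive comparison sits at the root, and the costlier string similarity runs only for pairs the root cannot settle.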

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2320/9575841/e0ca3534543e/peerj-cs-08-1058-g001.jpg
