Benchmarks for measurement of duplicate detection methods in nucleotide databases.

Author information

Chen Qingyu, Zobel Justin, Verspoor Karin

Affiliation

Department of Computing and Information Systems, The University of Melbourne, Parkville, VIC 3010, Australia.

Publication information

Database (Oxford). 2017 Jan 8;2023. doi: 10.1093/database/baw164.

Abstract

Duplication of information in databases is a major data quality challenge. The presence of duplicates, implying either redundancy or inconsistency, can have a range of impacts on the quality of analyses that use the data. To provide a sound basis for research on this issue in databases of nucleotide sequences, we have developed new, large-scale validated collections of duplicates, which can be used to test the effectiveness of duplicate detection methods. Previous collections were either designed primarily to test efficiency, or contained only a limited number of duplicates of limited kinds. To date, duplicate detection methods have been evaluated on separate, inconsistent benchmarks, leading to results that cannot be compared and, due to limitations of the benchmarks, of questionable generality. In this study, we present three nucleotide sequence database benchmarks, based on information drawn from a range of resources, including information derived from mapping to two data sections within the UniProt Knowledgebase (UniProtKB), UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. Each benchmark has distinct characteristics. We quantify these characteristics and argue for their complementary value in evaluation. The benchmarks collectively contain a vast number of validated biological duplicates; the largest has nearly half a billion duplicate pairs (although this is probably only a tiny fraction of the total that is present). They are also the first benchmarks targeting the primary nucleotide databases. The records include the 21 most heavily studied organisms in molecular biology research. Our quantitative analysis shows that duplicates in the different benchmarks, and in different organisms, have different characteristics. It is thus unreliable to evaluate duplicate detection methods against any single benchmark. 
For example, the benchmark derived from UniProtKB/Swiss-Prot mappings identifies more diverse types of duplicates, showing the importance of expert curation, but is limited to coding sequences. Overall, these benchmarks form a resource that we believe will be of great value for development and evaluation of the duplicate detection or record linkage methods that are required to help maintain these essential resources.
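The point about not relying on a single benchmark can be illustrated with a minimal sketch. It scores one set of predicted duplicate pairs against two gold-standard pair collections and shows how the same predictions yield different precision and recall. The benchmark names, record identifiers, and the `evaluate` helper are hypothetical and are not taken from the published benchmarks; the real collections are far larger and may be distributed in a different format.

```python
def normalize(pair):
    """Order a pair so that (a, b) and (b, a) compare as the same duplicate pair."""
    a, b = pair
    return (a, b) if a <= b else (b, a)


def evaluate(predicted, gold):
    """Return (precision, recall) of predicted pairs against a gold pair set."""
    pred = {normalize(p) for p in predicted}
    truth = {normalize(p) for p in gold}
    true_positives = len(pred & truth)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical toy benchmarks: each maps a name to validated duplicate pairs.
benchmarks = {
    "benchmark_a": [("R1", "R2"), ("R3", "R4")],
    "benchmark_b": [("R1", "R2"), ("R5", "R6"), ("R7", "R8"), ("R9", "R10")],
}

# Hypothetical output of some duplicate detection method.
predicted = [("R2", "R1"), ("R3", "R4"), ("R5", "R7")]

results = {name: evaluate(predicted, gold) for name, gold in benchmarks.items()}

for name, (precision, recall) in results.items():
    print(f"{name}: precision={precision:.2f} recall={recall:.2f}")
```

Because the abstract notes that the validated pairs are probably only a small fraction of all duplicates present, a predicted pair missing from a benchmark is not necessarily wrong, so precision computed this way should be read as a lower bound.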

DATABASE URL

https://bitbucket.org/biodbqual/benchmarks
