Benchmarks for measurement of duplicate detection methods in nucleotide databases.

Author information

Chen Qingyu, Zobel Justin, Verspoor Karin

Affiliation

Department of Computing and Information Systems, The University of Melbourne, Parkville, VIC 3010, Australia.

Publication information

Database (Oxford). 2017 Jan 8;2023. doi: 10.1093/database/baw164.

DOI:10.1093/database/baw164

PMID:28334741

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10755258/

Abstract

Duplication of information in databases is a major data quality challenge. The presence of duplicates, implying either redundancy or inconsistency, can have a range of impacts on the quality of analyses that use the data. To provide a sound basis for research on this issue in databases of nucleotide sequences, we have developed new, large-scale validated collections of duplicates, which can be used to test the effectiveness of duplicate detection methods. Previous collections were either designed primarily to test efficiency, or contained only a limited number of duplicates of limited kinds. To date, duplicate detection methods have been evaluated on separate, inconsistent benchmarks, leading to results that cannot be compared and, due to limitations of the benchmarks, of questionable generality. In this study, we present three nucleotide sequence database benchmarks, based on information drawn from a range of resources, including information derived from mapping to two data sections within the UniProt Knowledgebase (UniProtKB), UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. Each benchmark has distinct characteristics. We quantify these characteristics and argue for their complementary value in evaluation. The benchmarks collectively contain a vast number of validated biological duplicates; the largest has nearly half a billion duplicate pairs (although this is probably only a tiny fraction of the total that is present). They are also the first benchmarks targeting the primary nucleotide databases. The records include the 21 most heavily studied organisms in molecular biology research. Our quantitative analysis shows that duplicates in the different benchmarks, and in different organisms, have different characteristics. It is thus unreliable to evaluate duplicate detection methods against any single benchmark. 
For example, the benchmark derived from UniProtKB/Swiss-Prot mappings identifies more diverse types of duplicates, showing the importance of expert curation, but is limited to coding sequences. Overall, these benchmarks form a resource that we believe will be of great value for development and evaluation of the duplicate detection or record linkage methods that are required to help maintain these essential resources.

DATABASE URL

https://bitbucket.org/biodbqual/benchmarks
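As an illustration of why results should be reported per benchmark rather than against a single one, the following is a minimal sketch, not the authors' evaluation code. It assumes that each benchmark can be reduced to a set of unordered duplicate pairs of record identifiers; the benchmark names, identifiers and the `precision_recall` helper are all hypothetical, and the actual file format in the repository is not described in the abstract.

```python
from typing import Dict, Iterable, List, Set, Tuple

Pair = Tuple[str, str]


def normalize(pairs: Iterable[Pair]) -> Set[Pair]:
    """Treat (a, b) and (b, a) as the same unordered pair; drop self-pairs."""
    return {tuple(sorted(p)) for p in pairs if p[0] != p[1]}


def precision_recall(predicted: Iterable[Pair], gold: Iterable[Pair]) -> Tuple[float, float]:
    """Precision and recall of predicted duplicate pairs against one benchmark."""
    pred, ref = normalize(predicted), normalize(gold)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    return precision, recall


# Hypothetical toy data: identifiers and benchmark names are placeholders,
# not records from the published collections.
benchmarks: Dict[str, List[Pair]] = {
    "benchmark_A": [("ID1", "ID2"), ("ID3", "ID4")],
    "benchmark_B": [("ID1", "ID2"), ("ID5", "ID6"), ("ID7", "ID8"), ("ID9", "ID10")],
}
predicted: List[Pair] = [("ID2", "ID1"), ("ID3", "ID4"), ("ID11", "ID12")]

# Score the same predictions against every benchmark separately.
scores = {name: precision_recall(predicted, gold) for name, gold in benchmarks.items()}

for name, (precision, recall) in scores.items():
    print(f"{name}: precision={precision:.2f} recall={recall:.2f}")
```

In this toy case the same predictions reach full recall on one benchmark and low recall on the other, which is the kind of divergence the abstract warns about. Because the benchmarks are stated to contain only a fraction of the duplicates actually present, a predicted pair missing from a benchmark is not necessarily a false positive, so precision computed this way is best read as a lower bound.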
