Duplicates, redundancies and inconsistencies in the primary nucleotide databases: a descriptive study.

Authors

Chen Qingyu, Zobel Justin, Verspoor Karin

Affiliation

Department of Computing and Information Systems, The University of Melbourne, Parkville, VIC, 3010, Australia.

Publication information

Database (Oxford). 2017 Jan 10;2017. doi: 10.1093/database/baw163. Print 2017.

DOI:10.1093/database/baw163

PMID:28077566

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5225397/

Abstract

GenBank, the EMBL European Nucleotide Archive and the DNA DataBank of Japan, known collectively as the International Nucleotide Sequence Database Collaboration or INSDC, are the three most significant nucleotide sequence databases. Their records are derived from laboratory work undertaken by different individuals, by different teams, with a range of technologies and assumptions and over a period of decades. As a consequence, they contain a great many duplicates, redundancies and inconsistencies, but neither the prevalence nor the characteristics of various types of duplicates have been rigorously assessed. Existing duplicate detection methods in bioinformatics only address specific duplicate types, with inconsistent assumptions; and the impact of duplicates in bioinformatics databases has not been carefully assessed, making it difficult to judge the value of such methods. Our goal is to assess the scale, kinds and impact of duplicates in bioinformatics databases, through a retrospective analysis of merged groups in INSDC databases. Our outcomes are threefold: (1) We analyse a benchmark dataset consisting of duplicates manually identified in INSDC (a dataset of 67 888 merged groups with 111 823 duplicate pairs across 21 organisms from INSDC databases) in terms of the prevalence, types and impacts of duplicates. (2) We categorize duplicates at both sequence and annotation level, with supporting quantitative statistics, showing that different organisms have different prevalence of distinct kinds of duplicate. (3) We show that the presence of duplicates has a practical impact via a simple case study on duplicates, in terms of GC content and melting temperature. We demonstrate that duplicates not only introduce redundancy, but can lead to inconsistent results for certain tasks. Our findings lead to a better understanding of the problem of duplication in biological databases.

Database URL: the merged records are available at https://cloudstor.aarnet.edu.au/plus/index.php/s/Xef2fvsebBEAv9w.
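To make the third outcome concrete, the following is a minimal sketch of how two records in one merged group can give different GC content and melting temperature. It is not the procedure used in the study: the two sequences are invented for illustration, the function names are hypothetical, and melting temperature is estimated with the simple Wallace rule (2 °C per A/T, 4 °C per G/C), which is only a rough approximation for short oligonucleotides.

```python
def gc_content(seq: str) -> float:
    """Return the fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(seq: str) -> int:
    """Estimate melting temperature (deg C) with the Wallace rule."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


# Hypothetical merged group: record_b is record_a plus four extra bases,
# as might happen when a longer version of a sequence replaces a shorter one.
record_a = "ATGCGCGTATTAGC"
record_b = "ATGCGCGTATTAGCGGCC"

for name, seq in (("record_a", record_a), ("record_b", record_b)):
    print(f"{name}: GC={gc_content(seq):.3f}, Tm={wallace_tm(seq)} C")
```

In this toy pair the GC fraction rises from 0.5 to about 0.611 and the Wallace estimate from 42 to 58, so an analysis that treats the two records as interchangeable would report inconsistent values depending on which one it happened to read.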
