In Vouchers We (Hope to) Trust: Unveiling Hidden Errors in GenBank's Tetrapod Taxonomic Foundations.

Author information

Carné Albert, Vieites David R, van den Burg Matthijs P

Affiliations

Science and Business S.L., Edificio CITEXVI, Campus Universitario de Vigo, Vigo, Galicia, Spain.

Department of Biodiversity and Evolutionary Biology, Museo Nacional de Ciencias Naturales (MNCN), CSIC, Madrid, Spain.

Publication information

Mol Ecol. 2025 Jul;34(13):e17812. doi: 10.1111/mec.17812. Epub 2025 Jun 3.

Abstract

Genetic repositories are invaluable resources foundational to various biological disciplines. While their data and metadata reliability are essential for robust research outcomes, numerous studies have highlighted data quality and consistency issues. Here, we detect and quantify errors at the most fundamental level by analysing the congruence of sequences derived from the same genetic marker and specimen voucher across tetrapods. Our analysis reveals that 32% of re-sequenced vouchers (with identical field or museum numbers) yield unequal sequences, ranging from a few mutations to significant divergences (0.06%-33.95%). These divergences may result from sample misidentification, labelling errors, fidelity disparities between sequencing methods, or contamination at various stages of the research process. Our findings demonstrate errors within GenBank at its most basal level and suggest that, although undetectable, a similar error rate likely exists in non-re-sequenced data. These previously overlooked errors are concerning because they arise from replicated experiments, which are uncommon, and raise serious questions about the reliability of non-re-sequenced specimens. Such errors can compromise the accuracy of biodiversity assessments (e.g., taxonomic assessment, eDNA and barcoding), phylogenetic analyses and conservation planning by artificially inflating the intraspecific divergence or misidentifying (to-be-described) species. Additionally, the accuracy of large-scale biological studies that rely on such data can be compromised. Our concerning results call for protocols ensuring sample traceability to the specimens or tissues during the whole process of data generation, analysis and deposition in a database. We propose a third-party annotation system for individual GenBank records that would allow flagging common errors and alert both the original submitter and all users to potential problems without modifying the original records.
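The congruence check described in the abstract (comparing sequences of the same genetic marker obtained from the same specimen voucher) can be illustrated with a minimal sketch. It is not the authors' pipeline, which the abstract does not detail. It assumes the sequences are already aligned to equal length, uses an uncorrected p-distance as the divergence measure, and takes records as hypothetical `(voucher, marker, accession, sequence)` tuples; the function names and the example identifiers are illustrative only.

```python
from collections import defaultdict
from itertools import combinations


def p_distance(a, b):
    """Uncorrected p-distance between two aligned, equal-length sequences.

    Sites where either sequence has a gap ('-') or an 'N' are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    compared = 0
    diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "-N" or y in "-N":
            continue
        compared += 1
        if x != y:
            diffs += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return diffs / compared


def incongruent_vouchers(records, threshold=0.0):
    """Group records by (voucher, marker) and report the largest pairwise
    p-distance for every group that exceeds `threshold`.

    `records` is an iterable of (voucher, marker, accession, sequence)
    tuples; this layout is an assumption made for the sketch.
    """
    groups = defaultdict(list)
    for voucher, marker, _accession, seq in records:
        groups[(voucher, marker)].append(seq)

    flagged = {}
    for key, seqs in groups.items():
        if len(seqs) < 2:
            continue  # voucher was not re-sequenced for this marker
        worst = max(p_distance(s1, s2) for s1, s2 in combinations(seqs, 2))
        if worst > threshold:
            flagged[key] = worst
    return flagged


# Hypothetical toy records, not real GenBank data.
records = [
    ("V1", "COI", "acc_a", "ACGTACGTAC"),
    ("V1", "COI", "acc_b", "ACGTACGTAC"),  # identical re-sequencing
    ("V2", "COI", "acc_c", "ACGTACGTAC"),
    ("V2", "COI", "acc_d", "ACGTTCGTAA"),  # differs at 2 of 10 sites
    ("V3", "16S", "acc_e", "ACGT"),        # sequenced only once
]

print(incongruent_vouchers(records))
```

With the toy records above, only voucher `V2` is flagged, with a divergence of 0.2 (20%). On real data, sequences would first need to be aligned, and a non-zero `threshold` could be used to separate a few mismatches from large divergences.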
