Martí Jose Manuel, Kok Car Reen, Thissen James B, Mulakken Nisha J, Avila-Herrera Aram, Jaing Crystal J, Allen Jonathan E, Be Nicholas A
Global Security Computing Applications Division, Lawrence Livermore National Laboratory, Livermore, California, USA.
Biosciences and Biotechnology Division, Lawrence Livermore National Laboratory, Livermore, California, USA.
mSystems. 2025 Apr 22;10(4):e0123924. doi: 10.1128/msystems.01239-24. Epub 2025 Mar 20.
Accurate metagenomic classification relies on comprehensive, up-to-date, and validated reference databases. The NCBI BLAST Nucleotide (nt) database, which encompasses a vast collection of sequences from all domains of life, is an invaluable resource, but its massive size (currently exceeding 10^12 nucleotides) and exponential growth pose significant challenges for researchers seeking to maintain current nt-based indices for metagenomic classification. Recognizing that no current nt-based index exists for the widely used Centrifuge classifier, and that the last publicly available version was released in 2018, we addressed this critical gap by leveraging advanced high-performance computing resources. We present new Centrifuge-compatible nt databases, constructed using a novel pipeline that incorporates several quality control measures, including reference decontamination and filtering. These measures demonstrably reduce spurious classifications, as shown through our reanalysis of published metagenomic data, where annotations were dramatically reduced using our decontaminated database, highlighting how database quality can significantly impact research conclusions. Through temporal comparisons, we also reveal how our approach minimizes inconsistencies in taxonomic assignments stemming from asynchronous updates between public sequence and taxonomy databases. These discrepancies are particularly evident in certain taxa, where classification accuracy varied significantly across database versions. These new databases, made available as pre-built Centrifuge indexes, respond to the need for an open, robust, nt-based pipeline for taxonomic classification in metagenomics. Applications that require comprehensive taxonomic coverage, such as environmental metagenomics, forensics, and clinical metagenomics, will benefit from this resource.
Our work highlights the importance of treating reference databases as dynamic entities, subject to ongoing quality control and validation akin to software development best practices. This approach is crucial for ensuring accuracy and reliability of metagenomic analysis, especially as databases continue to expand in size and complexity.
Accurately identifying the diverse microbes present in a sample, whether from the human gut, a soil sample, or a crime scene, is crucial for fields ranging from medicine to environmental science. Researchers rely on comprehensive DNA databases to match sequenced DNA fragments to known microbial species. However, the widely used NCBI nt database, while vast, poses significant challenges. Its massive size makes it difficult for many researchers to use effectively with taxonomic classifiers, and inconsistencies and contamination within the database can impact the accuracy of microbial identification. This work addresses these challenges by providing cleaned, updated, and validated nt-based databases specifically optimized for the widely used Centrifuge classification tool. This new resource demonstrably reduces errors and improves the reliability of microbial identification across diverse taxonomic groups. Moreover, by providing readily usable indexes, we overcome the size barrier, enabling researchers to leverage the full potential of the nt database for metagenomic analysis. Our findings underscore the need to treat reference databases as dynamic entities, emphasizing continuous quality control and versioning as essential practices for robust and reproducible metagenomics research.
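The inconsistencies that arise when sequence and taxonomy databases are updated asynchronously can be illustrated with a small consistency check. The sketch below is not the authors' pipeline. It is a minimal example that assumes an accession-to-taxid mapping and a set of taxids taken from a taxonomy snapshot are already loaded in memory. The function name `find_orphan_taxids`, the accessions, and the taxids are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, List, Set


def find_orphan_taxids(
    acc2taxid: Dict[str, int],
    known_taxids: Iterable[int],
) -> Dict[int, List[str]]:
    """Group accessions under every taxid that is absent from the taxonomy snapshot.

    A non-empty result suggests that the sequence release and the taxonomy
    release are out of sync (hypothetical helper, not part of Centrifuge).
    """
    known: Set[int] = set(known_taxids)
    orphans: Dict[int, List[str]] = {}
    for accession, taxid in acc2taxid.items():
        if taxid not in known:
            # The sequence record points at a taxid the taxonomy does not contain.
            orphans.setdefault(taxid, []).append(accession)
    return orphans


if __name__ == "__main__":
    # Toy data: made-up accessions and taxids, for illustration only.
    toy_acc2taxid = {"ACC_0001.1": 100, "ACC_0002.1": 200, "ACC_0003.1": 300, "ACC_0004.1": 300}
    toy_taxonomy = [100, 200]  # taxid 300 is missing from this snapshot

    for taxid, accessions in find_orphan_taxids(toy_acc2taxid, toy_taxonomy).items():
        print(f"taxid {taxid} missing from taxonomy; affected accessions: {accessions}")
```

Running a check of this kind before building an index would show which reference sequences cannot be placed in the taxonomy. This is one plausible way such mismatches could be surfaced under the stated assumptions.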