DFAST_QC: quality assessment and taxonomic identification tool for prokaryotic Genomes.

Author information

Elmanzalawi Mohamed, Fujisawa Takatomo, Mori Hiroshi, Nakamura Yasukazu, Tanizawa Yasuhiro

Affiliations

Department of Genetics, School of Life Science, The Graduate University for Advanced Studies (SOKENDAI), Mishima, 411-8540, Japan.

Department of Informatics, National Institute of Genetics, Mishima, 411-8540, Japan.

Publication information

BMC Bioinformatics. 2025 Jan 7;26(1):3. doi: 10.1186/s12859-024-06030-y.

DOI:10.1186/s12859-024-06030-y

PMID:39773409

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11705978/

Abstract

BACKGROUND

Accurate taxonomic classification in genome databases is essential for reliable biological research and effective data sharing. Mislabeling or inaccuracies in genome annotations can lead to incorrect scientific conclusions and hinder the reproducibility of research findings. Despite advances in genome analysis techniques, challenges persist in ensuring precise and reliable taxonomic assignments. Existing tools for genome verification often involve extensive computational resources or lengthy processing times, which can limit their accessibility and scalability for large-scale projects. There is a need for more efficient, user-friendly solutions that can handle diverse datasets and provide accurate results with minimal computational demands. This work aimed to address these challenges by introducing a novel tool that enhances taxonomic accuracy, offers a user-friendly interface, and supports large-scale analyses.

RESULTS

We introduce DFAST_QC, a novel tool for the quality control and taxonomic classification of prokaryotic genomes, which is available as both a command-line tool and a web service. DFAST_QC can quickly identify species based on NCBI and GTDB taxonomies by combining genome-distance calculations using Mash with ANI calculations using skani. We evaluated DFAST_QC's performance in species identification and found it to be highly consistent with existing taxonomic standards, successfully identifying species across diverse datasets. In several cases, DFAST_QC identified potential mislabeling of species names in public databases and highlighted discrepancies in current classifications, demonstrating its capability to uncover errors and enhance taxonomic accuracy. Additionally, the tool's efficient design allows it to operate smoothly on local machines with minimal computational requirements, making it a practical choice for large-scale genome projects.
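The two-stage idea described above (a fast genome-distance prefilter followed by an ANI check) can be sketched roughly as follows. This is an illustrative sketch only, not DFAST_QC's actual implementation: the function names, the input dictionaries, the candidate count, and the 95% ANI cutoff (a commonly used species-level convention) are all assumptions made for the example. The distance formula is the published Mash distance, D = -(1/k) ln(2j / (1 + j)), with k = 21 assumed as the k-mer size.

```python
import math
from typing import Dict, Optional, Tuple


def mash_distance(jaccard: float, k: int = 21) -> float:
    """Mash distance from a Jaccard estimate j: D = -(1/k) * ln(2j / (1 + j))."""
    if jaccard <= 0.0:
        return 1.0  # no shared k-mers: treat as maximally distant
    d = -math.log(2.0 * jaccard / (1.0 + jaccard)) / k
    return min(1.0, max(0.0, d))  # clamp to [0, 1]


def identify_species(
    jaccard_by_ref: Dict[str, float],
    ani_by_ref: Dict[str, float],
    max_candidates: int = 10,      # hypothetical prefilter size
    ani_threshold: float = 95.0,   # assumed species-level ANI cutoff
) -> Optional[Tuple[str, float]]:
    """Return (reference, ANI) for the best hit above the cutoff, else None."""
    # Stage 1: rank references by Mash distance and keep the closest few.
    ranked = sorted(jaccard_by_ref, key=lambda ref: mash_distance(jaccard_by_ref[ref]))
    candidates = ranked[:max_candidates]

    # Stage 2: compare ANI only for the prefiltered candidates.
    scored = [(ref, ani_by_ref[ref]) for ref in candidates if ref in ani_by_ref]
    if not scored:
        return None
    best = max(scored, key=lambda pair: pair[1])
    return best if best[1] >= ani_threshold else None


if __name__ == "__main__":
    # Hypothetical values; real ones would come from Mash and skani runs.
    jaccard = {"ref_A": 0.80, "ref_B": 0.50, "ref_C": 0.01}
    ani = {"ref_A": 98.2, "ref_B": 91.0, "ref_C": 75.0}
    print(identify_species(jaccard, ani, max_candidates=2))
```

The point of the split is cost: the sketch-based distance is cheap enough to run against every reference, so the more expensive ANI comparison is only needed for a handful of close candidates.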

CONCLUSIONS

DFAST_QC is a reliable and efficient tool for accurate taxonomic identification and genome quality control, well-suited for large-scale genomic studies. Its compatibility with limited-resource environments, combined with its user-friendly design, ensures seamless integration into existing workflows. DFAST_QC's ability to refine species assignments in public databases highlights its value as a complementary tool for maintaining and enhancing the accuracy of taxonomic data in genomic research. The web version is available at https://dfast.ddbj.nig.ac.jp/dqc/submit/, and the source code for local use can be found at https://github.com/nigyta/dfast_qc.

Figure 1 (image): https://cdn.ncbi.nlm.nih.gov/pmc/blobs/df6d/11705978/f1123d6cf49f/12859_2024_6030_Fig1_HTML.jpg

Similar articles

DFAST: a flexible prokaryotic genome annotation pipeline for faster genome publication.

Bioinformatics. 2018 Mar 15;34(6):1037-1039. doi: 10.1093/bioinformatics/btx713.

DFAST and DAGA: web-based integrated genome annotation tools and resources.

Biosci Microbiota Food Health. 2016;35(4):173-184. doi: 10.12938/bmfh.16-003. Epub 2016 Jul 14.

Generating Publication-Ready Prokaryotic Genome Annotations with DFAST.

Methods Mol Biol. 2019;1962:215-226. doi: 10.1007/978-1-4939-9173-0_13.

GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy.

Nucleic Acids Res. 2022 Jan 7;50(D1):D785-D794. doi: 10.1093/nar/gkab776.

proGenomes2: an improved database for accurate and consistent habitat, taxonomic and functional annotations of prokaryotic genomes.

Nucleic Acids Res. 2020 Jan 8;48(D1):D621-D625. doi: 10.1093/nar/gkz1002.

proGenomes: a resource for consistent functional and taxonomic annotations of prokaryotic genomes.

Nucleic Acids Res. 2017 Jan 4;45(D1):D529-D534. doi: 10.1093/nar/gkw989. Epub 2016 Oct 24.

Addressing the dynamic nature of reference data: a new nucleotide database for robust metagenomic classification.

mSystems. 2025 Apr 22;10(4):e0123924. doi: 10.1128/msystems.01239-24. Epub 2025 Mar 20.

Introducing EzAAI: a pipeline for high throughput calculations of prokaryotic average amino acid identity.

J Microbiol. 2021 May;59(5):476-480. doi: 10.1007/s12275-021-1154-0. Epub 2021 Apr 28.

ML-DSP: Machine Learning with Digital Signal Processing for ultrafast, accurate, and scalable genome classification at all taxonomic levels.

BMC Genomics. 2019 Apr 3;20(1):267. doi: 10.1186/s12864-019-5571-y.

Cited by

Complete genome sequence of the first strain isolated using an ethylene-α-olefin co-oligomer.

Microbiol Resour Announc. 2025 Sep 11;14(9):e0058925. doi: 10.1128/mra.00589-25. Epub 2025 Aug 11.

Complete genomic sequences of and species isolated from surface seawater: potential polyolefin degraders and bioplastic producers.

Microbiol Resour Announc. 2025 Sep 11;14(9):e0058425. doi: 10.1128/mra.00584-25. Epub 2025 Jul 31.

Genomic insights into clinical non-O1/non-O139 Vibrio cholerae isolates in Japan.

Microbiol Spectr. 2025 Aug 5;13(8):e0017525. doi: 10.1128/spectrum.00175-25. Epub 2025 Jun 24.

Complete genome sequence of a clinical isolate harboring a novel variant of the carbapenemase gene, .

Microbiol Resour Announc. 2025 Jul 10;14(7):e0019625. doi: 10.1128/mra.00196-25. Epub 2025 Jun 18.

Draft genome sequences of six bacterial strains degrading the biodegradable plastic polyhydroxybutyrate (PHB).

Microbiol Resour Announc. 2025 May 8;14(5):e0010525. doi: 10.1128/mra.00105-25. Epub 2025 Mar 27.

sp. nov., a novel psychrotolerant species produces antimicrobial agents targeting resistant clinical isolates of .

Curr Res Microb Sci. 2025 Jan 25;8:100353. doi: 10.1016/j.crmicr.2025.100353. eCollection 2025.
