Application and Comparison of Machine Learning and Database-Based Methods in Taxonomic Classification of High-Throughput Sequencing Data.

Affiliations

Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China.

Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003, China.

Publication information

Genome Biol Evol. 2024 May 2;16(5). doi: 10.1093/gbe/evae102.

DOI:10.1093/gbe/evae102
PMID:38748485
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11135637/
Abstract

The advent of high-throughput sequencing technologies has not only revolutionized the field of bioinformatics but has also heightened the demand for efficient taxonomic classification. Despite technological advancements, efficiently processing and analyzing the deluge of sequencing data for precise taxonomic classification remains a formidable challenge. Existing classification approaches primarily fall into two categories, database-based methods and machine learning methods, each presenting its own set of challenges and advantages. On this basis, the aim of our study was to conduct a comparative analysis between these two methods while also investigating the merits of integrating multiple database-based methods. Through an in-depth comparative study, we evaluated the performance of both methodological categories in taxonomic classification by utilizing simulated data sets. Our analysis revealed that database-based methods excel in classification accuracy when backed by a rich and comprehensive reference database. Conversely, while machine learning methods show superior performance in scenarios where reference sequences are sparse or lacking, they generally show inferior performance compared with database methods under most conditions. Moreover, our study confirms that integrating multiple database-based methods does, in fact, enhance classification accuracy. These findings shed new light on the taxonomic classification of high-throughput sequencing data and bear substantial implications for the future development of computational biology. For those interested in further exploring our methods, the source code of this study is publicly available on https://github.com/LoadStar822/Genome-Classifier-Performance-Evaluator. Additionally, a dedicated webpage showcasing our collected database, data sets, and various classification software can be found at http://lab.malab.cn/~tqz/project/taxonomic/.
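
The abstract reports that integrating multiple database-based methods improves classification accuracy, but it does not say how the outputs are combined. As a rough illustration only, the sketch below assumes a simple per-read majority vote across tools and scores it against known labels, as one would with simulated data. The function names (`consensus_label`, `accuracy`), the tool names, and the toy labels are all hypothetical and are not taken from the study or its repository.

```python
from collections import Counter

UNCLASSIFIED = "unclassified"


def consensus_label(predictions, unclassified=UNCLASSIFIED):
    """Majority vote over the labels several tools assigned to one read.

    Assumption: this voting rule is illustrative; the paper's actual
    integration scheme is not described in the abstract.
    """
    # Ignore tools that could not place the read at all.
    votes = Counter(p for p in predictions if p != unclassified)
    if not votes:
        return unclassified
    ranked = votes.most_common(2)
    # A tie between the two leading labels is treated as no consensus.
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return unclassified
    return ranked[0][0]


def accuracy(predicted, truth):
    """Fraction of reads whose predicted label equals the true label."""
    correct = sum(p == t for p, t in zip(predicted, truth))
    return correct / len(truth)


# Toy labels for four simulated reads (made up for illustration,
# not results from the study).
truth = ["E. coli", "B. subtilis", "S. aureus", "E. coli"]
tool_a = ["E. coli", "B. subtilis", UNCLASSIFIED, "S. aureus"]
tool_b = ["E. coli", "S. aureus", "S. aureus", "E. coli"]
tool_c = ["B. subtilis", "B. subtilis", "S. aureus", "E. coli"]

# Combine the three tools read by read.
merged = [consensus_label(votes) for votes in zip(tool_a, tool_b, tool_c)]

for name, preds in [("tool_a", tool_a), ("tool_b", tool_b),
                    ("tool_c", tool_c), ("consensus", merged)]:
    print(f"{name}: accuracy = {accuracy(preds, truth):.2f}")
```

In this toy case each tool errs on a different read, so the vote recovers every label. A vote cannot help when the tools share the same error, for example when a species is missing from every reference database.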

