Application and Comparison of Machine Learning and Database-Based Methods in Taxonomic Classification of High-Throughput Sequencing Data.

Affiliations

Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China.

Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003 China.

Publication information

Genome Biol Evol. 2024 May 2;16(5). doi: 10.1093/gbe/evae102.

DOI:10.1093/gbe/evae102

PMID:38748485

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11135637/

Abstract

The advent of high-throughput sequencing technologies has not only revolutionized the field of bioinformatics but has also heightened the demand for efficient taxonomic classification. Despite technological advancements, efficiently processing and analyzing the deluge of sequencing data for precise taxonomic classification remains a formidable challenge. Existing classification approaches primarily fall into two categories, database-based methods and machine learning methods, each presenting its own set of challenges and advantages. On this basis, the aim of our study was to conduct a comparative analysis between these two methods while also investigating the merits of integrating multiple database-based methods. Through an in-depth comparative study, we evaluated the performance of both methodological categories in taxonomic classification by utilizing simulated data sets. Our analysis revealed that database-based methods excel in classification accuracy when backed by a rich and comprehensive reference database. Conversely, while machine learning methods show superior performance in scenarios where reference sequences are sparse or lacking, they generally show inferior performance compared with database methods under most conditions. Moreover, our study confirms that integrating multiple database-based methods does, in fact, enhance classification accuracy. These findings shed new light on the taxonomic classification of high-throughput sequencing data and bear substantial implications for the future development of computational biology. For those interested in further exploring our methods, the source code of this study is publicly available on https://github.com/LoadStar822/Genome-Classifier-Performance-Evaluator. Additionally, a dedicated webpage showcasing our collected database, data sets, and various classification software can be found at http://lab.malab.cn/~tqz/project/taxonomic/.
