Verma Bhavish, Parkinson John
Program in Molecular Medicine, Hospital for Sick Children, Toronto, ON M5G 0A4, Canada.
Department of Molecular Genetics, University of Toronto, Toronto, ON M5S 1A8, Canada.
Bioinform Adv. 2024 Feb 1;4(1):vbae016. doi: 10.1093/bioadv/vbae016. eCollection 2024.
Whole microbiome DNA and RNA sequencing (metagenomics and metatranscriptomics) are pivotal to determining the functional roles of microbial communities. A key challenge in analyzing these complex datasets, typically composed of tens of millions of short reads, is accurately classifying reads to their taxa of origin. Although machine learning (ML) algorithms still perform worse than reference-based short-read tools at species-level classification, they have shown promising results in taxonomic classification at higher ranks. A recent approach to enhancing the performance of ML tools, which can also be applied to reference-dependent classifiers, is to integrate the hierarchical structure of taxonomy into the tool's predictive algorithm.
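To make the idea of integrating taxonomic hierarchy into prediction concrete, the following is a minimal sketch of generic top-down hierarchical classification, not HiTaxon's actual algorithm. It assumes one genus-level classifier plus an optional species-level classifier per genus, each returning a label and a confidence. The toy taxonomy, the stand-in models, the function names, and the 0.6 threshold are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, Optional, Tuple

# A prediction is (label, confidence in [0, 1]).
Prediction = Tuple[str, float]
Model = Callable[[str], Prediction]

# Hypothetical toy taxonomy mapping each species to its parent genus.
PARENT: Dict[str, str] = {
    "Escherichia coli": "Escherichia",
    "Escherichia albertii": "Escherichia",
    "Salmonella enterica": "Salmonella",
}


def classify_hierarchically(
    read: str,
    genus_model: Model,
    species_models: Dict[str, Model],
    threshold: float = 0.6,  # assumed cut-off, not taken from the paper
) -> Optional[str]:
    """Descend genus -> species, falling back to the last confident rank."""
    genus, genus_conf = genus_model(read)
    if genus_conf < threshold:
        return None  # leave the read unclassified

    species_model = species_models.get(genus)
    if species_model is None:
        return genus  # no species-level model for this genus

    species, species_conf = species_model(read)
    # Accept the species call only if it is confident and agrees with
    # the genus chosen one rank above.
    if species_conf < threshold or PARENT.get(species) != genus:
        return genus
    return species


def toy_genus_model(read: str) -> Prediction:
    # Stand-in model: uses GC fraction as a fake confidence score.
    gc = sum(base in "GC" for base in read) / len(read)
    return ("Escherichia", gc) if gc >= 0.5 else ("Salmonella", 1.0 - gc)


def toy_escherichia_model(read: str) -> Prediction:
    # Stand-in model: confident only for reads starting with "GG".
    if read.startswith("GG"):
        return ("Escherichia coli", 0.9)
    return ("Escherichia albertii", 0.3)


SPECIES_MODELS: Dict[str, Model] = {"Escherichia": toy_escherichia_model}
```

With these stand-ins, a read is assigned to a species only when both ranks are confident and consistent; otherwise the genus-level call is reported, or nothing at all when even the genus is uncertain. A real pipeline would replace the toy models with trained classifiers or reference-based tools.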
Here, we introduce HiTaxon, an end-to-end hierarchical ensemble framework for taxonomic classification. HiTaxon facilitates data collection and processing, reference database construction, and optional training of ML models to streamline ensemble creation. We show that databases created by HiTaxon improve the species-level performance of reference-dependent classifiers while reducing their computational overhead. In addition, by exploring hierarchical methods for HiTaxon, we show that our custom approach to hierarchical ensembling improves species-level classification relative to traditional strategies. Finally, we demonstrate the improved performance of our hierarchical ensembles over current state-of-the-art classifiers in species classification, using datasets composed of either simulated or experimentally derived reads.
HiTaxon is available at: https://github.com/ParkinsonLab/HiTaxon.