IEEE Trans Nanobioscience. 2023 Oct;22(4):763-770. doi: 10.1109/TNB.2023.3283462. Epub 2023 Oct 3.
Metagenomics is an unobtrusive science that links microbial genes to biological functions or environmental states. Classifying microbial genes into their functional repertoire is an important task in the downstream analysis of metagenomic studies. The task typically relies on supervised Machine Learning (ML) methods to achieve good classification performance. Random Forest (RF) has been rigorously applied to microbial gene abundance profiles, mapping them to functional phenotypes. The current research tunes RF using the evolutionary ancestry encoded in microbial phylogeny, developing a Phylogeny-RF model for functional classification of metagenomes. This method captures the effects of phylogenetic relatedness within the ML classifier itself, rather than just applying a supervised classifier to the raw abundances of microbial genes. The idea is rooted in the fact that phylogenetically close microbes are highly correlated and tend to have similar genetic and phenotypic traits. Because such microbes behave similarly, they tend to be selected together, or one of them can be dropped from the analysis, to improve the ML process. The proposed Phylogeny-RF algorithm was compared with state-of-the-art classification methods, including RF and the phylogeny-aware methods MetaPhyl and PhILR, on three real-world 16S rRNA metagenomic datasets. The proposed method not only performed significantly better than the traditional RF model but also outperformed the other phylogeny-driven benchmarks (p < 0.05). For example, on soil microbiomes Phylogeny-RF attained the highest AUC (0.949) and Kappa (0.891) among the compared methods.
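The observation that closely related microbes are redundant, so that one of them could be dropped, can be illustrated with a minimal sketch. This is not the Phylogeny-RF algorithm, whose details are not given in the abstract; it only shows one plausible way to thin out phylogenetically close features before training a classifier such as RF. The function name `select_representatives`, the distance threshold, the variance-based choice of representative, and the toy taxa and values are all assumptions made for illustration.

```python
from statistics import pvariance


def select_representatives(abundance, phylo_dist, threshold):
    """Keep one taxon per group of phylogenetically close taxa.

    abundance:  dict mapping taxon -> list of abundances across samples
    phylo_dist: dict mapping frozenset({a, b}) -> tree distance between a and b
    threshold:  taxa closer than this are treated as redundant (assumed cutoff)
    """
    # Visit taxa from most to least variable, so the taxon kept for each
    # group is the most variable one (an assumed, illustrative criterion).
    ranked = sorted(abundance, key=lambda t: pvariance(abundance[t]), reverse=True)
    kept = []
    for taxon in ranked:
        # Drop the taxon if it sits close on the tree to one already kept.
        close_to_kept = any(
            phylo_dist[frozenset((taxon, k))] < threshold for k in kept
        )
        if not close_to_kept:
            kept.append(taxon)
    return kept


# Toy data (hypothetical): taxonA and taxonB are near neighbours on the tree
# and have almost identical abundance profiles; taxonC is distant from both.
abundance = {
    "taxonA": [0.10, 0.40, 0.05, 0.30],
    "taxonB": [0.12, 0.38, 0.06, 0.28],
    "taxonC": [0.50, 0.10, 0.60, 0.20],
}
phylo_dist = {
    frozenset(("taxonA", "taxonB")): 0.02,
    frozenset(("taxonA", "taxonC")): 0.90,
    frozenset(("taxonB", "taxonC")): 0.88,
}

selected = select_representatives(abundance, phylo_dist, threshold=0.1)
print(selected)
```

With these toy values, taxonB is dropped because it lies within the threshold of taxonA, which is kept. The retained columns could then be passed to any supervised classifier; how Phylogeny-RF itself incorporates the tree is described in the full paper rather than here.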