Department of Statistics, University of Connecticut, Storrs, CT, USA.
Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, USA.
Sci Rep. 2024 Apr 17;14(1):8855. doi: 10.1038/s41598-024-59682-4.
Health and disease are fundamentally influenced by microbial communities and their genes (the microbiome). An in-depth analysis of microbiome structure that enables the classification of individuals based on their health can be crucial in enhancing diagnostics and treatment strategies to improve an individual's overall well-being. In this paper, we present a novel semi-supervised methodology, Randomized Feature Selection based Latent Dirichlet Allocation (RFSLDA), to study the impact of the gut microbiome on a subject's health status. Since the data in our study carry fuzzy, self-reported health labels, traditional supervised learning approaches may not be suitable. As a first step, based on the similarity between documents in text analysis and gut-microbiome data, we employ Latent Dirichlet Allocation (LDA), a topic modeling approach that uses microbiome counts as features to group subjects into relatively homogeneous clusters without invoking any knowledge of the subjects' observed health status (labels). We then leverage information from the observed health status of subjects to associate each cluster with the most similar health status, which makes the approach semi-supervised. Finally, a feature selection technique is incorporated into the model to improve the overall classification performance. The proposed method provides a semi-supervised topic modeling approach that can help handle the high dimensionality of microbiome data in association studies. Our experiments reveal that our semi-supervised classification algorithm is effective and efficient, achieving high classification accuracy compared with popular supervised learning approaches such as SVM and the multinomial logistic model.
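The step that associates clusters with health statuses can be illustrated with a minimal sketch. It assumes that "most similar health status" is approximated by a majority vote over the observed labels within each cluster; the abstract does not specify the similarity rule, so this is an assumption, as are the function names and the toy cluster assignments and labels (`status_A`, `status_B`, `status_C`), which are hypothetical. The cluster assignments are taken as given, standing in for the output of the unsupervised LDA step.

```python
from collections import Counter, defaultdict


def map_clusters_to_labels(cluster_ids, observed_labels):
    """Assign each cluster the most frequent observed label among its members.

    Assumption: majority vote stands in for the "most similar health status"
    rule, which the abstract does not define.
    """
    votes = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, observed_labels):
        votes[cluster][label] += 1
    # most_common(1) returns the label with the highest count in the cluster.
    return {cluster: counts.most_common(1)[0][0] for cluster, counts in votes.items()}


def predict_labels(cluster_ids, cluster_to_label):
    """Give every subject the label attached to its cluster."""
    return [cluster_to_label[c] for c in cluster_ids]


def agreement(predicted, observed):
    """Fraction of subjects whose predicted label matches the observed one."""
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


# Hypothetical toy data: cluster assignments (as an unsupervised step might
# produce) and fuzzy self-reported labels for nine subjects.
cluster_ids = [0, 0, 0, 1, 1, 1, 1, 2, 2]
observed = [
    "status_A", "status_A", "status_B",
    "status_B", "status_B", "status_A", "status_B",
    "status_C", "status_C",
]

mapping = map_clusters_to_labels(cluster_ids, observed)
predicted = predict_labels(cluster_ids, mapping)
print(mapping)
print(agreement(predicted, observed))
```

Because the labels enter only after clustering, a mislabeled subject shifts the vote within its cluster but does not change which subjects are grouped together.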
The RFSLDA framework is attractive because it (i) enhances clustering accuracy by identifying key bacteria types as indicators of health status, (ii) identifies key bacteria types within each group based on estimates of the proportion of bacteria types within the groups, and (iii) computes a measure of within-group similarity to identify highly similar subjects in terms of their health status.
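Point (ii) above can be sketched as a simple ranking, assuming that "key" bacteria types are those with the largest estimated proportions within a group. The cutoff `k`, the function name `top_taxa`, and the toy proportions and taxon names are hypothetical and are not taken from the paper.

```python
def top_taxa(proportions, taxa, k=2):
    """Return the k taxa with the largest estimated proportion in each group.

    proportions maps a group name to a list of per-taxon proportions,
    given in the same order as the taxa list.
    """
    result = {}
    for group, row in proportions.items():
        ranked = sorted(zip(taxa, row), key=lambda pair: pair[1], reverse=True)
        result[group] = [name for name, _ in ranked[:k]]
    return result


# Hypothetical estimated within-group proportions (each row sums to 1).
taxa = ["taxon_1", "taxon_2", "taxon_3", "taxon_4"]
proportions = {
    "group_0": [0.50, 0.30, 0.15, 0.05],
    "group_1": [0.10, 0.05, 0.25, 0.60],
}

key_taxa = top_taxa(proportions, taxa, k=2)
print(key_taxa)
```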