Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada.
Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada.
PLoS Comput Biol. 2021 Mar 12;17(3):e1008799. doi: 10.1371/journal.pcbi.1008799. eCollection 2021 Mar.
Recent advances in next-generation sequencing have enabled comprehensive research on the microbiome and human disease, with recent studies identifying associations between the human microbiome and health outcomes for a number of chronic conditions. However, the structure of microbiome data, characterized by sparsity and skewness, presents challenges to building effective classifiers. To address this, we present an innovative approach for distance-based classification using mixture distributions (DCMD). The method aims to improve classification performance on microbiome community data, where the predictors are sparse and heterogeneous counts. It models the inherent uncertainty in sparse counts by estimating a mixture distribution for the sample data and representing each observation as a distribution, conditional on the observed counts and the estimated mixture; these distributions are then used as inputs for distance-based classification. The method is implemented within k-means and k-nearest neighbours classification frameworks. We develop two distance metrics that produce optimal results. The performance of the model is assessed using simulated and human microbiome study data, with results compared against a number of existing machine learning and distance-based classification approaches. The proposed method is competitive with the other machine learning approaches and shows a clear improvement over commonly used distance-based classifiers, underscoring the importance of modelling sparsity for achieving optimal results. Its range of applicability and robustness make the proposed method a viable alternative for classification using sparse microbiome count data. The source code is available at https://github.com/kshestop/DCMD for academic use.
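The general idea described above (fit a mixture to each count feature, replace each observed count with a distribution conditional on that count and the fitted mixture, then classify by distance between those distributions) can be illustrated with a minimal sketch. This is not the authors' implementation: the Poisson components, the EM fit, the Euclidean distance between posterior vectors, and all function names and toy data below are assumptions chosen for brevity. The abstract does not specify the mixture family or the two distance metrics, so refer to the linked repository for the actual method.

```python
import math
from collections import Counter


def poisson_pmf(count, rate):
    # Computed in log space for numerical stability.
    return math.exp(count * math.log(rate) - rate - math.lgamma(count + 1))


def posterior(count, weights, rates):
    """Posterior probability of each mixture component given one observed count."""
    joint = [w * poisson_pmf(count, r) for w, r in zip(weights, rates)]
    total = sum(joint)
    if total == 0.0:  # every component underflowed; fall back to the prior weights
        return list(weights)
    return [p / total for p in joint]


def fit_poisson_mixture(counts, n_components=2, n_iter=100):
    """EM for a Poisson mixture on one feature (assumed model, not the paper's)."""
    lo, hi = min(counts), max(counts)
    # Spread the initial rates over the observed range and keep them positive.
    rates = [max(lo + (hi - lo) * (j + 0.5) / n_components, 1e-3)
             for j in range(n_components)]
    weights = [1.0 / n_components] * n_components
    for _ in range(n_iter):
        resp = [posterior(c, weights, rates) for c in counts]  # E-step
        for j in range(n_components):                          # M-step
            total = sum(r[j] for r in resp)
            weights[j] = total / len(counts)
            if total > 0:
                weighted = sum(r[j] * c for r, c in zip(resp, counts))
                rates[j] = max(weighted / total, 1e-3)
    return weights, rates


def represent(sample, models):
    """Replace each observed count with its posterior over mixture components."""
    return [posterior(c, w, r) for c, (w, r) in zip(sample, models)]


def distance(rep_a, rep_b):
    # Euclidean distance over all per-feature posterior vectors (an assumption).
    return math.sqrt(sum((p - q) ** 2
                         for fa, fb in zip(rep_a, rep_b)
                         for p, q in zip(fa, fb)))


def knn_predict(train_reps, train_labels, query_rep, k=3):
    """Majority vote among the k training samples nearest to the query."""
    order = sorted(range(len(train_reps)),
                   key=lambda i: distance(train_reps[i], query_rep))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: rows are samples, columns are two taxa counts.
X = [[0, 9], [1, 12], [0, 10], [8, 0], [11, 1], [10, 0]]
labels = ["a", "a", "a", "b", "b", "b"]

models = [fit_poisson_mixture([row[j] for row in X]) for j in range(2)]
train_reps = [represent(row, models) for row in X]
prediction = knn_predict(train_reps, labels, represent([0, 11], models))
```

In this sketch a zero count is not treated as a hard zero: it becomes a probability vector over a low-abundance and a high-abundance component, so two samples with counts 0 and 1 end up close together. That is one simple way to carry the uncertainty of sparse counts into the distance computation.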