Chen Guanhua, Wang Xinyue, Sun Qiang, Tang Zheng-Zheng
Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI 53726, United States.
Department of Statistics, Pennsylvania State University, University Park, PA 16802, United States.
Bioinformatics. 2025 Feb 4;41(2). doi: 10.1093/bioinformatics/btaf042.
Clustering patients into subgroups based on their microbial compositions can greatly enhance our understanding of the role of microbes in human health and disease etiology. Distance-based clustering methods, such as partitioning around medoids (PAM), are popular due to their computational efficiency and absence of distributional assumptions. However, the performance of these methods can be suboptimal when true cluster memberships are driven by differences in the abundance of only a few microbes, a situation known as the sparse signal scenario.
We show that classical multidimensional scaling (MDS), a widely used dimensionality reduction technique, effectively denoises microbiome data and improves the clustering performance of distance-based methods. We propose a two-step procedure: first apply MDS to project high-dimensional microbiome data into a low-dimensional space, then run distance-based clustering on the low-dimensional data. In extensive simulations, this procedure outperforms distance-based clustering applied directly to the original data under the sparse signal scenario. Several real data applications further illustrate its advantage.
The R package MDSMClust is available at https://github.com/wxy929/MDS-project.
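The two-step procedure described in the abstract (classical MDS, then distance-based clustering on the low-dimensional coordinates) can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the MDSMClust implementation or its API: the function names `bray_curtis`, `classical_mds`, `k_medoids`, and `mds_then_cluster` are hypothetical, Bray-Curtis is only one possible choice of microbiome distance, the number of retained MDS dimensions is left to the caller, and the clustering step uses a simplified alternating k-medoids rather than the full PAM BUILD/SWAP algorithm.

```python
import numpy as np


def bray_curtis(X):
    """Pairwise Bray-Curtis dissimilarity between rows of a nonnegative matrix.

    Assumption: no two rows are both all-zero (otherwise the denominator is 0).
    """
    num = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)
    den = (X[:, None, :] + X[None, :, :]).sum(axis=2)
    return num / den


def classical_mds(D, n_components=2):
    """Classical (Torgerson) MDS from an n x n distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    lam = np.clip(eigvals[order], 0.0, None)  # drop negative eigenvalues
    return eigvecs[:, order] * np.sqrt(lam)


def k_medoids(D, k, n_iter=100, seed=0):
    """Simplified alternating k-medoids on a distance matrix (not full PAM)."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        # Assign every sample to its nearest medoid.
        labels = np.argmin(D[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size:
                # New medoid: member with the smallest total within-cluster distance.
                within = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(D[:, medoids], axis=1)
    return labels, medoids


def mds_then_cluster(D, k, n_components=2, seed=0):
    """Step 1: embed with classical MDS. Step 2: cluster in the embedded space."""
    Y = classical_mds(D, n_components)
    D_low = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)
    return k_medoids(D_low, k, seed=seed)
```

Given a samples-by-taxa abundance matrix `X`, a call such as `mds_then_cluster(bray_curtis(X), k=2)` returns a cluster label for each sample together with the indices of the medoid samples. Direct clustering, the baseline the abstract compares against, corresponds to calling `k_medoids(bray_curtis(X), k=2)` without the MDS step.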