Ayyala Deepak Nag, Lin Shili
Department of Statistics, The Ohio State University, Columbus, OH 43210, USA.
Bioinformatics. 2015 May 15;31(10):1648-54. doi: 10.1093/bioinformatics/btv032. Epub 2015 Jan 20.
Microbiota composition has important implications for human health, including obesity and other conditions. It is therefore important to cluster samples or taxa in order to visualize and discover community substructures. Graphical representation of metagenomic count data relies on two components: the measure of dissimilarity between samples/taxa and the algorithm used to estimate coordinates for studying microbiota communities. UniFrac is a dissimilarity measure commonly used in metagenomic research, but it requires a phylogenetic tree. Principal coordinate analysis (PCoA) is a popular algorithm for estimating two-dimensional (2D) coordinates for graphical representation, although alternative and higher-dimensional representations may reveal underlying community substructures that are invisible in 2D representations.
We adapt a new measure of dissimilarity, the penalized Kendall's τ-distance, which does not depend on a phylogenetic tree and is hence more readily applicable to a wider class of problems. Further, we propose using metric multidimensional scaling (MDS) as an alternative to PCoA for graphical representation. We then devise a novel procedure for determining the number of clusters in conjunction with PAM (mPAM). We show superior performance with higher-dimensional representations. We further demonstrate the utility of mPAM for accurate clustering analysis, especially with higher-dimensional MDS models. Applications to two human microbiota datasets illustrate that a higher-dimensional analysis yields greater insight into the subcommunity structure.
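To make the idea of a tree-free, rank-based dissimilarity concrete, the following is a minimal sketch of one common tie-penalized form of the Kendall τ-distance applied to count vectors. It is an illustration under stated assumptions, not necessarily the exact definition used in the paper: the function names, the default penalty `p=0.5`, and the rule that a pair tied in only one vector contributes `p` are all assumptions made here.

```python
from itertools import combinations


def penalized_kendall_distance(x, y, p=0.5):
    """Tie-penalized Kendall tau distance between two count vectors.

    Assumed convention (illustrative, not taken from the paper):
      - a pair ordered oppositely in x and y contributes 1,
      - a pair tied in exactly one of x, y contributes p,
      - a pair tied in both, or ordered the same way, contributes 0.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    total = 0.0
    for i, j in combinations(range(len(x)), 2):
        # Sign of the pairwise comparison in each vector: -1, 0, or +1.
        sx = (x[i] > x[j]) - (x[i] < x[j])
        sy = (y[i] > y[j]) - (y[i] < y[j])
        if sx == 0 and sy == 0:
            continue              # tied in both vectors: no penalty
        if sx == 0 or sy == 0:
            total += p            # tied in exactly one vector
        elif sx != sy:
            total += 1.0          # discordant pair
    return total


def dissimilarity_matrix(samples, p=0.5):
    """Symmetric matrix of pairwise distances between samples (rows)."""
    n = len(samples)
    d = [[0.0] * n for _ in range(n)]
    for a, b in combinations(range(n), 2):
        d[a][b] = d[b][a] = penalized_kendall_distance(samples[a], samples[b], p)
    return d
```

A matrix built this way could then be passed to any MDS or PAM implementation that accepts precomputed dissimilarities. Because metagenomic count vectors typically contain many zeros, tied pairs are frequent, which is why the choice of the tie penalty matters in practice.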