Microbiome Research Initiative, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA.
Department of Biostatistics, University of Washington, Seattle, Washington, USA.
Microbiome. 2019 Aug 1;7(1):110. doi: 10.1186/s40168-019-0722-6.
Whole-genome "shotgun" (WGS) metagenomic sequencing is an increasingly common tool for analyzing the metagenomic content of microbiome samples. While WGS data contain gene-level information, it can be challenging to analyze the millions of microbial genes typically found in microbiome experiments. To mitigate this ultrahigh dimensionality, it has been proposed to cluster genes by co-abundance to form Co-Abundant Gene groups (CAGs). However, exhaustive co-abundance clustering of millions of microbial genes across thousands of biological samples has previously been intractable because of the computational cost of performing trillions of pairwise comparisons.
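To make the scale of "trillions of pairwise comparisons" concrete, the short sketch below counts the unordered gene pairs that an exhaustive comparison would visit. The function name and the example gene count of 3 million are illustrative assumptions, not figures taken from the study.

```python
import math


def pairwise_comparisons(n_genes: int) -> int:
    """Number of unordered gene pairs, n * (n - 1) / 2."""
    return math.comb(n_genes, 2)


# Hypothetical catalog size chosen only to illustrate "millions of genes".
n_pairs = pairwise_comparisons(3_000_000)
print(f"{n_pairs:,} pairs")
```

Because the count grows quadratically, a tenfold increase in the number of genes raises the number of pairs roughly a hundredfold.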
Here we present a novel computational approach to the analysis of WGS datasets in which microbial gene groups are the fundamental unit of analysis. We use the Approximate Nearest Neighbor heuristic for near-exhaustive average linkage clustering to group millions of genes by co-abundance. This results in thousands of high-quality CAGs representing complete and partial microbial genomes. We applied this method to publicly available WGS microbiome surveys and found that the resulting microbial CAGs associated with inflammatory bowel disease (IBD) and colorectal cancer (CRC) were highly reproducible and could be validated across multiple independent cohorts.
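The idea of grouping genes by co-abundance with average linkage clustering can be illustrated on a toy scale. The sketch below is not the authors' implementation: it compares every pair of clusters exhaustively, which is exactly the step an Approximate Nearest Neighbor heuristic would avoid on real data. The cosine distance metric, the `max_dist` threshold of 0.2, the function names, and the four-gene abundance table are all assumptions made for illustration.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two abundance profiles."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an all-zero profile as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def average_linkage_cags(abundance, max_dist=0.2):
    """Greedy agglomerative clustering with average linkage.

    abundance: dict mapping gene ID -> list of abundances, one per sample.
    max_dist:  assumed threshold; merging stops once the closest pair of
               clusters is farther apart than this.
    """
    clusters = [[gene] for gene in abundance]

    def average_distance(c1, c2):
        total = sum(
            cosine_distance(abundance[g1], abundance[g2])
            for g1 in c1
            for g2 in c2
        )
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Exhaustive search for the closest pair of clusters (toy scale only).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > max_dist:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]

    return [sorted(c) for c in clusters]


# Hypothetical abundances of four genes across four samples.
toy = {
    "gene_a": [1, 2, 0, 4],
    "gene_b": [2, 4, 0, 8],
    "gene_c": [5, 0, 3, 0],
    "gene_d": [10, 0, 6, 1],
}
cags = average_linkage_cags(toy)
print(cags)
```

In this toy table, `gene_a` and `gene_b` rise and fall together across samples, as do `gene_c` and `gene_d`, so the sketch returns two groups. Genes from the same genome are expected to behave this way, which is why co-abundant groups can represent complete or partial genomes.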
This approach to gene-level metagenomics provides a powerful path forward for identifying the biological links between the microbiome and human health. By proposing a new computational approach for handling high-dimensional metagenomics data, we identified specific microbial gene groups associated with disease, which can be used to identify strains of interest for further preclinical and mechanistic experimentation.