Biology Department, San Diego State University, San Diego, California, United States of America.
PLoS Comput Biol. 2009 Dec;5(12):e1000593. doi: 10.1371/journal.pcbi.1000593. Epub 2009 Dec 11.
Metagenomic studies characterize both the composition and diversity of uncultured viral and microbial communities. BLAST-based comparisons have typically been used for such analyses; however, sampling biases, high percentages of unknown sequences, and the use of arbitrary thresholds to find significant similarities can decrease the accuracy and validity of estimates. Here, we present Genome relative Abundance and Average Size (GAAS), a complete software package that provides improved estimates of community composition and average genome length for metagenomes in both textual and graphical formats. GAAS implements a novel methodology to control for sampling bias via length normalization, to adjust for multiple BLAST similarities by similarity weighting, and to select significant similarities using relative alignment lengths. In benchmark tests, the GAAS method was robust to both high percentages of unknown sequences and to variations in metagenomic sequence read lengths. Re-analysis of the Sargasso Sea virome using GAAS indicated that standard methodologies for metagenomic analysis may dramatically underestimate the abundance and importance of organisms with small genomes in environmental systems. Using GAAS, we conducted a meta-analysis of microbial and viral average genome lengths in over 150 metagenomes from four biomes to determine whether genome lengths vary consistently between and within biomes, and between microbial and viral communities from the same environment. Significant differences between biomes and within aquatic sub-biomes (oceans, hypersaline systems, freshwater, and microbialites) suggested that average genome length is a fundamental property of environments driven by factors at the sub-biome level. The behavior of paired viral and microbial metagenomes from the same environment indicated that microbial and viral average genome sizes are independent of each other, but indicative of community responses to stressors and environmental conditions.
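The three corrections named above (selection by relative alignment length, similarity weighting, and length normalization) can be illustrated with a minimal sketch. This is not the GAAS implementation, and the abstract does not give its formulas. The function name `estimate_composition`, the hit tuple layout, the use of bit scores as weights, and the `min_rel_aln` threshold of 0.6 are all assumptions made for illustration.

```python
from collections import defaultdict


def estimate_composition(hits, genome_lengths, min_rel_aln=0.6):
    """Illustrative sketch only; not the published GAAS code.

    hits: iterable of (read_id, genome_id, bit_score, aln_len, read_len).
    genome_lengths: dict mapping genome_id to genome length in bp.
    min_rel_aln: assumed cutoff on alignment length / read length.
    Returns (relative_abundance_by_genome, average_genome_length).
    """
    # Step 1: keep only hits whose alignment covers enough of the read
    # (selection by relative alignment length).
    per_read = defaultdict(list)
    for read_id, genome_id, score, aln_len, read_len in hits:
        if read_len > 0 and aln_len / read_len >= min_rel_aln:
            per_read[read_id].append((genome_id, score))

    # Step 2: each read contributes one unit of evidence, split across
    # its hits in proportion to score (assumed form of similarity weighting).
    weight = defaultdict(float)
    for read_hits in per_read.values():
        total = sum(score for _, score in read_hits)
        if total <= 0:
            continue
        for genome_id, score in read_hits:
            weight[genome_id] += score / total

    if not weight:
        return {}, 0.0

    # Step 3: divide by genome length, because a longer genome yields
    # more reads per genome copy (length normalization), then rescale.
    per_copy = {g: w / genome_lengths[g] for g, w in weight.items()}
    norm = sum(per_copy.values())
    abundance = {g: v / norm for g, v in per_copy.items()}

    # Average genome length is the abundance-weighted mean length.
    avg_len = sum(abundance[g] * genome_lengths[g] for g in abundance)
    return abundance, avg_len


# Hypothetical toy input: 1 read from a 5 kb genome, 10 reads from a 50 kb genome.
toy_lengths = {"small": 5000, "large": 50000}
toy_hits = [("r0", "small", 100.0, 90, 100)]
toy_hits += [(f"r{i}", "large", 100.0, 90, 100) for i in range(1, 11)]
toy_abundance, toy_avg_len = estimate_composition(toy_hits, toy_lengths)
```

In this toy input, raw read counts would put the small genome at 1 of 11 reads, whereas the length-normalized estimate assigns both genomes equal relative abundance (0.5 each) and an average genome length of 27,500 bp. That is the direction of the effect the abstract reports for organisms with small genomes.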