Department of Microbiology and Plant Biology, Institute for Environmental Genomics, University of Oklahoma, Norman, OK 73072, USA, Earth Science Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA and State Key Joint Laboratory of Environmental Simulation and Pollution Control, School of Environment, Tsinghua University, Beijing 100084, China.
Nucleic Acids Res. 2014 Apr;42(8):e67. doi: 10.1093/nar/gku138. Epub 2014 Feb 12.
Shotgun metagenome sequencing has become a fast, cheap and high-throughput technology for characterizing microbial communities in complex environments and human body sites. However, accurate identification of microorganisms at the strain/species level remains extremely challenging. We present a novel k-mer-based approach, termed GSMer, that identifies genome-specific markers (GSMs) from currently sequenced microbial genomes, which are then used for strain/species-level identification in metagenomes. Using 5390 sequenced microbial genomes, 8,770,321 strain-specific and 11,736,360 species-specific 50-mer GSMs were identified for 4088 strains and 2005 species (4933 strains), respectively. The GSMs were first evaluated against mock community metagenomes, recently sequenced genomes and real metagenomes from different body sites, and the results suggested that the identified GSMs were specific to their target genomes. Sensitivity evaluation against synthetic metagenomes with different coverage suggested that 50 GSMs per strain were sufficient to identify most microbial strains with ≥0.25× coverage, and that 10% of the selected GSMs in a database should be detected for a confident positive call. Application of the GSMs to gastrointestinal tract metagenomes identified 45 microbial strains/species significantly associated with type 2 diabetes patients and 74 significantly associated with obese/lean individuals. Our results agreed with previous studies but provided strain-level information. The approach can be applied directly to raw metagenomes to identify microbial strains/species, without complex data pre-processing.
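The core idea of the abstract, k-mers that occur in exactly one reference genome and a minimum fraction of them that must be detected before a strain is called, can be illustrated with a minimal Python sketch. This is not the GSMer implementation: the function names (`kmers`, `find_gsms`, `call_strains`), the toy sequences and the choice of k = 4 are assumptions made for illustration, and the sketch ignores reverse complements, sequencing errors and any additional filtering the actual method may apply. Only the 10% detection threshold is taken from the abstract.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every k-mer of seq (forward strand only, for simplicity)."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def find_gsms(genomes, k):
    """Map each genome name to the set of k-mers found in no other genome."""
    owners = defaultdict(set)  # k-mer -> names of genomes containing it
    for name, seq in genomes.items():
        for kmer in kmers(seq, k):
            owners[kmer].add(name)
    gsms = {name: set() for name in genomes}
    for kmer, names in owners.items():
        if len(names) == 1:  # genome-specific: exactly one owner
            gsms[next(iter(names))].add(kmer)
    return gsms


def call_strains(reads, gsms, k, min_fraction=0.10):
    """Report genomes whose detected GSM fraction reaches min_fraction."""
    seen = set()
    for read in reads:
        seen.update(kmers(read, k))
    calls = {}
    for name, markers in gsms.items():
        if not markers:
            continue  # no markers available, so no call is possible
        fraction = len(markers & seen) / len(markers)
        if fraction >= min_fraction:
            calls[name] = fraction
    return calls


# Toy data (hypothetical): two short "genomes" sharing the 4-mer ACGT.
K = 4
genomes = {"strainA": "ACGTACGTTT", "strainB": "ACGTGGGCCC"}
gsms = find_gsms(genomes, K)
calls = call_strains(["CGTACG"], gsms, K)
print(calls)
```

In this toy case the shared 4-mer `ACGT` is excluded from both marker sets, and the single read covers three of the five `strainA` markers, so only `strainA` is called. At the scale described in the abstract (50-mers across thousands of genomes) an in-memory dictionary like this would not be practical; the sketch is meant only to make the marker definition and the detection threshold concrete.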