MERIT Theme: Biomedical Engineering, Department of Mechanical Engineering, Melbourne School of Engineering, The University of Melbourne, VIC 3010, Australia.
Nucleic Acids Res. 2012 Mar;40(5):e34. doi: 10.1093/nar/gkr1204. Epub 2011 Dec 17.
One approach to inferring the unknown microbial population structure within a metagenome is binning: clustering nucleotide sequences based on common patterns in base composition. When functional roles are assigned to the identified populations, a deeper understanding of microbial communities can be attained than with gene-centric approaches that explore overall functionality. In this study, we propose an unsupervised, model-based binning method with two clustering tiers. The first tier uses a novel transformation of the oligonucleotide frequency-derived error gradient and GC content to generate coarse groups; the second tier uses tetranucleotide frequency to refine these groups. The proposed method outperformed PhyloPythia, S-GSOM, TACOA and TaxSOM on all three benchmarks used for evaluation in this study. The method was then applied to a pyrosequenced metagenomic library of mud volcano sediment sampled in southwestern Taiwan, and the inferred population structure was validated against complementary sequencing of 16S ribosomal RNA marker genes. Finally, the method was further validated against four publicly available metagenomes, including a highly complex Antarctic whale-fall bone sample that was previously assumed to be too complex for binning prior to functional analysis.
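As an illustration of two of the composition features named above, the following minimal Python sketch computes GC content and a normalized tetranucleotide frequency vector for a single sequence. It is not the authors' implementation: the function names `gc_content` and `tetranucleotide_frequencies` are hypothetical, the handling of ambiguous bases (ignored) is an assumption, and the error gradient transformation and the clustering tiers are not reproduced because the abstract does not specify them.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
# All 4^4 = 256 tetranucleotides in a fixed lexicographic order.
TETRAMERS = ["".join(p) for p in product(BASES, repeat=4)]


def gc_content(seq):
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    counts = Counter(seq.upper())
    total = sum(counts[b] for b in BASES)
    if total == 0:
        return 0.0  # assumption: empty or all-ambiguous input maps to 0.0
    return (counts["G"] + counts["C"]) / total


def tetranucleotide_frequencies(seq):
    """Relative frequency of each tetranucleotide, ordered as TETRAMERS.

    Overlapping windows are counted; windows containing characters
    other than A, C, G, T are ignored (an assumption of this sketch).
    """
    seq = seq.upper()
    windows = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    valid = [windows.get(k, 0) for k in TETRAMERS]
    total = sum(valid)
    if total == 0:
        return [0.0] * len(TETRAMERS)
    return [c / total for c in valid]


if __name__ == "__main__":
    example = "ACGTACGTGGCC"  # toy sequence, not real data
    print(gc_content(example))
    print(len(tetranucleotide_frequencies(example)))
```

In a binning workflow, vectors like these would be computed per contig or read and passed to a clustering step; which clustering model to use is left open here.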