Department of Mathematics and Statistics, P.O. Box 68 (Gustaf Hällströmin katu 2b), University of Helsinki, 00014 Helsinki, Finland.
Nucleic Acids Res. 2012 Jul;40(12):5240-9. doi: 10.1093/nar/gks227. Epub 2012 Mar 9.
Estimating the composition of a bacterial community from a mixed sample is an important task for microbiologists in many applied contexts. Community composition is commonly estimated by clustering polymerase chain reaction (PCR)-amplified 16S rRNA gene sequences. Current taxonomy-independent clustering methods for analyzing these sequences, such as UCLUST, ESPRIT-Tree and CROP, have two limitations: (i) they require expert knowledge, because a difference cutoff between species must be specified; and (ii) they cannot separate closely related species. The first limitation burdens the user, who must spend considerable effort selecting appropriate parameters, whereas the second leads to an inaccurate description of the underlying bacterial community composition. We propose a probabilistic model-based method for estimating bacterial community composition that addresses both limitations. The method requires very little expert knowledge: only the maximum possible number of clusters needs to be specified. In two experiments, it also separates closely related species despite sequencing errors and individual variation.
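The idea of clustering with only an upper bound on the number of clusters, rather than a distance cutoff, can be illustrated with a small sketch. This is not the authors' algorithm: it is a generic greedy search over partitions scored by a Dirichlet-multinomial marginal likelihood, and the names `ALPHA`, `site_log_ml`, `partition_log_ml` and `greedy_cluster`, the prior weight, and the toy reads are all assumptions made for illustration. It also assumes the reads are already aligned, have equal length, and contain only A, C, G and T.

```python
import math
from collections import Counter

ALPHABET = "ACGT"
ALPHA = 0.25  # assumed symmetric Dirichlet prior weight per base (illustrative)


def site_log_ml(counts):
    """Log marginal likelihood of one alignment column within one cluster."""
    n = sum(counts.values())
    prior_total = ALPHA * len(ALPHABET)
    total = math.lgamma(prior_total) - math.lgamma(prior_total + n)
    for base in ALPHABET:
        total += math.lgamma(ALPHA + counts.get(base, 0)) - math.lgamma(ALPHA)
    return total


def partition_log_ml(seqs, labels):
    """Sum the column scores over all non-empty clusters of a partition."""
    clusters = {}
    for seq, lab in zip(seqs, labels):
        clusters.setdefault(lab, []).append(seq)
    score = 0.0
    for members in clusters.values():
        for column in zip(*members):
            score += site_log_ml(Counter(column))
    return score


def greedy_cluster(seqs, k_max, max_sweeps=50):
    """Greedy search: reassign one read at a time to its best-scoring cluster.

    Only k_max (an upper bound) is supplied; clusters that end up empty
    simply drop out, so the number of clusters is chosen by the score.
    """
    labels = [i % k_max for i in range(len(seqs))]  # arbitrary starting partition
    best = partition_log_ml(seqs, labels)
    for _ in range(max_sweeps):
        improved = False
        for i in range(len(seqs)):
            for k in range(k_max):
                if k == labels[i]:
                    continue
                old = labels[i]
                labels[i] = k
                score = partition_log_ml(seqs, labels)
                if score > best + 1e-9:
                    best, improved = score, True  # keep the better assignment
                else:
                    labels[i] = old  # revert
        if not improved:
            break
    return labels, best


# Hypothetical toy reads: two groups plus one read carrying a single error.
reads = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGAAC",
         "TTGTACCTAG", "TTGTACCTAG", "TTGTACCTAG"]
labels, score = greedy_cluster(reads, k_max=4)
print(len(set(labels)), "clusters used; log marginal likelihood", round(score, 2))
```

Under this score, two reads that agree at a site are rewarded for sharing a cluster and two reads that disagree are penalized, so no explicit difference cutoff is needed. A greedy search of this kind can stop at a local optimum; a real analysis would need a more thorough search and a model of sequencing errors.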