Department of Computer Science, University of Colorado, Boulder, CO, USA.
FEMS Microbiol Rev. 2011 Mar;35(2):343-59. doi: 10.1111/j.1574-6976.2010.00251.x. Epub 2010 Oct 7.
Recent advances in DNA sequencing technology have allowed the collection of high-dimensional data from human-associated microbial communities on an unprecedented scale. A major goal of these studies is to identify important groups of microorganisms that vary according to physiological or disease states in the host, but the incidence of rare taxa and the large number of taxa observed make that goal difficult to achieve using traditional approaches. Fortunately, the machine learning community has addressed similar problems in other fields of study, such as microarray analysis and text classification. In this review, we demonstrate that several existing supervised classifiers can be applied effectively to microbiota classification, both for selecting subsets of taxa that are highly discriminative of the type of community and for building models that can accurately classify unlabeled data. To encourage the development of new approaches to supervised classification of microbiota, we discuss several structures inherent in microbial community data that novel approaches may be able to exploit, and we include as supplemental information several benchmark classification tasks for use by the community.
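The two uses of supervised classification named in the abstract, selecting discriminative taxa and classifying unlabeled samples, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the data are synthetic, the taxon indices and function names are hypothetical, and the scoring rule and nearest-centroid classifier are simple stand-ins rather than the specific methods the review evaluates.

```python
import random
import statistics


def simulate_samples(n_per_class, n_taxa, informative, rng):
    """Simulate sparse taxon counts for two hypothetical host states (0 and 1)."""
    samples, labels = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            counts = []
            for taxon in range(n_taxa):
                # Most taxa are rare: present in roughly 30% of samples.
                count = rng.randint(1, 20) if rng.random() < 0.3 else 0
                if taxon in informative and label == 1:
                    count += rng.randint(80, 120)  # enriched in state 1
                counts.append(count)
            samples.append(counts)
            labels.append(label)
    return samples, labels


def to_relative_abundance(counts):
    """Normalize raw counts so each sample sums to 1."""
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


def score_taxa(X, y):
    """Score each taxon: |difference in class means| / pooled standard deviation."""
    scores = []
    for taxon in range(len(X[0])):
        a = [row[taxon] for row, lab in zip(X, y) if lab == 0]
        b = [row[taxon] for row, lab in zip(X, y) if lab == 1]
        spread = statistics.pstdev(a + b) or 1e-12  # guard against zero spread
        scores.append(abs(statistics.mean(a) - statistics.mean(b)) / spread)
    return scores


def fit_centroids(X, y, taxa):
    """Compute the mean abundance profile of each class over the selected taxa."""
    centroids = {}
    for lab in set(y):
        rows = [[row[t] for t in taxa] for row, l in zip(X, y) if l == lab]
        centroids[lab] = [statistics.mean(col) for col in zip(*rows)]
    return centroids


def predict(centroids, row, taxa):
    """Assign the label of the closest centroid (squared Euclidean distance)."""
    x = [row[t] for t in taxa]
    return min(
        centroids,
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])),
    )


rng = random.Random(0)
informative = {3, 17, 42}  # hypothetical discriminative taxa
raw, labels = simulate_samples(40, 200, informative, rng)
X = [to_relative_abundance(row) for row in raw]

# Hold out 30% of the samples; taxa are selected on the training split only.
order = list(range(len(X)))
rng.shuffle(order)
cut = int(0.7 * len(order))
train, test = order[:cut], order[cut:]
X_train, y_train = [X[i] for i in train], [labels[i] for i in train]

scores = score_taxa(X_train, y_train)
selected = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:3]

centroids = fit_centroids(X_train, y_train, selected)
correct = sum(predict(centroids, X[i], selected) == labels[i] for i in test)
accuracy = correct / len(test)

print("selected taxa:", sorted(selected))
print("held-out accuracy:", accuracy)
```

Selecting taxa on the training split alone keeps the held-out accuracy an honest estimate; ranking taxa on all samples before splitting would leak label information into the evaluation.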