Center for Health Informatics and Bioinformatics, New York University Langone Medical Center, 227 East 30th Street, New York, NY, USA.
Microbiome. 2013 Apr 5;1(1):11. doi: 10.1186/2049-2618-1-11.
Recent advances in next-generation DNA sequencing enable rapid high-throughput quantitation of microbial community composition in human samples, opening up a new field of microbiomics. One of the promises of this field is linking abundances of microbial taxa to phenotypic and physiological states, which can inform development of new diagnostic, personalized medicine, and forensic modalities. Prior research has demonstrated the feasibility of applying machine learning methods to perform body site and subject classification with microbiomic data. However, it is currently unknown which classifiers perform best among the many available alternatives for classification with microbiomic data.
In this work, we performed a systematic comparison of 18 major classification methods, 5 feature selection methods, and 2 accuracy metrics, using 8 datasets that span 1,802 human samples and cover several classification tasks: body site classification, subject classification, and diagnosis.
We found that random forests, support vector machines, kernel ridge regression, and Bayesian logistic regression with Laplace priors are the most effective machine learning techniques for performing accurate classification from these microbiomic data.
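The kind of comparison described above, in which several classifiers are scored on the same data under cross-validation, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the study: the synthetic two-class "abundance" data, the labels `gut` and `skin`, the two toy classifiers (a majority-class baseline and a nearest-centroid rule, neither of which is among the methods named above), 5-fold cross-validation, and the proportion of correct predictions as the single accuracy metric.

```python
import random
from collections import Counter


def make_toy_data(n_per_class=30, n_taxa=5, seed=0):
    """Synthetic relative-abundance vectors for two hypothetical body sites."""
    rng = random.Random(seed)
    X, y = [], []
    for label, shift in (("gut", 0.0), ("skin", 2.0)):
        for _ in range(n_per_class):
            # Taxon 0 is more abundant in the shifted class; the rest are noise.
            raw = [abs(rng.gauss(1.0 + (shift if j == 0 else 0.0), 0.5)) + 1e-9
                   for j in range(n_taxa)]
            total = sum(raw)
            X.append([v / total for v in raw])  # normalize to proportions
            y.append(label)
    return X, y


class MajorityClass:
    """Baseline: always predict the most frequent training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class NearestCentroid:
    """Predict the class whose mean abundance profile is closest."""

    def fit(self, X, y):
        sums, counts = {}, Counter(y)
        for row, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(row))
            for j, v in enumerate(row):
                acc[j] += v
        self.centroids_ = {c: [s / counts[c] for s in acc]
                           for c, acc in sums.items()}
        return self

    def predict(self, X):
        def sq_dist(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))
        return [min(self.centroids_, key=lambda c: sq_dist(row, self.centroids_[c]))
                for row in X]


def cross_val_accuracy(make_model, X, y, k=5, seed=0):
    """Proportion of correct predictions pooled over k held-out folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        model = make_model().fit([X[i] for i in train_idx],
                                 [y[i] for i in train_idx])
        preds = model.predict([X[i] for i in test_idx])
        correct += sum(p == y[i] for p, i in zip(preds, test_idx))
    return correct / len(X)


X, y = make_toy_data()
candidates = [("majority", MajorityClass), ("nearest_centroid", NearestCentroid)]
results = {name: cross_val_accuracy(cls, X, y) for name, cls in candidates}
for name, acc in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {acc:.3f}")
```

Every candidate is scored on the same folds (the same shuffle seed), so differences in accuracy reflect the classifiers rather than the data split. A fuller benchmark would add feature selection inside each training fold and a second metric.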