Xi Bowei, Gu Haiwei, Baniasadi Hamid, Raftery Daniel
Department of Statistics, Purdue University, 250 North University Street, West Lafayette, IN 47907, USA.
Methods Mol Biol. 2014;1198:333-53. doi: 10.1007/978-1-4939-1258-2_22.
Multivariate statistical techniques are used extensively in metabolomics studies, from biomarker selection to model building and validation. This chapter discusses two model-independent variable selection techniques, principal component analysis and two-sample t-tests, as well as classification and regression models and their model-related variable selection techniques, including partial least squares, logistic regression, support vector machines, and random forests. Model evaluation and validation methods, such as leave-one-out cross-validation, Monte Carlo cross-validation, and receiver operating characteristic analysis, are introduced with an emphasis on avoiding over-fitting the data. The advantages and limitations of the statistical techniques are also discussed.
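As a rough illustration of how the validation ideas named in the abstract fit together, the following minimal sketch runs Monte Carlo cross-validation with two-sample t-test variable selection repeated inside each training split, and summarizes test performance by the area under the ROC curve. It is not the chapter's own code. The function names, the simulated data in `make_demo_data`, and the t-weighted linear score (a simple stand-in for the chapter's classifiers) are all assumptions made for this example.

```python
import random
import statistics


def welch_t(a, b):
    """Welch two-sample t statistic for two lists of values."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se if se > 0 else 0.0


def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mccv_auc(X, y, n_splits=50, test_frac=0.3, top_k=2, seed=0):
    """Mean test AUC over random stratified train/test splits."""
    rng = random.Random(seed)
    groups = [[i for i, v in enumerate(y) if v == c] for c in (1, 0)]
    n_feat = len(X[0])
    aucs = []
    for _ in range(n_splits):
        # Hold out a random fraction of each class as the test set.
        test = set()
        for group in groups:
            test.update(rng.sample(group, max(1, round(test_frac * len(group)))))
        train = [i for i in range(len(y)) if i not in test]
        tr_pos = [X[i] for i in train if y[i] == 1]
        tr_neg = [X[i] for i in train if y[i] == 0]
        # Variable selection uses the training split only, so the
        # held-out samples never influence which variables are kept.
        t = [welch_t([r[j] for r in tr_pos], [r[j] for r in tr_neg])
             for j in range(n_feat)]
        keep = sorted(range(n_feat), key=lambda j: -abs(t[j]))[:top_k]
        # Hypothetical stand-in classifier: t-weighted linear score.
        test_idx = sorted(test)
        scores = [sum(t[j] * X[i][j] for j in keep) for i in test_idx]
        aucs.append(auc(scores, [y[i] for i in test_idx]))
    return statistics.mean(aucs)


def make_demo_data(n_per_class=20, n_feat=5, shift=3.0, seed=1):
    """Simulated data (an assumption): only variable 0 differs by class."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (1, 0):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_feat)]
            row[0] += shift * label
            X.append(row)
            y.append(label)
    return X, y


X, y = make_demo_data()
print("Mean held-out AUC:", mccv_auc(X, y))
```

Selecting variables on the full data set before splitting would let the test samples leak into the model and inflate the estimated AUC, which is the kind of over-fitting the abstract warns against.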