Department of Statistics, University of Nebraska-Lincoln, Lincoln, Nebraska 68583-0963, United States.
Department of Chemistry, University of Nebraska-Lincoln, Lincoln, Nebraska 68588-0304, United States.
J Proteome Res. 2019 Sep 6;18(9):3282-3294. doi: 10.1021/acs.jproteome.9b00227. Epub 2019 Aug 22.
Analytical techniques such as NMR and mass spectrometry can generate large metabolomics data sets containing thousands of spectral features derived from numerous biological observations. Multivariate data analysis is routinely used to uncover the biological information contained within these data sets. This is typically accomplished by classifying the observations into groups (e.g., control versus treated) and by identifying the features that discriminate between them. A variety of classification models are available, including well-established techniques (e.g., principal component analysis [PCA], orthogonal projections to latent structures [OPLS], or partial least-squares projection to latent structures [PLS]) and newly emerging machine learning algorithms (e.g., support vector machines or random forests). However, it is unclear which classification model, if any, is an optimal choice for the analysis of metabolomics data. Herein, we present a comprehensive evaluation of five classification models that are routinely employed in the metabolomics field and are currently available in our MVAPACK metabolomics software package. Simulated and experimental NMR data sets with various levels of group separation were used to evaluate each model. Model performance was assessed by classification accuracy rate, by the area under the receiver operating characteristic curve (AUROC), and by the identification of true discriminating features. Our findings suggest that the five classification models perform equally well with robust data sets. Only when the models are stressed with subtle data set differences does OPLS emerge as the best-performing model. When group separation was limited, OPLS maintained a high prediction accuracy rate and a large AUROC while yielding loadings closest to the true loadings.
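The two numeric performance measures named in the abstract, classification accuracy rate and AUROC, can be illustrated with a minimal Python sketch. This is not the authors' MVAPACK implementation: the function names (`accuracy`, `auroc`, `simulate_scores`), the Gaussian score simulation, and the midpoint decision threshold are all assumptions introduced here for illustration. The AUROC is computed through its standard rank-based interpretation, the probability that a randomly chosen positive observation scores higher than a randomly chosen negative one.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of observations whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def auroc(y_true, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney identity).

    Counts the fraction of (positive, negative) pairs in which the positive
    observation has the higher score; ties contribute 0.5.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def simulate_scores(n_per_group, separation, seed=0):
    """Hypothetical stand-in for a model's output scores (an assumption).

    Control scores are drawn from N(0, 1) and treated scores from
    N(separation, 1), so a smaller `separation` mimics a harder data set.
    """
    rng = random.Random(seed)
    labels = [0] * n_per_group + [1] * n_per_group
    scores = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    scores += [rng.gauss(separation, 1.0) for _ in range(n_per_group)]
    return labels, scores


if __name__ == "__main__":
    for separation in (0.5, 1.0, 3.0):
        labels, scores = simulate_scores(50, separation)
        # Assumed decision rule: threshold halfway between the group means.
        predicted = [1 if s > separation / 2 else 0 for s in scores]
        print(
            f"separation={separation}: "
            f"accuracy={accuracy(labels, predicted):.2f}, "
            f"AUROC={auroc(labels, scores):.2f}"
        )
```

Accuracy depends on the chosen threshold, whereas AUROC summarizes ranking quality across all thresholds, which is one reason the two measures are usually reported together when group separation is small.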