Department of Biopharmaceutics and Pharmacodynamics, Medical University of Gdańsk, Gdańsk, Poland.
Front Mol Biosci. 2016 Jul 26;3:35. doi: 10.3389/fmolb.2016.00035. eCollection 2016.
Non-targeted metabolomics is a part of systems biology and aims to determine numerous metabolites in complex biological samples. Datasets obtained in non-targeted metabolomics studies are high-dimensional due to the sensitivity of mass spectrometry-based detection methods as well as the complexity of biological matrices. Therefore, proper selection of the variables that contribute to group classification is a crucial step, especially in metabolomics studies focused on searching for disease biomarker candidates. In the present study, three different statistical approaches were tested using two metabolomics datasets (RH and PH study). Orthogonal projections to latent structures-discriminant analysis (OPLS-DA) without and with multiple testing correction, as well as the least absolute shrinkage and selection operator (LASSO) with bootstrapping, were tested and compared. For the RH study, the OPLS-DA model built without multiple testing correction selected 46 and 218 variables based on the variable importance in projection (VIP) criteria using Pareto and unit variance (UV) scaling, respectively. For the PH study, 217 and 320 variables were selected based on the VIP criteria using Pareto and UV scaling, respectively. In the RH study, the OPLS-DA model built after correcting for multiple testing selected 4 and 19 variables using Pareto and UV scaling, respectively. For the PH study, 14 and 18 variables were selected based on the VIP criteria using Pareto and UV scaling, respectively. In the RH and PH studies, the LASSO selected 14 and 4 variables, respectively, with reproducibility between 99.3 and 100%. For PLS-based models, the larger the search space, the higher the probability of developing models that fit the training data well but perform poorly on the validation set. The LASSO offers potential improvements over standard linear regression due to the presence of the constraint, which promotes sparse solutions.
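To make the sparsity-promoting constraint concrete, the standard textbook form of LASSO-penalized logistic regression can be written as below. The notation is an assumption introduced here for illustration (n samples, p variables, class labels y_i in {0, 1}, intercept β_0, coefficients β_j, tuning parameter λ ≥ 0); the abstract does not state the exact formulation used in the study.

$$
\hat{\beta} = \arg\min_{\beta_0,\,\beta} \left\{ -\frac{1}{n} \sum_{i=1}^{n} \Big[ y_i \big(\beta_0 + x_i^{\top}\beta\big) - \log\!\big(1 + e^{\beta_0 + x_i^{\top}\beta}\big) \Big] + \lambda \sum_{j=1}^{p} |\beta_j| \right\}
$$

This penalized form corresponds to minimizing the negative log-likelihood subject to the constraint $\sum_{j=1}^{p} |\beta_j| \le t$ for some bound $t$. Because the penalty is an absolute value rather than a square, a sufficiently large λ drives some coefficients exactly to zero, and the variables with non-zero coefficients are the ones that count as selected.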
This paper is the first to date to apply LASSO-penalized logistic regression in untargeted metabolomics studies.
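For readers unfamiliar with the two scaling schemes named in the abstract, the following is a minimal sketch of how they are commonly defined: both mean-center each variable, after which UV scaling divides by the standard deviation and Pareto scaling divides by its square root. The function name `scale_column` and the example values are hypothetical and are not taken from the study.

```python
import math
import statistics


def scale_column(values, method="uv"):
    """Mean-center one variable and divide by a scaling factor.

    method="uv":     divide by the sample standard deviation (unit variance).
    method="pareto": divide by the square root of the standard deviation.
    Hypothetical helper for illustration only.
    """
    if method not in ("uv", "pareto"):
        raise ValueError("method must be 'uv' or 'pareto'")
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("a constant variable cannot be scaled")
    divisor = sd if method == "uv" else math.sqrt(sd)
    return [(v - mean) / divisor for v in values]


# Made-up intensities of one metabolite feature across three samples.
intensities = [2.0, 4.0, 6.0]
uv_scaled = scale_column(intensities, "uv")
pareto_scaled = scale_column(intensities, "pareto")
```

After UV scaling every variable has unit standard deviation, so low-variance features carry the same weight as high-variance ones. Pareto scaling only partly reduces the differences in variance, so features with large variance keep more influence on the model.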