1 Service de Biostatistique-Bioinformatique, Pôle Santé Publique, Hospices Civils de Lyon, Lyon, France.
2 Université de Lyon, Lyon, France.
OMICS. 2019 Apr;23(4):207-213. doi: 10.1089/omi.2018.0191. Epub 2019 Feb 22.
Big Data generated by omics technologies require simultaneous analyses of large numbers of variables. This leads to complex model selection and to parameter estimates that show optimism bias. This study on simulated data sets examined optimism-bias correction by penalized regression methods in case-control studies that involve clinical and omics variables. Least absolute shrinkage and selection operator (LASSO)-based methods (LASSO-penalized logistic regression, adaptive LASSO, and regularized LASSO for selection + ridge regression) were evaluated using power, false positive rate (FPR), false discovery rate (FDR), and comparisons of estimated versus theoretical parameters. The "ordinary" LASSO overcorrected the optimism bias. The adaptive LASSO with LASSO estimation of the weights was unable to provide a sufficient correction. Importantly, the adaptive LASSO with ridge estimation of the weights showed the best parameter estimation. The regularized LASSO selection showed a slight optimism bias that decreased as the training set size increased. The optimism bias decreased as the number of variables selected among the truly differentially expressed variables increased; however, power, FPR, and FDR were correlated. A compromise between model selection and estimation accuracy should be found. These results might prove useful because Big Data analyses are becoming commonplace in omics/multiomics studies in integrative biology, precision medicine, and planetary health.
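The three selection criteria named in the abstract (power, FPR, FDR) can be illustrated with a minimal sketch. It assumes a simulation in which the set of truly associated variables is known. The function name `selection_metrics`, its arguments, and the example index sets are hypothetical and are not taken from the study.

```python
def selection_metrics(selected, truly_associated, n_variables):
    """Power, FPR and FDR of a variable selection against a known simulated truth.

    All names here are illustrative assumptions, not the authors' code.
    """
    selected = set(selected)
    truth = set(truly_associated)

    true_pos = len(selected & truth)    # truly associated variables that were selected
    false_pos = len(selected - truth)   # null variables that were selected
    n_null = n_variables - len(truth)   # number of variables with no true effect

    # Power: fraction of truly associated variables that were selected.
    power = true_pos / len(truth) if truth else float("nan")
    # FPR: fraction of null variables that were selected.
    fpr = false_pos / n_null if n_null else float("nan")
    # FDR: fraction of selected variables that are null (0 by convention if none selected).
    fdr = false_pos / len(selected) if selected else 0.0

    return {"power": power, "fpr": fpr, "fdr": fdr}


# Hypothetical example: 100 candidate variables, 4 truly associated, 5 selected.
metrics = selection_metrics(
    selected=[0, 1, 2, 10, 11],
    truly_associated=[0, 1, 2, 3],
    n_variables=100,
)
print(metrics)
```

In this toy case, selecting more variables would tend to raise power and FPR together, which is one way to read the abstract's remark that the three criteria were correlated.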