Department of Molecular Sciences, Swedish University of Agricultural Sciences, Uppsala SE-750 07, Sweden.
Department of Biology and Biological Engineering, Food and Nutrition Science, Chalmers University of Technology, Gothenburg SE-412 96, Sweden.
Bioinformatics. 2019 Mar 15;35(6):972-980. doi: 10.1093/bioinformatics/bty710.
Validation of variable selection and predictive performance is crucial in the construction of robust multivariate models that generalize well, minimize overfitting and facilitate interpretation of results. Inappropriate variable selection instead leads to selection bias, thereby increasing the risk of model overfitting and false positive discoveries. Although several algorithms exist to identify a minimal set of most informative variables (i.e. the minimal-optimal problem), few can select all variables related to the research question (i.e. the all-relevant problem). Robust algorithms combining identification of both minimal-optimal and all-relevant variables with proper cross-validation are urgently needed.
We developed the MUVR algorithm to improve predictive performance and minimize overfitting and false positives in multivariate analysis. In the MUVR algorithm, minimal variable selection is achieved by performing recursive variable elimination within a repeated double cross-validation (rdCV) procedure. The algorithm supports partial least squares and random forest modelling, and simultaneously identifies minimal-optimal and all-relevant variable sets for regression, classification and multilevel analyses. On three authentic omics datasets, MUVR yielded parsimonious models with minimal overfitting and improved model performance compared with state-of-the-art rdCV. Moreover, MUVR showed advantages over other variable selection algorithms, namely Boruta and VSURF, including a validation scheme integrated with variable selection and wider applicability.
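The core idea described above, recursive variable elimination nested inside the outer loop of a repeated double cross-validation, can be sketched as follows. This is an illustrative Python/scikit-learn sketch, not the authors' R implementation: the function name `rdcv_rfe`, the parameter defaults, and the use of the random forest out-of-bag score as a stand-in for MUVR's inner cross-validation loop are all assumptions made for brevity.

```python
# Illustrative sketch of recursive variable elimination inside repeated
# double cross-validation (rdCV), in the spirit of MUVR. NOT the authors'
# code: the OOB score replaces the inner CV loop for simplicity.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold


def rdcv_rfe(X, y, n_rep=2, n_outer=4, var_ratio=0.75):
    """Return per-variable selection frequency and mean outer-loop accuracy."""
    n_vars = X.shape[1]
    freq = np.zeros(n_vars)   # how often each variable survives selection
    accs = []                 # outer test-segment accuracies
    for rep in range(n_rep):
        outer = StratifiedKFold(n_outer, shuffle=True, random_state=rep)
        for tr, te in outer.split(X, y):
            keep = np.arange(n_vars)
            best_keep, best_score = keep.copy(), -np.inf
            while True:
                # Fit on the outer training segment; OOB score stands in
                # for the inner validation loop of the real algorithm.
                rf = RandomForestClassifier(n_estimators=100, oob_score=True,
                                            random_state=rep)
                rf.fit(X[tr][:, keep], y[tr])
                if rf.oob_score_ > best_score:
                    best_score, best_keep = rf.oob_score_, keep.copy()
                if len(keep) <= 2:
                    break
                # Keep the top var_ratio fraction by importance, drop the rest.
                order = np.argsort(rf.feature_importances_)[::-1]
                n_keep = max(2, int(round(len(keep) * var_ratio)))
                if n_keep == len(keep):
                    n_keep = len(keep) - 1
                keep = keep[order[:n_keep]]
            freq[best_keep] += 1
            # Validate the selected subset on the untouched outer test segment.
            rf = RandomForestClassifier(n_estimators=100, random_state=rep)
            rf.fit(X[tr][:, best_keep], y[tr])
            accs.append(rf.score(X[te][:, best_keep], y[te]))
    return freq / (n_rep * n_outer), float(np.mean(accs))


# Toy usage on simulated data with 5 informative variables out of 30.
X, y = make_classification(n_samples=120, n_features=30, n_informative=5,
                           random_state=1)
sel_freq, outer_acc = rdcv_rfe(X, y)
```

The key property the sketch preserves is that variable elimination happens strictly inside each outer training segment, so the outer test segments never influence selection; this separation is what keeps the reported performance estimate free of selection bias.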
Algorithms, data, scripts and tutorial are open source and available as an R package ('MUVR') at https://gitlab.com/CarlBrunius/MUVR.git.
Supplementary data are available at Bioinformatics online.