

Variable selection for inferential models with relatively high-dimensional data: Between method heterogeneity and covariate stability as adjuncts to robust selection.

Affiliations

School of Veterinary Medicine and Science, University of Nottingham, Sutton Bonington Campus, Leicestershire, LE12 5RD, United Kingdom.

OIE, World Organisation for Animal Health, 12 rue de Prony, 75017 Paris, France.

Publication information

Sci Rep. 2020 May 14;10(1):8002. doi: 10.1038/s41598-020-64829-0.

Abstract

Variable selection in inferential modelling is problematic when the number of variables is large relative to the number of data points, especially when multicollinearity is present. A variety of techniques have been described to identify 'important' subsets of variables from within a large parameter space, but these may produce different results, which creates difficulties with inference and reproducibility. Our aim was to evaluate the extent to which variable selection would change depending on statistical approach, and whether triangulation across methods could enhance data interpretation. A real dataset containing 408 subjects, 337 explanatory variables and a normally distributed outcome was used. We show that, with model hyperparameters optimised to minimise cross-validation error, ten methods of automated variable selection produced markedly different results; different variables were selected and model sparsity varied greatly. Comparison between multiple methods provided valuable additional insights. Two variables that were consistently selected and stable across all methods accounted for the majority of the explainable variability; these were the most plausible important candidate variables. Further variables of importance were identified by evaluating selection stability across all methods. In conclusion, triangulation of results across methods, including use of covariate stability, can greatly enhance data interpretation and confidence in variable selection.
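The triangulation idea in the abstract can be illustrated with a minimal sketch (not the authors' code): two cross-validation-tuned penalised-regression selectors are compared on the same data, consensus variables are taken as the most plausible candidates, and covariate stability is estimated as selection frequency across bootstrap resamples. The data here are synthetic and the choice of lasso and elastic net is an assumption for illustration, not the paper's full set of ten methods.

```python
# Toy analogue of between-method triangulation and covariate stability.
# Assumed setup: synthetic data with three truly active covariates.
import numpy as np
from sklearn.linear_model import LassoCV, ElasticNetCV

rng = np.random.default_rng(0)
n, p = 200, 50                                    # subjects, candidate covariates
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]                       # truly important variables
y = X @ beta + rng.normal(size=n)

def selected(model, X, y):
    """Indices of covariates with non-zero coefficients after a CV-tuned fit."""
    return set(np.flatnonzero(model.fit(X, y).coef_))

sel_lasso = selected(LassoCV(cv=5, random_state=0), X, y)
sel_enet = selected(ElasticNetCV(cv=5, random_state=0), X, y)

# Between-method agreement: variables chosen by every method are the most
# plausible candidates; method-specific picks warrant more caution.
consensus = sel_lasso & sel_enet
print("lasso selects:    ", sorted(sel_lasso))
print("elastic net selects:", sorted(sel_enet))
print("consensus:        ", sorted(consensus))

# Covariate stability: how often each variable is selected over bootstraps.
B = 30
counts = np.zeros(p)
for _ in range(B):
    idx = rng.integers(0, n, n)
    counts[list(selected(LassoCV(cv=5), X[idx], y[idx]))] += 1
stability = counts / B
print("stable covariates (freq > 0.8):", sorted(np.flatnonzero(stability > 0.8)))
```

Variables that are both in the cross-method consensus and highly stable under resampling correspond to the abstract's "consistently selected and stable" candidates; the true signals (indices 0, 1, 2) should satisfy both criteria here.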


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c6e8/7224285/28ddd5eb14ca/41598_2020_64829_Fig1_HTML.jpg
