Buch Gregor, Schulz Andreas, Schmidtmann Irene, Strauch Konstantin, Wild Philipp S
Preventive Cardiology and Preventive Medicine, Department of Cardiology, University Medical Center of the Johannes Gutenberg University Mainz, Mainz, Germany.
German Center for Cardiovascular Research (DZHK), partner site Rhine-Main, Mainz, Germany.
Stat Med. 2023 Feb 10;42(3):331-352. doi: 10.1002/sim.9620. Epub 2022 Dec 22.
This review condenses the knowledge on variable selection methods that are implemented in R and appropriate for datasets with grouped features. The focus is on regularized regressions identified through a systematic review of the literature, following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. A total of 14 methods are discussed, most of which use penalty terms to perform group variable selection. Depending on how the methods account for the group structure, they can be classified into knowledge-driven and data-driven approaches. The former category encompasses group-level and bi-level selection methods, while two-step approaches and collinearity-tolerant methods constitute the latter. The identified methods are briefly explained and their performance is compared in a simulation study. This comparison demonstrated that group-level selection methods, such as the group minimax concave penalty, are superior to other methods in selecting relevant variable groups but are inferior in identifying important individual variables in scenarios where not all variables in the groups are predictive. The latter task is better achieved by bi-level selection methods such as group bridge. Two-step and collinearity-tolerant approaches, such as elastic net and ordered homogeneity pursuit least absolute shrinkage and selection operator (LASSO), are inferior to knowledge-driven methods but provide results without requiring prior knowledge. Possible applications in proteomics are considered, leading to suggestions on which method to use depending on the existing prior knowledge and the research question.
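The contrast between selecting whole groups and selecting individual variables can be illustrated with a minimal sketch. It is not taken from the review: it uses the plain group lasso shrinkage step (without the usual group-size weighting) as a simpler stand-in for penalties such as the group minimax concave penalty, and the coefficient values, group names, and the threshold `lam` are hypothetical.

```python
import math

def soft_threshold(z, lam):
    """Lasso shrinkage step: shrink a single coefficient toward zero."""
    return math.copysign(max(abs(z) - lam, 0.0), z)

def group_soft_threshold(group, lam):
    """Group lasso shrinkage step: shrink a whole group by its Euclidean norm."""
    norm = math.sqrt(sum(v * v for v in group))
    if norm <= lam:
        # The group as a whole is too weak: every member is dropped.
        return [0.0] * len(group)
    scale = 1.0 - lam / norm
    # The group survives: every member is kept, including weak ones.
    return [scale * v for v in group]

# Hypothetical unpenalized coefficients: two groups of three variables each.
# Group A contains one strong and two weak variables; group B is weak overall.
groups = {"A": [2.0, 0.1, -0.1], "B": [0.2, -0.1, 0.1]}
lam = 0.5  # assumed penalty strength

group_level = {g: group_soft_threshold(v, lam) for g, v in groups.items()}
individual = {g: [soft_threshold(x, lam) for x in v] for g, v in groups.items()}

for name in groups:
    print(name, "group-level:", group_level[name], "individual:", individual[name])
```

In this toy setting the group-level step removes group B entirely but retains all three variables of group A, including the two weak ones, whereas the variable-wise step keeps only the strong variable. This mirrors the situation described above in which not all variables in a selected group are predictive.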