Elliott Corrine F, Lambert Joshua W, Stromberg Arnold J, Wang Pei, Zeng Ting, Thompson Katherine L
Department of Statistics, University of Kentucky, Lexington, KY, USA.
College of Nursing, University of Cincinnati, Cincinnati, OH, USA.
J Appl Stat. 2020 Jun 29;48(11):2022-2041. doi: 10.1080/02664763.2020.1783522. eCollection 2021.
As new technologies permit the generation of hitherto unprecedented volumes of data (e.g. genome-wide association study data), researchers struggle to keep up with the added complexity and time commitment required for their analysis. For this reason, model selection commonly relies on machine learning and data-reduction techniques, which tend to yield models with obscure interpretations. Even in cases with straightforward explanatory variables, the so-called 'best' model produced by a given model-selection technique may fail to capture information of vital importance to the domain-specific questions at hand. Herein we propose a new concept for model selection for use in identifying multiple models that are in some sense optimal and may unite to provide a wider range of information relevant to the topic of interest, including (but not limited to) interaction terms. We further provide an R package and associated Shiny applications for use in identifying or validating feasible models, the performance of which we demonstrate on both simulated and real-life data.