Ye Shangyuan, Yu Tingting, Caroff Daniel A, Huang Susan S, Zhang Bo, Wang Rui
Biostatistics Shared Resource, Knight Cancer Institute, Oregon Health & Science University, Oregon, U.S.A.
Department of Population Medicine, Harvard Pilgrim Health Care Institute and Harvard Medical School, Massachusetts, U.S.A.
Can J Stat. 2025 Mar;53(1). doi: 10.1002/cjs.11824. Epub 2024 Aug 1.
In many biomedical applications, there is a need to build risk-adjustment models based on clustered data. However, methods for variable selection that are applicable to clustered discrete data settings with a large number of candidate variables and potentially large cluster sizes are lacking. We develop a new variable selection approach that combines within-cluster resampling techniques with penalized likelihood methods to select variables for high-dimensional clustered data. We derive an upper bound on the expected number of falsely selected variables, demonstrate the oracle properties of the proposed method, and evaluate the finite sample performance of the method through extensive simulations. We illustrate the proposed approach using a colon surgical site infection data set consisting of 39,468 individuals from 149 hospitals to build risk-adjustment models that account for both the main effects of various risk factors and their two-way interactions.
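As a rough illustration of the idea described above, the following minimal sketch draws one observation per cluster, fits an L1-penalized logistic regression to each resampled data set, and keeps the variables selected in a large share of resamples. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names `lasso_logistic` and `wcr_select`, the proximal-gradient solver, the penalty level `lam`, the selection-frequency rule with `threshold`, and the simulated data. The authors' actual penalty, aggregation rule, and tuning may differ.

```python
import numpy as np


def lasso_logistic(X, y, lam, n_iter=300):
    """L1-penalized logistic regression fitted by proximal gradient descent.

    Hypothetical helper for this sketch; the intercept is left unpenalized.
    """
    n, p = X.shape
    beta = np.zeros(p)
    b0 = 0.0
    # Step size 1/L, where L bounds the curvature of the average logistic loss.
    L = 0.25 * (np.linalg.norm(X, 2) ** 2 / n + 1.0)
    step = 1.0 / L
    for _ in range(n_iter):
        eta = np.clip(X @ beta + b0, -30.0, 30.0)
        resid = 1.0 / (1.0 + np.exp(-eta)) - y
        grad = X.T @ resid / n
        b0 -= step * resid.mean()
        z = beta - step * grad
        # Soft-thresholding is the proximal operator of the L1 penalty.
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return beta


def wcr_select(X, y, cluster, lam=0.1, n_resamples=30, threshold=0.5, seed=0):
    """Within-cluster resampling followed by penalized selection (assumed rule).

    Each resample keeps one randomly chosen observation per cluster, so the
    resampled observations come from distinct clusters.
    """
    rng = np.random.default_rng(seed)
    members = [np.flatnonzero(cluster == c) for c in np.unique(cluster)]
    counts = np.zeros(X.shape[1])
    for _ in range(n_resamples):
        idx = np.array([rng.choice(m) for m in members])
        beta = lasso_logistic(X[idx], y[idx], lam)
        counts += beta != 0
    freq = counts / n_resamples
    # Assumed aggregation: keep variables selected in at least `threshold` of resamples.
    return np.flatnonzero(freq >= threshold), freq


# Hypothetical simulated data: 200 clusters of varying size, 10 candidate
# variables, of which only the first two have nonzero effects.
rng = np.random.default_rng(1)
n_clusters, p = 200, 10
sizes = rng.integers(2, 11, size=n_clusters)
cluster = np.repeat(np.arange(n_clusters), sizes)
X = rng.standard_normal((cluster.size, p))
beta_true = np.zeros(p)
beta_true[:2] = [1.5, -1.5]
u = 0.5 * rng.standard_normal(n_clusters)[cluster]  # cluster random intercept
prob = 1.0 / (1.0 + np.exp(-(X @ beta_true + u)))
y = rng.binomial(1, prob).astype(float)

selected, freq = wcr_select(X, y, cluster)
print("selected variable indices:", selected)
print("selection frequencies:", np.round(freq, 2))
```

Sampling a single observation per cluster is one common form of within-cluster resampling. It sidesteps modelling the within-cluster correlation in each fit, at the cost of using only part of the data per resample, which is why the fits are repeated and aggregated.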