Ouyang Zhongzhe, Wang Lu
Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA.
Mathematics (Basel). 2024 Apr;12(7). doi: 10.3390/math12070951. Epub 2024 Mar 23.
When integrating data from multiple sources, a common challenge is block-wise missingness. Most existing methods address this issue only in cross-sectional studies. In this paper, we propose a method for variable selection when combining datasets from multiple sources in longitudinal studies. To account for block-wise missingness in covariates, we impute the missing values multiple times based on combinations of samples from different missing patterns and predictors from different data sources. We then use these imputed data to construct estimating equations and aggregate the information across subjects and sources with the generalized method of moments. We employ the smoothly clipped absolute deviation (SCAD) penalty for variable selection and use the extended Bayesian information criterion (EBIC) for tuning parameter selection. We establish the asymptotic properties of the proposed estimator and demonstrate the superior performance of the proposed method through numerical experiments. Furthermore, we apply the proposed method to the Alzheimer's Disease Neuroimaging Initiative study to identify sensitive early-stage biomarkers of Alzheimer's disease, which is crucial for early disease detection and personalized treatment.
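For readers unfamiliar with the smoothly clipped absolute deviation penalty named in the abstract, the following is a minimal sketch of its standard textbook form for a single coefficient. It is not the authors' implementation. The function names `scad_penalty` and `scad_derivative` and the default shape parameter `a = 3.7` (a conventional choice in the literature) are assumptions introduced here for illustration. The abstract does not state which values the paper uses.

```python
def scad_penalty(theta, lam, a=3.7):
    """Standard SCAD penalty for one coefficient (illustrative sketch).

    theta: coefficient value; lam: tuning parameter (>= 0);
    a: shape parameter (> 2). The default 3.7 is an assumed conventional value.
    """
    if lam < 0 or a <= 2:
        raise ValueError("require lam >= 0 and a > 2")
    t = abs(theta)
    if t <= lam:
        # Near zero the penalty is linear, like the lasso.
        return lam * t
    if t <= a * lam:
        # Quadratic transition that gradually relaxes the penalty.
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    # Large coefficients receive a constant penalty, so they are not shrunk further.
    return lam * lam * (a + 1) / 2


def scad_derivative(theta, lam, a=3.7):
    """Derivative of the SCAD penalty with respect to |theta|."""
    if lam < 0 or a <= 2:
        raise ValueError("require lam >= 0 and a > 2")
    t = abs(theta)
    if t <= lam:
        return lam
    return max(a * lam - t, 0.0) / (a - 1)


if __name__ == "__main__":
    for value in (0.5, 2.0, 10.0):
        print(value, scad_penalty(value, 1.0), scad_derivative(value, 1.0))
```

The derivative falls to zero once the coefficient exceeds `a * lam` in absolute value, which is why SCAD leaves large coefficients nearly unpenalized while still setting small ones to zero. In practice `lam` would be chosen by a criterion such as the extended Bayesian information criterion mentioned in the abstract. That selection step is not sketched here because the abstract does not give its exact form.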