Xu Hui, Gu Xiangdong, Tadesse Mahlet G, Balasubramanian Raji
Department of Biostatistics and Epidemiology, University of Massachusetts Amherst, Amherst, MA 01003.
Department of Mathematics and Statistics, Georgetown University, Washington, DC 20057.
J Comput Graph Stat. 2018;27(4):763-772. doi: 10.1080/10618600.2018.1474115. Epub 2018 Aug 20.
We present an ensemble tree-based algorithm for variable selection in high-dimensional datasets, in settings where a time-to-event outcome is observed with error. The proposed methods are motivated by self-reported outcomes collected in large-scale epidemiologic studies, such as the Women's Health Initiative. They apply equally to imperfect outcomes that arise in other settings, such as data extracted from electronic medical records. To evaluate the performance of the proposed algorithm, we present results from simulation studies that consider both continuous and categorical covariates. We illustrate the approach by applying it to discover single nucleotide polymorphisms associated with incident Type II diabetes in the Women's Health Initiative. A freely available R package (R Core Team, 2018; Xu et al., 2018) has been developed to implement the proposed methods.