Wang Haohan, Aragam Bryon, Xing Eric P
Language Technologies Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA.
Machine Learning Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA.
Proceedings (IEEE Int Conf Bioinformatics Biomed). 2017 Nov;2017:431-438. doi: 10.1109/BIBM.2017.8217687. Epub 2017 Dec 18.
A fundamental challenge in modern datasets of ever-increasing dimensionality is variable selection, which has attracted renewed interest due to the growth of biological and medical datasets with complex, non-i.i.d. structures. Naïvely applying classical variable selection methods such as the Lasso to such datasets may lead to a large number of false discoveries. Motivated by genome-wide association studies in genetics, we study the problem of variable selection for datasets arising from multiple subpopulations, when this underlying population structure is unknown to the researcher. We propose a unified framework for sparse variable selection that adaptively corrects for population structure via a low-rank linear mixed model. Most importantly, the proposed method does not require prior knowledge of individual relationships in the data and adaptively selects a covariance structure of the correct complexity. Through extensive experiments, we illustrate the effectiveness of this framework over existing methods. Further, we test our method on three different genomic datasets from plants, mice, and humans, and discuss the findings obtained with our model.
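The general idea in the abstract (correct for population structure with a low-rank covariance, then run a sparse selector) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the covariance is modeled as V = I + delta * U diag(S) U^T, where U and S are the top eigenvectors and eigenvalues of K = X X^T / p. The function names `low_rank_whitener` and `lasso_cd`, the fixed `rank` and `delta` arguments, and the simulated data are all hypothetical; the abstract states that the method selects the covariance complexity adaptively, which this sketch does not attempt.

```python
import numpy as np


def low_rank_whitener(X, rank, delta=1.0):
    """Return a function that multiplies by V^{-1/2}, where
    V = I + delta * U diag(S) U^T and (U, S) are the top-`rank`
    eigenpairs of K = X X^T / p. `rank` and `delta` are assumed fixed here."""
    n, p = X.shape
    K = X @ X.T / p
    evals, evecs = np.linalg.eigh(K)            # eigenvalues in ascending order
    top = np.argsort(evals)[::-1][:rank]        # indices of the largest eigenvalues
    S = np.clip(evals[top], 0.0, None)
    U = evecs[:, top]
    # V^{-1/2} = I + U diag((1 + delta*S)^{-1/2} - 1) U^T for orthonormal U
    w = 1.0 / np.sqrt(1.0 + delta * S) - 1.0

    def apply(M):
        return M + (U * w) @ (U.T @ M)

    return apply


def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n) * ||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = np.array(y, dtype=float)
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            if new != beta[j]:
                resid -= X[:, j] * (new - beta[j])
                beta[j] = new
    return beta


if __name__ == "__main__":
    # Hypothetical simulation: two subpopulations with shifted feature means
    # and a subpopulation-level offset in the response.
    rng = np.random.default_rng(0)
    n, p = 100, 200
    group = rng.integers(0, 2, size=n)
    X = rng.standard_normal((n, p)) + 0.8 * group[:, None] * rng.standard_normal(p)
    beta_true = np.zeros(p)
    beta_true[:5] = 1.0
    y = X @ beta_true + 2.0 * group + 0.5 * rng.standard_normal(n)

    # Center, whiten with a rank-1 structure term, then run the Lasso.
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    whiten = low_rank_whitener(Xc, rank=1, delta=5.0)
    beta_hat = lasso_cd(whiten(Xc), whiten(yc), lam=0.1)
    print("selected variables:", np.flatnonzero(beta_hat))
```

Whitening both the design matrix and the response by the same V^{-1/2} is the standard way to reduce a linear mixed model with known covariance to an ordinary regression, after which any sparse selector can be applied. In practice `rank` and `delta` would need to be estimated from the data rather than fixed by hand.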