

Variable Selection in Heterogeneous Datasets: A Truncated-rank Sparse Linear Mixed Model with Applications to Genome-wide Association Studies.

Authors

Wang Haohan, Aragam Bryon, Xing Eric P

Affiliations

Language Technologies Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA.

Machine Learning Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA.

Publication information

Proceedings (IEEE Int Conf Bioinformatics Biomed). 2017 Nov;2017:431-438. doi: 10.1109/BIBM.2017.8217687. Epub 2017 Dec 18.

Abstract

A fundamental and important challenge in modern datasets of ever increasing dimensionality is variable selection, which has taken on renewed interest recently due to the growth of biological and medical datasets with complex, non-i.i.d. structures. Naïvely applying classical variable selection methods such as the Lasso to such datasets may lead to a large number of false discoveries. Motivated by genome-wide association studies in genetics, we study the problem of variable selection for datasets arising from multiple subpopulations, when this underlying population structure is unknown to the researcher. We propose a unified framework for sparse variable selection that adaptively corrects for population structure via a low-rank linear mixed model. Most importantly, the proposed method does not require prior knowledge of individual relationships in the data and adaptively selects a covariance structure of the correct complexity. Through extensive experiments, we illustrate the effectiveness of this framework over existing methods. Further, we test our method on three different genomic datasets from plants, mice, and humans, and discuss the knowledge we discover with our model.


