Variable Selection in Heterogeneous Datasets: A Truncated-rank Sparse Linear Mixed Model with Applications to Genome-wide Association Studies.

Authors

Wang Haohan, Aragam Bryon, Xing Eric P

Affiliations

Language Technologies Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA.

Machine Learning Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA.

Publication information

Proceedings (IEEE Int Conf Bioinformatics Biomed). 2017 Nov;2017:431-438. doi: 10.1109/BIBM.2017.8217687. Epub 2017 Dec 18.

DOI:10.1109/BIBM.2017.8217687

PMID:29629235

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5889139/

Abstract

A fundamental and important challenge in modern datasets of ever increasing dimensionality is variable selection, which has taken on renewed interest recently due to the growth of biological and medical datasets with complex, non-i.i.d. structures. Naïvely applying classical variable selection methods such as the Lasso to such datasets may lead to a large number of false discoveries. Motivated by genome-wide association studies in genetics, we study the problem of variable selection for datasets arising from multiple subpopulations, when this underlying population structure is unknown to the researcher. We propose a unified framework for sparse variable selection that adaptively corrects for population structure via a low-rank linear mixed model. Most importantly, the proposed method does not require prior knowledge of individual relationships in the data and adaptively selects a covariance structure of the correct complexity. Through extensive experiments, we illustrate the effectiveness of this framework over existing methods. Further, we test our method on three different genomic datasets from plants, mice, and humans, and discuss the knowledge we discover with our model.
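
The idea in the abstract (correct for population structure with a low-rank mixed-model covariance, then run sparse selection) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the covariance is built from the top `s` eigen-directions of an empirical kinship matrix `X X^T / p`, that the data are whitened with that covariance, and that a plain Lasso (solved here by ISTA) is run on the whitened data. The names `truncated_rank_rotation`, `lasso_ista`, `sparse_lmm_select` and the parameters `s`, `delta`, `lam` are hypothetical. The abstract says the covariance complexity is chosen adaptively; here `s` and `delta` are set by hand.

```python
import numpy as np


def truncated_rank_rotation(X, s, delta=1.0):
    """Whitening matrix W for the assumed covariance K_s + delta * I,
    where K_s keeps only the top-s eigenpairs of K = X X^T / p."""
    n, p = X.shape
    K = X @ X.T / p
    eigvals, eigvecs = np.linalg.eigh(K)           # ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    trunc = np.clip(eigvals, 0.0, None)
    trunc[s:] = 0.0                                # truncate the rank to s
    scale = 1.0 / np.sqrt(trunc + delta)
    return scale[:, None] * eigvecs.T              # W = diag(scale) U^T


def lasso_ista(X, y, lam, n_iter=500):
    """Lasso for (1/2n)||y - X b||^2 + lam * ||b||_1 via proximal gradient."""
    n, p = X.shape
    beta = np.zeros(p)
    step = n / (np.linalg.norm(X, 2) ** 2)         # 1 / Lipschitz constant
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        z = beta - step * grad
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return beta


def sparse_lmm_select(X, y, s, delta=1.0, lam=0.1):
    """Indices of variables with nonzero Lasso coefficients after whitening."""
    W = truncated_rank_rotation(X, s, delta)
    beta = lasso_ista(W @ X, W @ y, lam)
    return np.flatnonzero(beta)


if __name__ == "__main__":
    # Toy data: two subpopulations with different allele frequencies and a
    # phenotype shifted by subpopulation membership (all values illustrative).
    rng = np.random.default_rng(0)
    n, p = 60, 200
    group = np.repeat([0, 1], n // 2)
    freqs = rng.uniform(0.1, 0.9, size=(2, p))
    X = rng.binomial(2, freqs[group]).astype(float)
    X -= X.mean(axis=0)
    beta_true = np.zeros(p)
    beta_true[:5] = 1.0
    y = X @ beta_true + 2.0 * group + rng.normal(scale=0.5, size=n)
    y -= y.mean()
    print(sparse_lmm_select(X, y, s=2, delta=1.0, lam=0.2))
```

In this sketch, shrinking `s` toward zero reduces the procedure to an ordinary Lasso, while a larger `s` absorbs more of the shared structure among individuals into the covariance before selection.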

Similar articles

Variable Selection in Heterogeneous Datasets: A Truncated-rank Sparse Linear Mixed Model with Applications to Genome-wide Association Studies.

Proceedings (IEEE Int Conf Bioinformatics Biomed). 2017 Nov;2017:431-438. doi: 10.1109/BIBM.2017.8217687. Epub 2017 Dec 18.

Variable selection in heterogeneous datasets: A truncated-rank sparse linear mixed model with applications to genome-wide association studies.

Methods. 2018 Aug 1;145:2-9. doi: 10.1016/j.ymeth.2018.04.021. Epub 2018 Apr 27.

Fast and efficient correction for population stratification in multi-locus genome-wide association studies.

Genetica. 2021 Dec;149(5-6):313-325. doi: 10.1007/s10709-021-00129-3. Epub 2021 Sep 4.

Learning mixed graphical models with separate sparsity parameters and stability-based model selection.

BMC Bioinformatics. 2016 Jun 6;17 Suppl 5(Suppl 5):175. doi: 10.1186/s12859-016-1039-0.

Part 1. Statistical Learning Methods for the Effects of Multiple Air Pollution Constituents.

Res Rep Health Eff Inst. 2015 Jun(183 Pt 1-2):5-50.

Precision Lasso: accounting for correlations and linear dependencies in high-dimensional genomic data.

Bioinformatics. 2019 Apr 1;35(7):1181-1187. doi: 10.1093/bioinformatics/bty750.

Statistical integration of two omics datasets using GO2PLS.

BMC Bioinformatics. 2021 Mar 18;22(1):131. doi: 10.1186/s12859-021-03958-3.

Sparse latent factor regression models for genome-wide and epigenome-wide association studies.

Stat Appl Genet Mol Biol. 2022 Mar 7;21(1):sagmb-2021-0035. doi: 10.1515/sagmb-2021-0035.

Combining Sparse Group Lasso and Linear Mixed Model Improves Power to Detect Genetic Variants Underlying Quantitative Traits.

Front Genet. 2019 Apr 10;10:271. doi: 10.3389/fgene.2019.00271. eCollection 2019.

Regularized multi-trait multi-locus linear mixed models for genome-wide association studies and genomic selection in crops.

BMC Bioinformatics. 2023 Oct 26;24(1):399. doi: 10.1186/s12859-023-05519-2.

Cited by

FedGMMAT: Federated generalized linear mixed model association tests.

PLoS Comput Biol. 2024 Jul 24;20(7):e1012142. doi: 10.1371/journal.pcbi.1012142. eCollection 2024 Jul.

Deep Learning for Genomics: From Early Neural Nets to Modern Large Language Models.

Int J Mol Sci. 2023 Nov 1;24(21):15858. doi: 10.3390/ijms242115858.

Trade-offs of Linear Mixed Models in Genome-Wide Association Studies.

J Comput Biol. 2022 Mar;29(3):233-242. doi: 10.1089/cmb.2021.0157. Epub 2022 Feb 25.

Sparse Regression in Cancer Genomics: Comparing Variable Selection and Predictions in Real World Data.

Cancer Inform. 2021 Nov 27;20:11769351211056298. doi: 10.1177/11769351211056298. eCollection 2021.

Coupled mixed model for joint genetic analysis of complex disorders with two independently collected data sets.

BMC Bioinformatics. 2021 Feb 5;22(1):50. doi: 10.1186/s12859-021-03959-2.

Discovering weaker genetic associations guided by known associations.

BMC Med Genomics. 2020 Feb 24;13(Suppl 3):19. doi: 10.1186/s12920-020-0667-4.

Long noncoding RNA LINC00341 promotes the vascular smooth muscle cells proliferation and migration via miR-214/FOXO4 feedback loop.

Am J Transl Res. 2019 Mar 15;11(3):1835-1842. eCollection 2019.

Transition-transversion encoding and genetic relationship metric in ReliefF feature selection improves pathway enrichment in GWAS.

BioData Min. 2018 Nov 3;11:23. doi: 10.1186/s13040-018-0186-4. eCollection 2018.

In Search of Biomarkers for Pathogenesis and Control of Leishmaniasis by Global Analyses of Leishmania-Infected Macrophages.

Front Cell Infect Microbiol. 2018 Sep 19;8:326. doi: 10.3389/fcimb.2018.00326. eCollection 2018.

Multiplex confounding factor correction for genomic association mapping with squared sparse linear mixed model.

Methods. 2018 Aug 1;145:33-40. doi: 10.1016/j.ymeth.2018.04.020. Epub 2018 Apr 27.
