

A fast and scalable framework for large-scale and ultrahigh-dimensional sparse regression with application to the UK Biobank.

Affiliations

Department of Statistics, Stanford University, Stanford, CA, United States of America.

Department of Biomedical Data Science, Stanford University, Stanford, CA, United States of America.

Publication information

PLoS Genet. 2020 Oct 23;16(10):e1009141. doi: 10.1371/journal.pgen.1009141. eCollection 2020 Oct.

Abstract

The UK Biobank is a very large, prospective, population-based cohort study across the United Kingdom. It provides unprecedented opportunities for researchers to investigate the relationship between genotypic information and phenotypes of interest. Multiple regression methods, compared with genome-wide association studies (GWAS), have already been shown to greatly improve prediction performance for a variety of phenotypes. In high-dimensional settings, the lasso has, since it was first proposed in statistics, proven to be an effective method for simultaneous variable selection and estimation. However, the large scale and ultrahigh dimensionality of the UK Biobank data pose new challenges for applying the lasso, as many existing algorithms and their implementations do not scale to such large applications. In this paper, we propose a computational framework called batch screening iterative lasso (BASIL) that can take advantage of any existing lasso solver and easily build a scalable solution for very large data, including data that are larger than the memory size. We introduce snpnet, an R package that implements the proposed algorithm on top of glmnet and is optimized for single nucleotide polymorphism (SNP) datasets. It currently supports the ℓ1-penalized linear model, logistic regression, and the Cox model, and also extends to the elastic net with an ℓ1/ℓ2 penalty. We demonstrate results on the UK Biobank dataset, where we achieve competitive predictive performance for all four phenotypes considered (height, body mass index, asthma, high cholesterol) using only a small fraction of the variants compared with other established polygenic risk score methods.
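The abstract does not spell out the steps of BASIL, so the following is only a minimal sketch of the general "screen, fit, check" pattern that a batch screening approach to the lasso could follow. It is not the snpnet implementation and not the snpnet or glmnet API. The assumptions are: candidates are ranked by absolute inner product with the current residual, a small batch is handed to an inner solver, and the lasso optimality (KKT) conditions are then checked on all remaining columns. All names (`lasso_cd`, `screened_lasso`, `batch_size`, `slack`) are hypothetical, the inner solver is a toy coordinate descent standing in for "any existing lasso solver", and a single penalty value `lam` is used rather than a full regularization path.

```python
import random

def soft_threshold(z, t):
    """Shrink z toward zero by t (the proximal step for the l1 penalty)."""
    return max(z - t, 0.0) if z > 0 else min(z + t, 0.0)

def column_dot(X, j, v):
    """Inner product between column j of X (a list of rows) and vector v."""
    return sum(row[j] * v_i for row, v_i in zip(X, v))

def residual(X, y, beta):
    """y - X @ beta, with beta stored sparsely as {column index: coefficient}."""
    return [y_i - sum(b * row[j] for j, b in beta.items()) for row, y_i in zip(X, y)]

def lasso_cd(X, y, cols, lam, warm=None, sweeps=1000, tol=1e-10):
    """Toy inner solver (assumption): coordinate descent on
    (1/2n)||y - Xb||^2 + lam * ||b||_1, restricted to the columns in cols."""
    n = len(y)
    beta = {j: (warm or {}).get(j, 0.0) for j in cols}
    r = residual(X, y, beta)
    scale = {j: sum(row[j] ** 2 for row in X) / n for j in cols}
    for _ in range(sweeps):
        biggest_step = 0.0
        for j in cols:
            if scale[j] == 0.0:
                continue
            rho = column_dot(X, j, r) / n + scale[j] * beta[j]
            step = soft_threshold(rho, lam) / scale[j] - beta[j]
            if step != 0.0:
                for i, row in enumerate(X):
                    r[i] -= step * row[j]  # keep the residual in sync
                beta[j] += step
                biggest_step = max(biggest_step, abs(step))
        if biggest_step < tol:
            break
    return beta

def screened_lasso(X, y, lam, batch_size=2, slack=1e-8):
    """Grow a working set in batches until no excluded column violates KKT."""
    n, p = len(y), len(X[0])
    working, beta = [], {}
    while True:
        r = residual(X, y, beta)
        score = [abs(column_dot(X, j, r)) / n for j in range(p)]
        # KKT check: an excluded column may stay at zero only if score <= lam.
        violators = sorted(
            (j for j in range(p) if j not in working and score[j] > lam + slack),
            key=lambda j: -score[j],
        )
        if not violators:
            return beta, working
        working += violators[:batch_size]  # screening: add the strongest batch
        beta = lasso_cd(X, y, working, lam, warm=beta)  # fit on the subset only

# Synthetic data (assumption): 40 samples, 12 features, 2 truly nonzero effects.
rng = random.Random(0)
n, p, lam = 40, 12, 0.1
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [2.0 * row[0] - 1.5 * row[3] + rng.gauss(0, 0.1) for row in X]

coef, working = screened_lasso(X, y, lam)
full = lasso_cd(X, y, list(range(p)), lam)
print("columns passed to the inner solver:", sorted(working))
print("nonzero coefficients:", {j: round(b, 3) for j, b in coef.items() if b != 0.0})
```

The point of the design is that the inner solver only ever sees the working-set columns, while the full matrix is touched only to compute inner products with the residual, which can be done in a streaming pass. Because the loop stops only once every excluded column satisfies the KKT condition, the result agrees (up to solver tolerance) with fitting the lasso on all columns at once, which is what `full` computes here for comparison.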


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7168/7641476/3951e3f98c05/pgen.1009141.g001.jpg
