Fast analysis of biobank-size data and meta-analysis using the BGLR R-package.

Author information

Pérez-Rodríguez Paulino, de Los Campos Gustavo, Wu Hao, Vazquez Ana I, Jones Kyle

Affiliations

Colegio de Postgraduados, Montecillo, Estado de México 56230, México.

Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, MI 48824, USA.

Publication information

G3 (Bethesda). 2025 Apr 17;15(4). doi: 10.1093/g3journal/jkae288.

DOI:10.1093/g3journal/jkae288

PMID:39657738

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12005161/

Abstract

Analyzing human genomic data from biobanks and large-scale genetic evaluations often requires fitting models in which the sample size exceeds the number of DNA markers used (n > p). For instance, developing polygenic scores for humans and genomic prediction for genetic evaluations of agricultural species may require fitting models involving a few thousand SNPs using data with hundreds of thousands of samples. In such cases, computations based on sufficient statistics are more efficient than those based on individual genotype-phenotype data. Additionally, software that admits sufficient statistics as inputs can be used to analyze data from multiple sources jointly without the need to share individual genotype-phenotype data. Therefore, we developed functionality within the BGLR R-package that generates posterior samples for Bayesian shrinkage and variable selection models from sufficient statistics. In this article, we present an overview of the new methods incorporated in the BGLR R-package, demonstrate the use of the new software through simple examples, and provide several computational benchmarks. We also present a real-data example using data from the UK-Biobank, All of Us, and the Hispanic Community Health Study/Study of Latinos cohort. The example demonstrates how a joint analysis of multiple cohorts can be implemented without sharing individual genotype-phenotype data, and how a combined analysis can improve the prediction accuracy of polygenic scores for Hispanics, a group severely under-represented in genome-wide association study data.
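The idea of working from sufficient statistics can be illustrated with a small sketch. For a Gaussian linear model, the cross-products X'X and X'y (together with n) are enough to estimate marker effects, and they can be added across cohorts. The Python code below is only an illustration under stated assumptions: it is not the BGLR API (BGLR is an R package), the function names `suff_stats`, `combine`, and `ridge_from_stats` are hypothetical, the toy data are made up, and a simple ridge-regression solve stands in for the Bayesian samplers described in the abstract.

```python
# Illustrative sketch only: ridge regression computed from sufficient
# statistics (X'X, X'y, n). Function names and toy data are hypothetical.

def suff_stats(X, y):
    """Return (X'X, X'y, n) for one cohort; X is a list of genotype rows."""
    p = len(X[0])
    XtX = [[sum(row[j] * row[k] for row in X) for k in range(p)] for j in range(p)]
    Xty = [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(p)]
    return XtX, Xty, len(X)

def combine(stats):
    """Add the sufficient statistics of several cohorts element-wise."""
    p = len(stats[0][1])
    XtX = [[sum(s[0][j][k] for s in stats) for k in range(p)] for j in range(p)]
    Xty = [sum(s[1][j] for s in stats) for j in range(p)]
    n = sum(s[2] for s in stats)
    return XtX, Xty, n

def ridge_from_stats(XtX, Xty, lam):
    """Solve (X'X + lam*I) b = X'y by Gauss-Jordan elimination with pivoting."""
    p = len(Xty)
    # Augmented matrix [X'X + lam*I | X'y]
    A = [[XtX[j][k] + (lam if j == k else 0.0) for k in range(p)] + [Xty[j]]
         for j in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[j][p] / A[j][j] for j in range(p)]

# Two toy cohorts: genotypes coded 0/1/2 at two markers, made-up phenotypes.
Xa = [[0, 1], [1, 0], [2, 1], [1, 2]]
ya = [0.5, 1.0, 2.5, 2.0]
Xb = [[2, 2], [0, 0], [1, 1]]
yb = [3.0, 0.0, 1.5]

# Each cohort shares only its cross-products, never individual-level rows.
combined = combine([suff_stats(Xa, ya), suff_stats(Xb, yb)])
# Reference: statistics computed from the pooled individual-level data.
pooled = suff_stats(Xa + Xb, ya + yb)

lam = 1.0  # ridge penalty, chosen arbitrarily for the illustration
b_combined = ridge_from_stats(combined[0], combined[1], lam)
b_pooled = ridge_from_stats(pooled[0], pooled[1], lam)
print(b_combined, b_pooled)
```

In this sketch the estimates from the summed cohort statistics match those from the pooled individual-level data, and the cost of the solve depends on the number of markers p rather than on the sample size n, which is the regime (n > p) the abstract describes.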

Similar articles

Multitrait Bayesian shrinkage and variable selection models with the BGLR-R package.

Genetics. 2022 Aug 30;222(1). doi: 10.1093/genetics/iyac112.

Genome-wide regression and prediction with the BGLR statistical package.

Genetics. 2014 Oct;198(2):483-95. doi: 10.1534/genetics.114.164442. Epub 2014 Jul 9.

Efficient Implementation of Penalized Regression for Genetic Risk Prediction.

Genetics. 2019 May;212(1):65-74. doi: 10.1534/genetics.119.302019. Epub 2019 Feb 26.

Fine mapping and accurate prediction of complex traits using Bayesian Variable Selection models applied to biobank-size data.

Eur J Hum Genet. 2023 Mar;31(3):313-320. doi: 10.1038/s41431-022-01135-5. Epub 2022 Jul 19.

Global Biobank Engine: enabling genotype-phenotype browsing for biobank summary statistics.

Bioinformatics. 2019 Jul 15;35(14):2495-2497. doi: 10.1093/bioinformatics/bty999.

Improved polygenic prediction by Bayesian multiple regression on summary statistics.

Nat Commun. 2019 Nov 8;10(1):5086. doi: 10.1038/s41467-019-12653-0.

Improving GWAS discovery and genomic prediction accuracy in biobank data.

Proc Natl Acad Sci U S A. 2022 Aug 2;119(31):e2121279119. doi: 10.1073/pnas.2121279119. Epub 2022 Jul 29.

GbyE: an integrated tool for genome widely association study and genome selection based on genetic by environmental interaction.

BMC Genomics. 2024 Apr 19;25(1):386. doi: 10.1186/s12864-024-10310-5.

A fast and scalable framework for large-scale and ultrahigh-dimensional sparse regression with application to the UK Biobank.

PLoS Genet. 2020 Oct 23;16(10):e1009141. doi: 10.1371/journal.pgen.1009141. eCollection 2020 Oct.

Cited by

A Framework Integrating GWAS and Genomic Selection to Enhance Prediction Accuracy of Economical Traits in Common Carp.

Int J Mol Sci. 2025 Jul 21;26(14):7009. doi: 10.3390/ijms26147009.

Biobanks in GENETICS and G3: tackling the statistical challenges.

Genetics. 2025 Apr 17;229(4). doi: 10.1093/genetics/iyaf046.

Biobanks in GENETICS and G3: tackling the statistical challenges.

G3 (Bethesda). 2025 Apr 17;15(4). doi: 10.1093/g3journal/jkaf060.

References

Bayesian hierarchical hypothesis testing in large-scale genome-wide association analysis.

Genetics. 2024 Nov 19;228(4). doi: 10.1093/genetics/iyae164.

A simple new approach to variable selection in regression, with application to genetic fine mapping.

J R Stat Soc Series B Stat Methodol. 2020 Dec;82(5):1273-1300. doi: 10.1111/rssb.12388. Epub 2020 Jul 10.

The construction of cross-population polygenic risk scores using transfer learning.

Am J Hum Genet. 2022 Nov 3;109(11):1998-2008. doi: 10.1016/j.ajhg.2022.09.010. Epub 2022 Oct 13.

A saturated map of common genetic variants associated with human height.

Nature. 2022 Oct;610(7933):704-712. doi: 10.1038/s41586-022-05275-y. Epub 2022 Oct 12.

Multitrait Bayesian shrinkage and variable selection models with the BGLR-R package.

Genetics. 2022 Aug 30;222(1). doi: 10.1093/genetics/iyac112.

Fine mapping and accurate prediction of complex traits using Bayesian Variable Selection models applied to biobank-size data.

Eur J Hum Genet. 2023 Mar;31(3):313-320. doi: 10.1038/s41431-022-01135-5. Epub 2022 Jul 19.

Large uncertainty in individual polygenic risk score estimation impacts PRS-based risk stratification.

Nat Genet. 2022 Jan;54(1):30-39. doi: 10.1038/s41588-021-00961-5. Epub 2021 Dec 20.

MegaLMM: Mega-scale linear mixed models for genomic predictions with thousands of traits.

Genome Biol. 2021 Jul 23;22(1):213. doi: 10.1186/s13059-021-02416-w.

LDpred2: better, faster, stronger.

Bioinformatics. 2021 Apr 1;36(22-23):5424-5431. doi: 10.1093/bioinformatics/btaa1029.

Accurate and Scalable Construction of Polygenic Scores in Large Biobank Data Sets.

Am J Hum Genet. 2020 May 7;106(5):679-693. doi: 10.1016/j.ajhg.2020.03.013. Epub 2020 Apr 23.
