IGESS: a statistical approach to integrating individual-level genotype data and summary statistics in genome-wide association studies.

Affiliations

School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an, China.

Department of Mathematics, Hong Kong Baptist University, Hong Kong.

Publication information

Bioinformatics. 2017 Sep 15;33(18):2882-2889. doi: 10.1093/bioinformatics/btx314.

DOI:10.1093/bioinformatics/btx314

PMID:28498950

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC5860575/

Abstract

MOTIVATION

Results from genome-wide association studies (GWAS) suggest that a complex phenotype is often affected by many variants with small effects, a property known as 'polygenicity'. Tens of thousands of samples are often required to ensure sufficient statistical power to identify these variants. However, a research group can often obtain approval to access individual-level genotype data only for a limited sample size (e.g. a few hundred to a few thousand samples). Meanwhile, summary statistics generated using single-variant-based analysis are becoming publicly available, and the sample sizes associated with these summary statistics datasets are usually quite large. How to make the most efficient use of these abundant existing data resources largely remains an open question.

RESULTS

In this study, we propose a statistical approach, IGESS, that increases the statistical power of identifying risk variants and improves the accuracy of risk prediction by integrating individual-level genotype data and summary statistics. An efficient algorithm based on variational inference is developed to handle the genome-wide analysis. Through comprehensive simulation studies, we demonstrated the advantages of IGESS over methods that take either individual-level data or summary statistics data alone as input. We applied IGESS to perform an integrative analysis of Crohn's disease data from WTCCC and summary statistics from other studies. IGESS was able to significantly increase the statistical power of identifying risk variants and improve the risk prediction accuracy from 63.2% (±0.4%) to 69.4% (±0.1%) using about 240 000 variants.

AVAILABILITY AND IMPLEMENTATION

The IGESS software is available at https://github.com/daviddaigithub/IGESS .

CONTACT

zbxu@xjtu.edu.cn or xwan@comp.hkbu.edu.hk or eeyang@hkbu.edu.hk.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
