School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an, China.
Department of Mathematics, Hong Kong Baptist University, Hong Kong.
Bioinformatics. 2017 Sep 15;33(18):2882-2889. doi: 10.1093/bioinformatics/btx314.
Results from genome-wide association studies (GWAS) suggest that a complex phenotype is often affected by many variants with small effects, a property known as 'polygenicity'. Tens of thousands of samples are often required to ensure sufficient statistical power to identify these small-effect variants. However, a research group can often obtain approval to access individual-level genotype data only for a limited sample size (e.g. a few hundred or a few thousand samples). Meanwhile, summary statistics generated by single-variant analysis are becoming publicly available, and the sample sizes behind these summary statistics datasets are usually quite large. How to make the most efficient use of these abundant existing data resources remains largely an open question.
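To make the notion of single-variant summary statistics concrete, the following is a minimal sketch of how such statistics could be computed from individual-level data. It assumes a quantitative trait, genotypes coded as allele counts, no covariates, and a normal approximation for the p-values; the function name `single_variant_summary` is hypothetical, and real GWAS pipelines typically also adjust for covariates and population structure.

```python
import math
import numpy as np

def single_variant_summary(genotypes, phenotype):
    """Marginal least-squares regression of the phenotype on each variant.

    genotypes: (n_samples, n_variants) array of allele counts (0, 1, 2).
    phenotype: (n_samples,) array holding a quantitative trait.
    Returns per-variant effect estimates, standard errors, and two-sided
    p-values based on a normal approximation to the test statistic.
    Assumes every variant is polymorphic (non-zero genotype variance).
    """
    X = np.asarray(genotypes, dtype=float)
    y = np.asarray(phenotype, dtype=float)
    n = X.shape[0]

    # Centering both sides is equivalent to fitting an intercept.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()

    sxx = (Xc ** 2).sum(axis=0)             # per-variant sum of squares
    beta = Xc.T @ yc / sxx                  # per-variant slope estimates
    resid_ss = (yc ** 2).sum() - beta ** 2 * sxx
    se = np.sqrt(resid_ss / (n - 2) / sxx)  # standard error of each slope

    z = beta / se
    # Two-sided p-value under a standard normal reference distribution.
    pvals = np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])
    return beta, se, pvals
```

Each variant is tested on its own, which is why only the per-variant estimates, standard errors, and p-values need to be shared rather than the genotypes themselves.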
In this study, we propose a statistical approach, IGESS, to increase the statistical power of identifying risk variants and to improve the accuracy of risk prediction by integrating individual-level genotype data and summary statistics. An efficient algorithm based on variational inference is developed to handle genome-wide analysis. Through comprehensive simulation studies, we demonstrate the advantages of IGESS over methods that take either individual-level data or summary statistics alone as input. We applied IGESS to perform an integrative analysis of Crohn's disease data from WTCCC and summary statistics from other studies. IGESS significantly increased the statistical power of identifying risk variants and improved the risk prediction accuracy from 63.2% (±0.4%) to 69.4% (±0.1%) using about 240 000 variants.
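The abstract does not spell out the IGESS model, so the following is not the IGESS algorithm. It is only a generic illustration of one way a summary-statistics p-value can inform the probability that a variant is associated: a two-groups mixture in which null p-values are uniform on (0, 1) and non-null p-values follow a Beta(alpha, 1) density with 0 < alpha < 1. The function name `posterior_nonnull` and the parameters `pi1` (prior proportion of associated variants) and `alpha` are assumptions made for this sketch.

```python
import numpy as np

def posterior_nonnull(pvals, pi1, alpha):
    """Posterior probability that each variant is non-null.

    Assumed two-groups model (illustrative only):
      null p-values     ~ Uniform(0, 1), density 1
      non-null p-values ~ Beta(alpha, 1), density alpha * p**(alpha - 1)
    pi1 is the prior proportion of non-null variants.
    """
    # Clip to avoid an infinite density at p == 0.
    p = np.clip(np.asarray(pvals, dtype=float), 1e-300, 1.0)
    f1 = alpha * p ** (alpha - 1.0)  # non-null density at each p-value
    return pi1 * f1 / (pi1 * f1 + (1.0 - pi1))
```

Under these assumptions, smaller p-values yield larger posterior probabilities, so evidence from a large summary-statistics dataset could sharpen the ranking of variants obtained from a small individual-level sample.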
The IGESS software is available at https://github.com/daviddaigithub/IGESS .
zbxu@xjtu.edu.cn or xwan@comp.hkbu.edu.hk or eeyang@hkbu.edu.hk.
Supplementary data are available at Bioinformatics online.