Department of Mathematics, Hong Kong Baptist University, Hong Kong.
School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an, China.
Bioinformatics. 2018 Aug 15;34(16):2788-2796. doi: 10.1093/bioinformatics/bty187.
Genome-wide association studies (GWAS) have identified thousands of risk variants underlying complex phenotypes (quantitative traits and diseases). However, two major challenges remain in deepening our understanding of the genetic architecture of complex phenotypes. First, the majority of GWAS hits lie in non-coding regions, and their biological interpretation is still unclear. Second, accumulating evidence from GWAS suggests that complex traits are polygenic, i.e. a complex trait is often affected by many variants with small or moderate effects, and a large proportion of risk variants with small effects remain unknown.
The availability of functional annotation data enables us to address both challenges. In this study, we propose a latent sparse mixed model (LSMM) to integrate functional annotations with GWAS data. LSMM not only increases the statistical power to identify risk variants, but also offers more biological insight by detecting relevant functional annotations. To make LSMM scalable to millions of variants and hundreds of functional annotations, we developed an efficient variational expectation-maximization algorithm for model parameter estimation and statistical inference. We first conducted comprehensive simulation studies to evaluate the performance of LSMM. We then applied it to analyze 30 GWAS of complex phenotypes, integrated with nine genic category annotations and 127 cell-type specific functional annotations from the Roadmap project. The results demonstrate that our method has more statistical power than conventional methods and can help researchers achieve a deeper understanding of the genetic architecture of these complex phenotypes.
The LSMM software is available at https://github.com/mingjingsi/LSMM.
Supplementary data are available at Bioinformatics online.