Shin Sunyoung, Keleş Sündüz
Department of Statistics, Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, USA.
Stat Biosci. 2017 Jun;9(1):50-72. doi: 10.1007/s12561-016-9154-z. Epub 2016 Aug 12.
Although genome-wide association studies (GWAS) have been successful at finding thousands of disease-associated genetic variants (GVs), identifying causal variants and elucidating the mechanisms by which genotypes influence phenotypes remain critical open questions. A key challenge is that a large percentage of disease-associated GVs are potential regulatory variants located in noncoding regions, making them difficult to interpret. Recent research efforts focus on going beyond annotating GVs by integrating functional annotation data with GWAS to prioritize GVs. However, the applicability of these approaches is challenged by the high dimensionality and heterogeneity of functional annotation data. Furthermore, existing methods often assume global associations of GVs with annotation data. This strong assumption is susceptible to violations for GVs involved in many complex diseases. To address these issues, we develop a general regression framework, named Annotation Regression for GWAS (ARoG). ARoG is based on a finite mixture of linear regressions model in which GWAS association measures are viewed as responses and functional annotations as predictors. This mixture framework addresses the heterogeneity of effects of GVs by grouping them into clusters, and the high dimensionality of the functional annotations by enabling annotation selection within each cluster. ARoG further employs permutation testing to evaluate the significance of selected annotations. Computational experiments indicate that ARoG can discover distinct associations between disease risk and functional annotations. Application of ARoG to autism and schizophrenia data from the Psychiatric Genomics Consortium led to the identification of GVs that significantly affect interactions of several transcription factors with DNA as potential mechanisms contributing to these disorders.
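To make the modeling idea concrete, the following is a minimal sketch of a finite mixture of linear regressions fitted by expectation-maximization, with a response vector standing in for GWAS association measures and a predictor matrix standing in for functional annotations. It is an illustration under stated assumptions, not the ARoG implementation: the function name `fit_mixture_of_regressions`, its parameters, and the simulated data are hypothetical, and the per-cluster annotation selection and permutation testing described in the abstract are deliberately omitted (a small ridge term is used only for numerical stability).

```python
import numpy as np


def fit_mixture_of_regressions(X, y, n_clusters=2, n_iter=200, ridge=1e-6, seed=0):
    """EM for a Gaussian mixture of linear regressions (hypothetical helper).

    Assumption: no sparsity penalty is applied, so this does not perform
    the annotation selection that the abstract attributes to ARoG.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Xd = np.column_stack([np.ones(n), X])  # design matrix with intercept

    # Random soft initial cluster responsibilities, one row per variant.
    resp = rng.dirichlet(np.ones(n_clusters), size=n)
    betas = np.zeros((n_clusters, p + 1))
    sigma2 = np.ones(n_clusters)
    weights = np.full(n_clusters, 1.0 / n_clusters)

    for _ in range(n_iter):
        # M-step: weighted least squares within each cluster.
        for k in range(n_clusters):
            w = resp[:, k]
            XtW = Xd.T * w
            A = XtW @ Xd + ridge * np.eye(p + 1)
            betas[k] = np.linalg.solve(A, XtW @ y)
            resid = y - Xd @ betas[k]
            sigma2[k] = max((w * resid ** 2).sum() / (w.sum() + 1e-12), 1e-8)
            weights[k] = max(w.mean(), 1e-12)

        # E-step: posterior probability of each cluster for each variant.
        resid = y[:, None] - Xd @ betas.T
        log_dens = -0.5 * (np.log(2 * np.pi * sigma2) + resid ** 2 / sigma2)
        log_dens += np.log(weights)
        log_dens -= log_dens.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(log_dens)
        resp /= resp.sum(axis=1, keepdims=True)

    return betas, sigma2, weights, resp


# Simulated data (purely illustrative): two clusters of variants whose
# association measures depend on different annotations.
rng = np.random.default_rng(1)
n = 400
X = rng.normal(size=(n, 3))
z = rng.integers(0, 2, size=n)
y = np.where(z == 0, 2.0 * X[:, 0], -2.0 * X[:, 0] + X[:, 2])
y = y + 0.1 * rng.normal(size=n)

betas, sigma2, weights, resp = fit_mixture_of_regressions(X, y)
clusters = resp.argmax(axis=1)  # hard cluster assignment per variant
```

Each row of `betas` holds the intercept and annotation coefficients for one cluster, which is how the mixture allows annotation effects to differ across groups of variants instead of assuming a single global association. EM can converge to a local optimum, so in practice several random starts would be compared.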