Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA.
Department of Population Health Sciences, University of Utah, Salt Lake City, UT 84108, USA.
Am J Hum Genet. 2019 May 2;104(5):802-814. doi: 10.1016/j.ajhg.2019.03.002. Epub 2019 Apr 12.
Whole-genome sequencing (WGS) studies are widely conducted to identify rare variants associated with human diseases and disease-related traits. Classical single-marker association analyses have limited power for rare variants, so variant-set-based analyses are commonly used instead. However, existing variant-set-based approaches require genetic regions to be pre-specified for analysis; hence, they are not directly applicable to WGS data because of the large number of intergenic and intronic regions, which contain a massive number of non-coding variants. The commonly used sliding-window method requires the pre-specification of fixed window sizes, which are often unknown a priori, are difficult to specify in practice, and are limiting given that the sizes of genetic-association regions are likely to vary across the genome and across phenotypes. We propose a computationally efficient and dynamic scan-statistic method (Scan the Genome [SCANG]) for analyzing WGS data; this method flexibly detects the sizes and the locations of rare-variant association regions without requiring a fixed window size to be specified in advance. The proposed method controls the genome-wise type I error rate and accounts for the linkage disequilibrium among genetic variants. It allows the detected sizes of rare-variant association regions to vary across the genome. Through extensive simulation studies that consider a wide variety of scenarios, we show that SCANG substantially outperforms several alternative methods for detecting rare-variant associations while controlling the genome-wise type I error rate. We illustrate SCANG by analyzing the WGS lipids data from the Atherosclerosis Risk in Communities (ARIC) study.
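The idea of scanning with windows of several sizes while controlling a genome-wise error rate can be illustrated with a small sketch. This is not the SCANG implementation: the abstract does not specify the test statistic or the threshold procedure, so the sketch assumes a simple burden score (the squared correlation between a window's summed allele count and the phenotype, scaled by sample size) and a permutation-based threshold on the maximum score over all windows. Permuting the phenotype while keeping the genotype matrix fixed preserves the correlation among variants. The function names `scan_scores`, `genome_wise_threshold`, and `detect_regions` are hypothetical.

```python
import numpy as np


def scan_scores(G, y, sizes):
    """Score every window of every size in `sizes`.

    G: (n_samples, n_variants) genotype matrix of allele counts.
    y: (n_samples,) quantitative phenotype.
    Returns {size: array of scores, one per window start}.
    Assumed statistic: n * r^2 between window burden and phenotype.
    """
    n, _ = G.shape
    yc = y - y.mean()
    # Prefix sums along variants: any window's burden is a difference of two columns.
    C = np.concatenate([np.zeros((n, 1)), np.cumsum(G, axis=1)], axis=1)
    out = {}
    for size in sizes:
        B = C[:, size:] - C[:, :-size]      # burden per sample, per window
        Bc = B - B.mean(axis=0)
        num = (Bc.T @ yc) ** 2
        den = (Bc ** 2).sum(axis=0) * (yc @ yc)
        # Windows with no variation in burden get a score of 0.
        r2 = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
        out[size] = n * r2
    return out


def genome_wise_threshold(G, y, sizes, alpha=0.05, n_perm=200, seed=0):
    """Empirical (1 - alpha) quantile of the maximum score under permutation."""
    rng = np.random.default_rng(seed)
    maxima = []
    for _ in range(n_perm):
        perm_scores = scan_scores(G, rng.permutation(y), sizes)
        maxima.append(max(s.max() for s in perm_scores.values()))
    return float(np.quantile(maxima, 1 - alpha))


def detect_regions(G, y, sizes, alpha=0.05, n_perm=200, seed=0):
    """Return the threshold and all windows (start, end, score) exceeding it."""
    thr = genome_wise_threshold(G, y, sizes, alpha, n_perm, seed)
    scores = scan_scores(G, y, sizes)
    hits = [
        (int(start), int(start) + size, float(s[start]))
        for size, s in scores.items()
        for start in np.flatnonzero(s > thr)
    ]
    # Strongest windows first; end index is exclusive.
    return thr, sorted(hits, key=lambda h: -h[2])
```

Because the threshold is taken over the maximum across all window sizes and positions, a window is reported only if it exceeds what the whole scan would produce by chance, which is the sense in which the error rate in this sketch is genome-wise. Overlapping significant windows are returned as-is; merging or pruning them is left out.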