Human Genetics Center, University of Texas School of Public Health, Houston, TX 77030, USA.
Genome Res. 2011 Jul;21(7):1099-108. doi: 10.1101/gr.115998.110. Epub 2011 Apr 26.
Genome-wide association studies (GWAS) have become the primary approach for identifying genes whose common variants influence complex diseases. Despite considerable progress, the common variants identified by GWAS account for only a small fraction of disease heritability and are unlikely to explain the majority of phenotypic variation in common diseases. A potential source of this missing heritability is the contribution of rare variants. Next-generation sequencing technologies will detect millions of novel rare variants, but the resulting data have three defining features: a large number of rare variants, a high proportion of sequence errors, and a large proportion of missing data. These features raise challenges for testing the association of rare variants with phenotypes of interest. In this study, we use a genome continuum model and functional principal components as a general principle for developing novel and powerful association analysis methods designed for resequencing data. We use simulations to calculate the type I error rates and power of nine alternative statistics: two functional principal component analysis (FPCA)-based statistics, the multivariate principal component analysis (MPCA)-based statistic, the weighted sum (WSS), the variable-threshold (VT) method, the generalized T², the collapsing method, the CMC method, and individual variant tests. We also examine the impact of sequence errors on their type I error rates. Finally, we apply the nine statistics to the published resequencing data set for ANGPTL4 from the Dallas Heart Study. We find that the FPCA-based statistics have higher power to detect associations of rare variants and a stronger ability to filter out sequence errors than the other seven methods.
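The abstract names several competing rare-variant tests. As a rough illustration only, and not the authors' implementation, the sketch below shows two simplified versions in Python: a CAST-style collapsing test, and a Hotelling T²-type test on principal component scores of smoothed genotype profiles, loosely in the spirit of an FPCA-based statistic. The function names, the Fourier-basis smoothing, the choices of n_basis and n_components, and the binary case/control coding are all illustrative assumptions, not details taken from the paper.

```python
import numpy as np
from scipy import stats

def collapsing_test(genotypes, phenotype):
    """CAST-style collapsing test (illustrative sketch, not the paper's code).

    genotypes: (n_individuals, n_variants) array of rare-allele counts (0/1/2).
    phenotype: (n_individuals,) array of 0 (control) / 1 (case).
    Collapses all rare variants into a single carrier indicator and compares
    carrier frequency between cases and controls with Fisher's exact test.
    """
    carrier = (genotypes.sum(axis=1) > 0).astype(int)
    table = [[np.sum((carrier == 1) & (phenotype == 1)),
              np.sum((carrier == 0) & (phenotype == 1))],
             [np.sum((carrier == 1) & (phenotype == 0)),
              np.sum((carrier == 0) & (phenotype == 0))]]
    _, p = stats.fisher_exact(table)
    return p

def pc_t2_test(genotypes, phenotype, positions, n_basis=5, n_components=3):
    """Hotelling T² test on PC scores of smoothed genotype profiles.

    A crude stand-in for an FPCA-based statistic: each individual's genotype
    profile is expanded in a Fourier basis over normalized genomic position,
    the basis coefficients are reduced by PCA, and case/control mean scores
    are compared with Hotelling's T² (converted to an F statistic).
    """
    t = (positions - positions.min()) / (positions.max() - positions.min() + 1e-12)
    basis = [np.ones_like(t)]
    k = 1
    while len(basis) < n_basis:
        basis.append(np.sin(2 * np.pi * k * t))
        basis.append(np.cos(2 * np.pi * k * t))
        k += 1
    B = np.column_stack(basis[:n_basis])                  # (n_variants, n_basis)
    coefs, *_ = np.linalg.lstsq(B, genotypes.T, rcond=None)
    coefs = coefs.T                                       # (n_individuals, n_basis)
    coefs -= coefs.mean(axis=0)
    _, _, Vt = np.linalg.svd(coefs, full_matrices=False)
    scores = coefs @ Vt[:n_components].T                  # PC scores per individual
    x, y = scores[phenotype == 1], scores[phenotype == 0]
    n1, n0 = len(x), len(y)
    diff = x.mean(axis=0) - y.mean(axis=0)
    Sp = ((n1 - 1) * np.cov(x, rowvar=False) +
          (n0 - 1) * np.cov(y, rowvar=False)) / (n1 + n0 - 2)
    T2 = (n1 * n0 / (n1 + n0)) * diff @ np.linalg.solve(Sp, diff)
    df1, df2 = n_components, n1 + n0 - n_components - 1
    F = T2 * df2 / (df1 * (n1 + n0 - 2))
    return T2, stats.f.sf(F, df1, df2)

# Toy usage on simulated data (assumed sample sizes and allele frequency).
rng = np.random.default_rng(0)
G = rng.binomial(1, 0.02, size=(500, 20))    # 500 individuals, 20 rare variants
y = rng.integers(0, 2, size=500)             # random case/control labels
pos = np.sort(rng.integers(0, 10_000, size=20)).astype(float)
print(collapsing_test(G, y), pc_t2_test(G, y, pos))
```

Under the null simulation above, both p-values should be roughly uniform; the sketch is only meant to make the contrast concrete between collapsing all rare variants into one carrier indicator and summarizing the genotype profile as a small number of smooth-function scores.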