Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts.
BMC Genet. 2013 Nov 7;14:108. doi: 10.1186/1471-2156-14-108.
The advent of genome-wide association studies has led to many novel disease-SNP associations, opening the door to focused study of their biological underpinnings. Because these associations are important, numerous statistical methods have been devoted to analyzing them. However, fewer methods have attempted to associate entire genes or genomic regions with outcomes, which is potentially more useful knowledge from a biological perspective, and those methods currently implemented are often permutation-based.
One property of some permutation-based tests is that their power varies depending on whether significant markers lie in regions of linkage disequilibrium (LD), which we show from a theoretical perspective. We therefore develop two methods for quantifying the degree of association between a genomic region and an outcome, neither of whose power varies with LD structure. One method uses dimension reduction to "filter" redundant information when significant LD exists in the region, while the other, called the summary-statistic test, controls for LD by scaling marker Z-statistics using knowledge of the correlation matrix of the markers. An advantage of the latter test is that it does not require the original data, only the markers' Z-statistics from univariate regressions and an estimate of the correlation structure of the markers. We also show how to modify the test to protect the type I error rate when the correlation structure of the markers is misspecified. We apply these methods to sequence data on oral cleft and compare our results to previously proposed gene tests, in particular permutation-based ones. Because the correlation structure between markers can be specified inaccurately, we also evaluate the versatility of the modified summary-statistic test.
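To make the idea of scaling Z-statistics by the marker correlation matrix concrete, the following is a minimal sketch and not the authors' exact statistic. It assumes that, under the null, the vector of marker Z-statistics is multivariate normal with mean zero and covariance equal to the marker correlation matrix, so that the quadratic form Z'R^{-1}Z is chi-square with as many degrees of freedom as markers. The function names (`solve`, `chi2_sf`, `region_test`) and the `ridge` argument are hypothetical; the ridge term is only a simple stand-in for guarding against a misspecified correlation matrix and is not the modification described in the paper.

```python
import math


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("correlation matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Closed form for even df: exp(-x/2) * sum_{i < df/2} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Closed form for odd df: normal tail plus a finite series.
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return total


def region_test(z, corr, ridge=0.0):
    """Quadratic-form region test from Z-statistics and a correlation matrix.

    ridge > 0 adds a constant to the diagonal (an illustrative assumption,
    not the paper's modification). If corr is correct, this shrinks the
    statistic, so the chi-square reference is conservative.
    """
    n = len(z)
    adj = [[corr[i][j] + (ridge if i == j else 0.0) for j in range(n)]
           for i in range(n)]
    w = solve(adj, z)                      # w = (R + ridge * I)^{-1} z
    stat = sum(zi * wi for zi, wi in zip(z, w))
    return stat, chi2_sf(stat, n)


# Two markers with the same Z-statistic, in strong LD versus independent.
z = [2.0, 2.0]
stat_ld, p_ld = region_test(z, [[1.0, 0.9], [0.9, 1.0]])
stat_ind, p_ind = region_test(z, [[1.0, 0.0], [0.0, 1.0]])
print(stat_ld, p_ld)
print(stat_ind, p_ind)
```

In this toy example the two markers in strong LD yield a smaller statistic than two independent markers with the same Z-statistics, because the second marker largely repeats the information in the first. Only the Z-statistics and a correlation estimate (for example, one taken from a reference panel) are needed as input.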
We find a significant association in the sequence data between the 8q24 region and oral cleft using our dimension reduction approach, and a borderline significant association using the summary-statistic approach. We also implement the summary-statistic test using Z-statistics from an already-published GWAS of chronic obstructive pulmonary disease (COPD) and a correlation structure obtained from HapMap. Because the correlation structure is assumed to be imperfectly known, we experiment with the modified version of this test.