Department of Biostatistics, College of Public Health, University of Kentucky, Lexington, Kentucky, United States of America.
Biostatistics and Computational Biology, National Institute of Environmental Health Sciences, National Institutes of Health, Research Triangle Park, North Carolina, United States of America.
PLoS Comput Biol. 2020 Apr 14;16(4):e1007819. doi: 10.1371/journal.pcbi.1007819. eCollection 2020 Apr.
Historically, the majority of statistical association methods have been designed assuming the availability of SNP-level information. However, modern genetic and sequencing data present new challenges for accessing and sharing genotype-phenotype datasets, including the cost of data management and difficulties in consolidating records across research groups. These issues make methods based on SNP-level summary statistics particularly appealing. The most common way of combining statistics is a sum of SNP-level squared scores, possibly weighted, as in burden tests for rare variants. The overall significance of the resulting statistic is evaluated using its distribution under the null hypothesis. Here, we demonstrate that this basic approach can be substantially improved by decorrelating the scores prior to their addition, resulting in remarkable power gains in the situations most commonly encountered in practice, namely, under heterogeneity of effect sizes and diversity among pairwise LD values. In these situations, the power of the traditional test, based on the added squared scores, quickly reaches a ceiling as the number of variants increases. Thus, the traditional approach does not benefit from information potentially contained in any additional SNPs, while our decorrelation by orthogonal transformation (DOT) method yields a steady gain in power. We present theoretical and computational analyses of both approaches and reveal the causes behind the sometimes dramatic differences in their respective powers. We showcase DOT by analyzing breast cancer and cleft lip data, in which our method strengthened levels of previously reported associations and implied the possibility of multiple new alleles that jointly confer disease risk.
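To make the contrast between the two statistics concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the decorrelating transformation is the symmetric inverse square root of the LD correlation matrix, obtained from an eigendecomposition; the abstract does not spell this out. The names `dot_statistic`, `chi2_sf_even_df`, `sigma`, and `z` are hypothetical, and the LD matrix and scores are made-up illustrative numbers, not data from the study.

```python
import math

import numpy as np


def dot_statistic(z, sigma):
    """Decorrelate scores z and return (sum of squared decorrelated scores, H).

    Assumption: H = sigma^(-1/2), built from the eigendecomposition of the
    LD correlation matrix sigma (must be positive definite).
    """
    eigvals, eigvecs = np.linalg.eigh(sigma)
    h = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T
    x = h @ z  # decorrelated scores
    return float(x @ x), h


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable, even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form used here needs a positive even df")
    half = x / 2.0
    terms = (half ** i / math.factorial(i) for i in range(df // 2))
    return math.exp(-half) * sum(terms)


# Hypothetical LD correlation matrix for four SNPs (illustrative values).
sigma = np.array([
    [1.0, 0.5, 0.2, 0.1],
    [0.5, 1.0, 0.2, 0.1],
    [0.2, 0.2, 1.0, 0.4],
    [0.1, 0.1, 0.4, 1.0],
])
# Hypothetical SNP-level association z-scores (illustrative values).
z = np.array([2.1, -0.4, 1.8, -1.2])

# Traditional statistic: plain sum of squared, still-correlated scores.
traditional_stat = float(z @ z)

# Decorrelated statistic: sum of squares after applying H.
dot_stat, h = dot_statistic(z, sigma)

# Under the null, independent unit-variance scores give a chi-square
# statistic with one degree of freedom per SNP.
dot_p_value = chi2_sf_even_df(dot_stat, df=len(z))
```

Under these assumptions the decorrelated scores have identity covariance, so their sum of squares can be referred to a chi-square distribution with as many degrees of freedom as there are SNPs. The plain sum of squared correlated scores instead follows a weighted sum of chi-square variables under the null, with weights given by the eigenvalues of the LD matrix, which is why the sketch does not attach a p-value to `traditional_stat`.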