Xu Zheng, Duan Qing, Yan Song, Chen Wei, Li Mingyao, Lange Ethan, Li Yun
Department of Biostatistics, Department of Genetics, Department of Computer Science.
Department of Genetics, Curriculum in Bioinformatics and Computational Biology, Department of Statistics, University of North Carolina, Chapel Hill, NC 27599, USA.
Bioinformatics. 2015 Aug 1;31(15):2434-42. doi: 10.1093/bioinformatics/btv168. Epub 2015 Mar 24.
Imputation of individual-level genotypes at untyped markers using an external reference panel of genotyped or sequenced individuals has become standard practice in genetic association studies. Direct imputation of summary statistics can also be valuable, for example in meta-analyses where individual-level genotype data are not available. Two methods (DIST and ImpG-Summary/LD) that assume a multivariate Gaussian distribution for the association summary statistics have been proposed for imputing association summary statistics. However, both methods assume that the correlations between association summary statistics are the same as the correlations between the corresponding genotypes. This assumption can be violated in the presence of confounding covariates.
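The multivariate Gaussian assumption mentioned above can be illustrated with a minimal sketch. If the z-scores at typed and untyped markers are jointly Gaussian with mean zero and correlation matrix R, the conditional mean of the untyped z-scores given the typed ones is R_ut R_tt^{-1} z_t. The function name `impute_z`, its arguments, and the ridge term `ridge` are illustrative assumptions and are not taken from DIST, ImpG-Summary/LD, or DISSCO; the regularization value is a placeholder, not a recommendation.

```python
import numpy as np


def impute_z(z_typed, R_tt, R_ut, ridge=0.1):
    """Conditional-mean imputation of z-scores under z ~ N(0, R).

    z_typed : (t,) z-scores observed at typed markers
    R_tt    : (t, t) correlation among typed markers (from a reference panel)
    R_ut    : (u, t) correlation between untyped and typed markers
    ridge   : hypothetical regularization added to the diagonal of R_tt
              to stabilize the inversion (an assumption of this sketch)
    """
    z_typed = np.asarray(z_typed, dtype=float)
    R_tt = np.asarray(R_tt, dtype=float)
    R_ut = np.asarray(R_ut, dtype=float)

    # Regularized typed-marker correlation matrix (symmetric).
    A = R_tt + ridge * np.eye(R_tt.shape[0])

    # weights = R_ut @ inv(A), computed with a linear solve instead of
    # an explicit inverse.
    weights = np.linalg.solve(A, R_ut.T).T

    # Conditional mean E[z_untyped | z_typed].
    return weights @ z_typed
```

In this sketch the choice of R is the only place where the correlation-versus-partial-correlation distinction enters: the existing methods would plug in genotype correlations, whereas a covariate-aware approach would plug in partial correlations.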
We show analytically that, in the absence of covariates, the correlation among association summary statistics is indeed the same as that among the corresponding genotypes, which provides a theoretical justification for the recently proposed methods. We further prove that, in the presence of covariates, the correlation among association summary statistics becomes the partial correlation of the corresponding genotypes controlling for covariates. We therefore develop direct imputation of summary statistics allowing covariates (DISSCO).
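As a rough illustration of the partial correlation referred to above, one standard way to compute it is to regress each genotype on the covariates (plus an intercept) and correlate the residuals. This is a generic sketch of that textbook construction, not the DISSCO implementation; the function name `partial_correlation` and the input layout (samples in rows) are assumptions made for the example.

```python
import numpy as np


def partial_correlation(G, C):
    """Partial correlation among columns of G, controlling for C.

    G : (n, m) genotype matrix, one column per marker
    C : (n,) or (n, k) covariate matrix (e.g. ancestry proportions)
    """
    G = np.asarray(G, dtype=float)
    C = np.asarray(C, dtype=float)
    if C.ndim == 1:
        C = C[:, None]

    n = G.shape[0]
    # Design matrix: intercept plus covariates.
    X = np.column_stack([np.ones(n), C])

    # Least-squares fit of every genotype column on the covariates.
    beta, *_ = np.linalg.lstsq(X, G, rcond=None)

    # Residual genotypes with the covariate effects removed.
    resid = G - X @ beta

    # Pearson correlation of the residuals is the partial correlation.
    return np.corrcoef(resid, rowvar=False)
```

With no covariates beyond the intercept, the residuals are just mean-centered genotypes, so the result reduces to the ordinary genotype correlation, which matches the covariate-free case described in the abstract.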
We consider two real-life scenarios where the difference between correlation and partial correlation is likely to matter in practice: (i) association studies in admixed populations; (ii) association studies in the presence of other confounding covariate(s). Application of DISSCO to real datasets under both scenarios shows at least comparable, if not better, performance compared with existing correlation-based methods, particularly for lower frequency variants. For example, DISSCO can reduce the absolute deviation from the truth by 3.9-15.2% for variants with minor allele frequency <5%.