Hou Lin, Sun Ning, Mane Shrikant, Sayward Fred, Rajeevan Nallakkandi, Cheung Kei-Hoi, Cho Kelly, Pyarajan Saiju, Aslan Mihaela, Miller Perry, Harvey Philip D, Gaziano J Michael, Concato John, Zhao Hongyu
Clinical Epidemiology Research Center (CERC), Veterans Affairs (VA) Cooperative Studies Program, VA Connecticut Healthcare System, West Haven, Connecticut, United States of America.
Department of Biostatistics, Yale School of Public Health, New Haven, Connecticut, United States of America.
Genet Epidemiol. 2017 Feb;41(2):152-162. doi: 10.1002/gepi.22027. Epub 2016 Dec 26.
A key step in genomic studies is to assess high-throughput measurements across millions of markers for each participant's DNA, using either microarrays or sequencing techniques. Accurate genotype calling is essential for downstream statistical analysis of genotype-phenotype associations, and next-generation sequencing (NGS) has recently become a more common approach in genomic studies. How the accuracy of variant calling in NGS-based studies affects downstream association analysis has not, however, been studied using empirical data in which both microarrays and NGS were available. In this article, we investigate the impact of variant calling errors on the statistical power to identify associations between single-nucleotide variants and disease, and between multiple rare variants and disease. Both differential and nondifferential genotyping errors are considered. Our results show that the power of burden tests for rare variants is strongly influenced by the specificity of variant calling, but is rather robust with regard to sensitivity. Using the variant calling accuracies estimated from a substudy of a Cooperative Studies Program project conducted by the Department of Veterans Affairs, we show that the power of association tests is mostly retained with commonly adopted variant calling pipelines. An R package, GWAS.PC, is provided for power analysis that accounts for genotyping errors (http://zhaocenter.org/software/).
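The contrast between specificity and sensitivity can be illustrated with a small power calculation. The Python sketch below is not the GWAS.PC package and does not reproduce the authors' method. It assumes a simplified burden test that compares rare-variant carrier frequencies between cases and controls with a two-proportion z-test, and it assumes nondifferential calling errors summarized by a single region-level sensitivity and specificity. The function names, carrier frequencies, and sample sizes are hypothetical choices made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def observed_carrier_freq(p_true, sensitivity, specificity):
    """Carrier frequency after calling errors (illustrative assumption).

    True carriers are called with probability `sensitivity`;
    true non-carriers are falsely called with probability 1 - specificity.
    """
    return sensitivity * p_true + (1.0 - specificity) * (1.0 - p_true)


def burden_test_power(p_case, p_control, n_case, n_control,
                      sensitivity=1.0, specificity=1.0, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test on carrier status.

    The same sensitivity and specificity are applied to cases and controls,
    i.e. the errors are nondifferential. Normal approximation throughout.
    """
    q_case = observed_carrier_freq(p_case, sensitivity, specificity)
    q_control = observed_carrier_freq(p_control, sensitivity, specificity)

    # Standard error under the null uses the pooled observed frequency.
    pooled = (n_case * q_case + n_control * q_control) / (n_case + n_control)
    se_null = sqrt(pooled * (1.0 - pooled) * (1.0 / n_case + 1.0 / n_control))

    # Standard error under the alternative uses group-specific frequencies.
    se_alt = sqrt(q_case * (1.0 - q_case) / n_case
                  + q_control * (1.0 - q_control) / n_control)

    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    diff = q_case - q_control

    # Probability that the test statistic falls in either rejection region.
    upper = 1.0 - std_normal.cdf((z_crit * se_null - diff) / se_alt)
    lower = std_normal.cdf((-z_crit * se_null - diff) / se_alt)
    return upper + lower


if __name__ == "__main__":
    # Hypothetical scenario: 1% of controls and 2% of cases carry a rare variant.
    scenario = dict(p_case=0.02, p_control=0.01, n_case=2000, n_control=2000)
    for label, sens, spec in [("perfect calling", 1.00, 1.00),
                              ("sensitivity 0.95", 0.95, 1.00),
                              ("specificity 0.95", 1.00, 0.95)]:
        power = burden_test_power(sensitivity=sens, specificity=spec, **scenario)
        print(f"{label}: power = {power:.3f}")
```

Under these assumptions, a loss in sensitivity scales both carrier frequencies down proportionally, whereas a loss in specificity adds the same false-positive carriers to both groups and dilutes a small true difference. This is one intuitive reading of why burden tests would be more sensitive to specificity. In practice, region-level specificity depends on per-site specificity and on the number of sites aggregated, which this sketch does not model.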