Kang Guolian, Bi Wenjian, Zhang Hang, Pounds Stanley, Cheng Cheng, Shete Sanjay, Zou Fei, Zhao Yanlong, Zhang Ji-Feng, Yue Weihua
Department of Biostatistics, St. Jude Children's Research Hospital, Memphis, Tennessee 38105
Genetics. 2017 Mar;205(3):1049-1062. doi: 10.1534/genetics.116.192377. Epub 2016 Dec 30.
In many case-control genome-wide association studies (GWAS) and next-generation sequencing (NGS) studies, extensive data are available on secondary traits that may correlate with the primary disease and share common genetic variants with it. Investigating these secondary traits can provide critical insights into disease etiology or pathology and can enhance the GWAS or NGS results. Methods based on logistic regression (LG) were developed for this purpose. However, for the identification of rare variants (RVs), certain inadequacies in the LG models and algorithmic instability can cause severely inflated type I error and significant loss of power when the two traits are correlated and the RV is associated with the disease, especially at stringent significance levels. To address this issue, we propose a novel set-valued (SV) method that models a binary trait by dichotomization of an underlying continuous variable, and we incorporate this dichotomization into the genetic association model as a critical component. Extensive simulations and an analysis of seven secondary traits in a GWAS of benign ethnic neutropenia show that the SV method consistently controls type I error well at stringent significance levels, has larger power than the LG-based methods, and is robust to the effect pattern of the genetic variant (risk or protective), the variant frequency (rare or common), the disease prevalence (rare or common), and the trait distribution. Because of these advantages, we strongly recommend that the SV method be used instead of the LG-based methods for secondary trait analyses in case-control sequencing studies.
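The idea of modeling a binary trait as the dichotomization of an underlying continuous variable can be illustrated with a small simulation. The sketch below is not the authors' SV estimator; it is a generic latent-threshold model under assumptions introduced here for illustration only: additive genotype coding, standard normal noise, and hypothetical values for the minor allele frequency (`maf`), genetic effect (`beta`), and cut-point (`threshold`). The function names are likewise hypothetical.

```python
import random
from statistics import NormalDist


def simulate_binary_trait(n, maf, beta, threshold, seed=0):
    """Simulate genotypes and a binary trait formed by thresholding a latent variable."""
    rng = random.Random(seed)
    genotypes, traits = [], []
    for _ in range(n):
        # Additive genotype coding: number of minor alleles (0, 1, or 2).
        g = (rng.random() < maf) + (rng.random() < maf)
        # Latent continuous variable: genetic effect plus N(0, 1) noise (an assumption).
        latent = beta * g + rng.gauss(0.0, 1.0)
        # Only the dichotomized value is treated as observed.
        genotypes.append(g)
        traits.append(1 if latent > threshold else 0)
    return genotypes, traits


def implied_probability(g, beta, threshold):
    """P(Y = 1 | G = g) implied by the threshold model with N(0, 1) noise."""
    return 1.0 - NormalDist().cdf(threshold - beta * g)


# Illustrative parameter values, not taken from the paper.
BETA, THRESHOLD = 0.5, 1.0
genotypes, traits = simulate_binary_trait(n=100_000, maf=0.3, beta=BETA, threshold=THRESHOLD)

# Compare the empirical case frequency per genotype with the model-implied probability.
empirical = {}
for g in (0, 1, 2):
    observed = [y for x, y in zip(genotypes, traits) if x == g]
    empirical[g] = sum(observed) / len(observed)
    print(g, round(empirical[g], 3), round(implied_probability(g, BETA, THRESHOLD), 3))
```

Under these assumptions the binary trait follows a probit-type relationship with genotype, which is why the empirical frequencies track `implied_probability` closely. How the SV method estimates and tests the genetic effect from such data is described in the paper itself and is not reproduced here.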