Department of Biostatistics, Columbia University, New York, NY 10032, USA.
Am J Hum Genet. 2011 Dec 9;89(6):701-12. doi: 10.1016/j.ajhg.2011.11.003. Epub 2011 Dec 1.
Many sequencing studies are now underway to identify the genetic causes of both Mendelian and complex traits. Exome sequencing has already identified genes harboring variants implicated in several Mendelian traits. The underlying methodology in these studies is a multistep algorithm that filters the variants identified in a small number of affected individuals according to whether they are novel (not yet seen in public resources such as dbSNP), whether they are shared among affected individuals, and what other external functional information is available on them. Although intuitive, these filter-based methods are nonoptimal and do not provide any measure of statistical uncertainty. We describe here a formal statistical approach that has several distinct advantages: (1) it provides fast computation of approximate p values for individual genes, (2) it adjusts for the background variation in each gene, (3) it allows for incorporation of functional or linkage-based information, and (4) it accommodates designs based on both affected relative pairs and unrelated affected individuals. We show via simulations that the proposed approach, used in conjunction with the existing filter-based methods, ranks a disease-relevant gene substantially better than the currently used filter-based approaches alone; this is especially so in the presence of disease locus heterogeneity. We revisit recent studies on three Mendelian diseases and show that the proposed approach ranks the implicated gene first in all three studies, with approximate p values of 10^-6 for the Miller Syndrome gene, 1.0 × 10^-4 for the Freeman-Sheldon Syndrome gene, and 3.5 × 10^-5 for the Kabuki Syndrome gene.
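For orientation, the filter-based baseline described in the abstract (keep novel variants, then look for genes shared among affected individuals) might be sketched roughly as follows. This is only an illustrative sketch of the filtering idea, not the statistical approach proposed in the paper and not any study's actual pipeline. The function names, the `(gene, variant_id)` input layout, and the set of known variant IDs standing in for a resource such as dbSNP are all assumptions made for the example.

```python
from collections import Counter


def count_novel_carriers(variants_by_individual, known_variant_ids):
    """Count, per gene, how many affected individuals carry a novel variant.

    variants_by_individual: dict mapping an individual ID to a list of
        (gene, variant_id) pairs (hypothetical input layout).
    known_variant_ids: set of variant IDs already seen in a public
        resource (a stand-in for a dbSNP lookup).
    """
    counts = Counter()
    for variants in variants_by_individual.values():
        # Genes in which this individual has at least one novel variant.
        genes = {gene for gene, vid in variants if vid not in known_variant_ids}
        # Updating with a set counts each gene once per individual.
        counts.update(genes)
    return dict(counts)


def strict_filter(variants_by_individual, known_variant_ids):
    """Keep only genes with a novel variant in every affected individual."""
    n_affected = len(variants_by_individual)
    counts = count_novel_carriers(variants_by_individual, known_variant_ids)
    return sorted(gene for gene, c in counts.items() if c == n_affected)
```

A filter of this kind returns a candidate list without any p value. Under locus heterogeneity, where not every affected individual carries a variant in the same gene, the strict "shared by all" rule can discard the true gene, which is the weakness the abstract attributes to filter-based methods.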