Human Genetics Center and Division of Biostatistics, The University of Texas School of Public Health, Houston, 77030, USA.
Am J Hum Genet. 2012 Jun 8;90(6):1028-45. doi: 10.1016/j.ajhg.2012.04.022.
An individual's disease risk is determined by the combined action of common variants, inherited from remote ancestors, that segregated within the population and rare variants, inherited from recent ancestors, that segregated mainly within pedigrees. Next-generation sequencing (NGS) technologies generate high-dimensional data that allow a nearly complete evaluation of genetic variation. Despite their promise, NGS technologies also have notable limitations: high error rates, enrichment of rare variants, and a large proportion of missing values, as well as the fact that most current analytical methods are designed for population-based association studies. To meet the analytical challenges raised by NGS, we propose a general framework for sequence-based association studies that can use various types of family and unrelated-individual data sampled from any population structure, together with a universal procedure that can transform any population-based association test statistic for use in family-based association tests. We develop family-based versions of functional principal-component analysis (FPCA) with or without smoothing, the generalized T² statistic, the combined multivariate and collapsing (CMC) method, and single-marker association test statistics. Through extensive simulations, we demonstrate that the family-based smoothed FPCA (SFPCA) has correct type I error rates and much more power than other population-based or family-based association analysis methods to detect association of (1) common variants, (2) rare variants, (3) both common and rare variants, and (4) variants with opposite directions of effect. The proposed statistics are applied to two data sets with pedigree structures. The results show that the smoothed FPCA yields much smaller p values than the other statistics.
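To make the collapsing idea behind the CMC method and the generalized T² statistic concrete, the following is a minimal population-based sketch in Python. It is not the family-based procedure proposed in the paper and it omits any correction for relatedness. The function names `collapse_rare` and `hotelling_t2`, the 0/1/2 minor-allele coding, and the frequency threshold are illustrative assumptions, not details taken from the article.

```python
import numpy as np

def collapse_rare(genotypes, maf_threshold=0.05):
    """Keep common variants as columns and collapse rare variants into one indicator.

    Assumes genotypes is an (individuals x variants) array of minor-allele
    counts coded 0/1/2 (an assumption made for this sketch).
    """
    genotypes = np.asarray(genotypes, dtype=float)
    maf = genotypes.mean(axis=0) / 2.0        # sample minor-allele frequency
    rare = maf < maf_threshold
    common_part = genotypes[:, ~rare]
    if rare.any():
        # Indicator is 1 if the individual carries at least one rare allele.
        indicator = (genotypes[:, rare].sum(axis=1) > 0).astype(float)
        return np.column_stack([common_part, indicator])
    return common_part

def hotelling_t2(x_cases, x_controls):
    """Two-sample Hotelling T^2 for unrelated cases and controls."""
    x_cases = np.asarray(x_cases, dtype=float)
    x_controls = np.asarray(x_controls, dtype=float)
    n1, n2 = len(x_cases), len(x_controls)
    diff = x_cases.mean(axis=0) - x_controls.mean(axis=0)
    pooled = ((n1 - 1) * np.cov(x_cases, rowvar=False)
              + (n2 - 1) * np.cov(x_controls, rowvar=False)) / (n1 + n2 - 2)
    pooled = np.atleast_2d(pooled)
    # Pseudo-inverse guards against a singular pooled covariance matrix.
    return (n1 * n2) / (n1 + n2) * diff @ np.linalg.pinv(pooled) @ diff
```

In this sketch, the collapsed matrix from `collapse_rare` would be split into case and control rows and passed to `hotelling_t2`. Extending such a statistic to pedigree data requires accounting for the correlation among relatives, which is the subject of the framework described above.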