Sung Yun Ju, Korthauer Keegan D, Swartz Michael D, Engelman Corinne D
Division of Biostatistics, Washington University School of Medicine, St. Louis, Missouri, United States of America.
Genet Epidemiol. 2014 Sep;38 Suppl 1(0 1):S13-20. doi: 10.1002/gepi.21820.
Genetic Analysis Workshop 18 provided whole-genome sequence data in a pedigree-based sample and longitudinal phenotype data for hypertension and related traits, presenting an excellent opportunity for evaluating analysis choices. We summarize the nine contributions to the working group on collapsing methods, which evaluated various approaches for the analysis of multiple rare variants. One contributor defined a variant prioritization scheme, whereas the remaining eight contributors evaluated statistical methods for association analysis. Six contributors chose the gene as the genomic region for collapsing variants, whereas three contributors chose nonoverlapping sliding windows across the entire genome. Statistical methods spanned most of the published methods, including well-established burden tests, variance-components-type tests, and recently developed hybrid approaches. Lesser-known methods, such as functional principal components analysis, higher criticism, and homozygosity association, and some newly introduced methods were also used. We found that the performance of these methods depended on the characteristics of the genomic region, such as the effect size and direction of the variants under consideration. Except for MAP4 and FLT3, the performance of all statistical methods in identifying rare causal variants was disappointingly poor, with overall power almost identical to the type I error rate. This poor performance may have arisen from a combination of (1) small sample size, (2) small effects of most of the causal variants, each explaining a small fraction of variance, (3) use of incomplete annotation information, and (4) linkage disequilibrium between causal variants in a gene and noncausal variants in nearby genes. Our findings demonstrate the challenges in analyzing rare variants identified from sequence data.