Luedtke Alexander, Powers Scott, Petersen Ashley, Sitarik Alexandra, Bekmetjev Airat, Tintle Nathan L
Division of Applied Mathematics, Brown University, 182 George Street, Providence, RI 02912, USA.
Department of Statistics and Operations Research, 318 Hanes Hall, CB 3260, University of North Carolina, Chapel Hill, NC 27599-3260, USA.
BMC Proc. 2011 Nov 29;5 Suppl 9(Suppl 9):S119. doi: 10.1186/1753-6561-5-S9-S119.
A number of rare variant statistical methods have been proposed for analysis of the impending wave of next-generation sequencing data. To date, there are few direct comparisons of these methods on real sequence data. Furthermore, there is a strong need for practical advice on the proper analytic strategies for rare variant analysis. We compare four recently proposed rare variant methods (combined multivariate and collapsing, weighted sum, proportion regression, and cumulative minor allele test) on simulated phenotype and next-generation sequencing data as part of Genetic Analysis Workshop 17. Overall, we find that all analyzed methods have serious practical limitations in identifying causal genes. Specifically, no method has more than a 5% true discovery rate (percentage of truly causal genes among all those identified as significantly associated with the phenotype). Further exploration shows that all methods suffer from inflated false-positive error rates (chance that a noncausal gene will be identified as associated with the phenotype) because of population stratification and gametic phase disequilibrium between noncausal SNPs and causal SNPs. Furthermore, the observed true-positive rates (chance that a truly causal gene will be identified as significantly associated with the phenotype) for the four methods were very low (<19%). The combination of larger than anticipated false-positive rates, low true-positive rates, and only about 1% of all genes being causal yields poor discriminatory ability for all four methods. Gametic phase disequilibrium and population stratification are important areas for further research in the analysis of rare variant data.
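The way the three quantities in the abstract combine into a low true discovery rate can be illustrated with a short calculation. This is a minimal sketch, not the authors' analysis: the function name `true_discovery_rate` is hypothetical, the causal fraction (1%) and the true-positive rate (19%, used here as an upper bound) are taken from the abstract, and the false-positive rates in the loop are illustrative assumptions rather than reported values.

```python
def true_discovery_rate(causal_fraction, tpr, fpr):
    """Expected share of significant genes that are truly causal.

    Expected true hits:  causal_fraction * tpr
    Expected false hits: (1 - causal_fraction) * fpr
    """
    true_hits = causal_fraction * tpr
    false_hits = (1.0 - causal_fraction) * fpr
    return true_hits / (true_hits + false_hits)


causal_fraction = 0.01  # "about 1% of all genes being causal" (from the abstract)
tpr = 0.19              # upper bound on the observed true-positive rate (from the abstract)

# Hypothetical false-positive rates, chosen only for illustration.
for fpr in (0.01, 0.05, 0.10):
    tdr = true_discovery_rate(causal_fraction, tpr, fpr)
    print(f"FPR={fpr:.2f} -> expected TDR={tdr:.3f}")
```

Under these assumptions, even a false-positive rate at the conventional 5% level gives an expected true discovery rate of roughly 3.7%, and any inflation above that level pushes it lower still, because noncausal genes outnumber causal genes by about 99 to 1.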