Program in Bioinformatics and Integrative Biology, University of Massachusetts Medical School, Worcester, MA, USA; Swiss Institute of Bioinformatics, Lausanne, Switzerland.
Front Genet. 2013 Nov 11;4:235. doi: 10.3389/fgene.2013.00235. eCollection 2013.
With the increasing availability and quality of whole-genome population data, various methods of population genetic inference are being used to identify and quantify recent population-level selective events. Although such methods have proliferated, the type-I and type-II error rates of many proposed statistics have not been well described. Moreover, the performance of these statistics is often not evaluated under different biologically relevant scenarios (e.g., population size change, population structure), nor for the effect of differing data sizes (i.e., genomic vs. sub-genomic). The absence of this information makes it difficult to evaluate newly available statistics relative to one another, and therefore difficult to choose the proper toolset for a given empirical analysis. We here describe and compare the performance of four widely used tests of selection: SweepFinder, SweeD, OmegaPlus, and iHS. To address the above questions, we use simulated data spanning a variety of selection coefficients and beneficial mutation rates. We demonstrate that the LD-based OmegaPlus performs best in terms of power to reject the neutral model under both equilibrium and non-equilibrium conditions, an important result regarding the effectiveness of linkage disequilibrium-based statistics relative to site frequency spectrum-based statistics. The results presented here ought to serve as a useful guide for future empirical studies, and provide guidance on the choice of statistic depending on the history of the population under consideration. Moreover, the parameter space investigated and the type-I and type-II error rates calculated represent a natural benchmark by which future statistics may be assessed.
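As an illustration of how type-I error and power are commonly estimated from simulations, the following is a minimal Python sketch. It is not the procedure used in this study, which the abstract does not specify. It assumes that a test statistic (for example, the maximum score along a simulated region) has already been computed for each neutral replicate and each sweep replicate, and that larger values indicate stronger evidence for selection. The function names `neutral_threshold` and `rejection_rate`, and all numbers in the usage lines, are hypothetical placeholders.

```python
def neutral_threshold(neutral_scores, alpha=0.05):
    """Return a cutoff from the empirical null distribution.

    The cutoff is chosen so that, absent ties, at most a fraction
    `alpha` of the neutral replicates score strictly above it.
    """
    ordered = sorted(neutral_scores)
    n = len(ordered)
    # Number of neutral replicates allowed to exceed the cutoff.
    allowed = int(alpha * n)
    k = max(0, n - 1 - allowed)
    return ordered[k]


def rejection_rate(scores, threshold):
    """Fraction of replicates whose score strictly exceeds the threshold."""
    return sum(s > threshold for s in scores) / len(scores)


if __name__ == "__main__":
    # Placeholder scores, NOT results from the paper.
    neutral = list(range(100))      # stand-in for neutral replicates
    sweep = list(range(90, 110))    # stand-in for sweep replicates

    cutoff = neutral_threshold(neutral, alpha=0.05)
    type_1 = rejection_rate(neutral, cutoff)   # false positive rate
    power = rejection_rate(sweep, cutoff)      # true positive rate
    type_2 = 1 - power                         # false negative rate
    print(cutoff, type_1, power, type_2)
```

Under a non-equilibrium scenario, the same calculation would use neutral replicates simulated under the relevant demographic model, so that the cutoff reflects the appropriate null distribution.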