Lynch Michael, Bost Darius, Wilson Sade, Maruki Takahiro, Harrison Scott
Department of Biology, Indiana University, Bloomington.
Department of Biology, North Carolina A&T State University.
Genome Biol Evol. 2014 Apr 30;6(5):1210-8. doi: 10.1093/gbe/evu085.
Although pooled-population sequencing has become a widely used approach for estimating allele frequencies, most work has proceeded in the absence of a proper statistical framework. We introduce a self-sufficient, closed-form, maximum-likelihood estimator for allele frequencies that accounts for errors associated with sequencing, and a likelihood-ratio test statistic that provides a simple means for evaluating the null hypothesis of monomorphism. Unbiased estimates of allele frequencies [Formula: see text] (where N is the number of individuals sampled) appear to be unachievable, and near-certain identification of a polymorphism requires a minor-allele frequency [Formula: see text]. A framework is provided for testing for significant differences in allele frequencies between populations, taking into account sampling at the levels of individuals within populations and sequences within pooled samples. Analyses that fail to account for the two tiers of sampling suffer from very large false-positive rates and can become increasingly misleading with increasing depths of sequence coverage. The power to detect significant allele-frequency differences between two populations is very limited unless both the number of sampled individuals and depth of sequencing coverage exceed 100.
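The claim that ignoring the two tiers of sampling inflates false positives, and more so at higher coverage, can be illustrated with a small simulation. The sketch below is not the authors' estimator or test: it assumes diploid individuals that contribute equally to the pool, ignores sequencing error entirely, and uses a plain two-proportion z-test on read counts as a stand-in for a "naive" analysis. All function names and parameter values are hypothetical choices made for illustration.

```python
import math
import random


def two_tier_variance(p, n_individuals, coverage):
    """Variance of a pooled-read allele-frequency estimate (no sequencing error).

    Tier 1: 2N chromosomes drawn from the population (diploids assumed).
    Tier 2: `coverage` reads drawn from the pooled chromosomes.
    """
    chromosomes = 2 * n_individuals
    individual_tier = p * (1 - p) / chromosomes
    read_tier = p * (1 - p) * (1 - 1 / chromosomes) / coverage
    return individual_tier + read_tier


def binomial(rng, n, p):
    """Draw a Binomial(n, p) count using only the standard library."""
    return sum(rng.random() < p for _ in range(n))


def simulate_read_count(rng, p, n_individuals, coverage):
    """Sample individuals into a pool, then sample reads from that pool."""
    chromosomes = 2 * n_individuals
    pool_freq = binomial(rng, chromosomes, p) / chromosomes
    return binomial(rng, coverage, pool_freq)


def naive_false_positive_rate(p, n_individuals, coverage, replicates=1000, seed=1):
    """Fraction of replicates in which a read-only z-test rejects at the 5% level,
    even though both populations share the same true frequency p."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(replicates):
        k1 = simulate_read_count(rng, p, n_individuals, coverage)
        k2 = simulate_read_count(rng, p, n_individuals, coverage)
        pooled = (k1 + k2) / (2 * coverage)
        # Naive standard error: treats reads as direct draws from the population.
        se = math.sqrt(pooled * (1 - pooled) * 2 / coverage)
        if se > 0 and abs(k1 - k2) / coverage / se > 1.96:
            rejections += 1
    return rejections / replicates


if __name__ == "__main__":
    for coverage in (50, 1000):
        rate = naive_false_positive_rate(0.3, n_individuals=50, coverage=coverage)
        print(f"coverage={coverage}: naive false-positive rate = {rate:.3f}")
```

Under these assumptions the naive standard error shrinks as coverage grows, while the individual-level term in `two_tier_variance` stays fixed at p(1-p)/2N. The rejection rate under the null therefore rises with coverage rather than staying near the nominal 5%.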