Museum of Natural Science, Louisiana State University, Baton Rouge, Louisiana, United States of America.
PLoS One. 2007 Jan 17;2(1):e160. doi: 10.1371/journal.pone.0000160.
BACKGROUND: Theoretical work suggests that data from multiple nuclear loci provide better estimates of population genetic parameters than do data from single loci, but just how many loci are needed, and how much sequence is required from each, have been little explored.
METHODOLOGY/PRINCIPAL FINDINGS: To investigate how much data is required to estimate the population genetic parameter theta (θ = 4Nₑμ) accurately under ideal circumstances, we simulated datasets of DNA sequences under three values of theta per site (0.1, 0.01, 0.001), varying both the total number of base pairs sequenced per individual and the number of equal-length loci into which that total was divided. From these datasets we estimated theta using the maximum likelihood coalescent framework implemented in the computer program Migrate. Our results corroborated the theoretical expectation that increasing the number of loci improves the accuracy of the estimate more than increasing the sequence length at single loci. However, when the value of theta was low (0.001), the per-locus sequence length was also important for estimating theta accurately, something that has not been emphasized in previous work.
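The simulation design described above can be illustrated with a minimal sketch. It is not the authors' pipeline: it swaps in Watterson's moment estimator for the maximum likelihood estimates produced by Migrate, and it assumes an infinite-sites mutation model, no recombination within loci, and free recombination between loci. All function names, the sample size of 10 sequences, and the number of replicates are hypothetical choices made for illustration.

```python
import random


def poisson(rng, mean):
    """Draw a Poisson variate by counting unit-rate exponential arrivals before `mean`."""
    count = 0
    t = rng.expovariate(1.0)
    while t < mean:
        count += 1
        t += rng.expovariate(1.0)
    return count


def segregating_sites(rng, n_samples, theta_locus):
    """Simulate the number of segregating sites at one non-recombining locus."""
    total_length = 0.0
    for k in range(n_samples, 1, -1):
        # With k lineages, the waiting time (in units of 2N_e generations)
        # to the next coalescence is exponential with rate k(k-1)/2.
        wait = rng.expovariate(k * (k - 1) / 2.0)
        total_length += k * wait
    # Mutations fall on the tree at rate theta/2 per unit of branch length.
    return poisson(rng, theta_locus / 2.0 * total_length)


def watterson_theta(seg_sites, n_samples):
    """Watterson's estimator of theta for a whole locus (assumed stand-in for ML)."""
    a1 = sum(1.0 / i for i in range(1, n_samples))
    return seg_sites / a1


def multilocus_estimates(theta_site, total_bp, n_loci, n_samples=10, reps=200, seed=1):
    """Return `reps` per-site theta estimates, each averaged over `n_loci` equal-length loci."""
    rng = random.Random(seed)
    locus_bp = total_bp // n_loci
    estimates = []
    for _ in range(reps):
        per_locus = [
            watterson_theta(
                segregating_sites(rng, n_samples, theta_site * locus_bp), n_samples
            ) / locus_bp
            for _ in range(n_loci)
        ]
        estimates.append(sum(per_locus) / n_loci)
    return estimates


def squared_cv(values):
    """Squared coefficient of variation: sample variance divided by squared mean."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
    return var / mean ** 2


if __name__ == "__main__":
    # Same total sequencing effort (10,000 bp) split into 1, 5, or 25 loci.
    for n_loci in (1, 5, 25):
        est = multilocus_estimates(theta_site=0.01, total_bp=10_000, n_loci=n_loci)
        print(n_loci, "loci: squared CV =", round(squared_cv(est), 4))
```

Holding the total number of base pairs fixed while changing `n_loci` mirrors the trade-off examined in the study: under these assumptions, the spread of the estimates shrinks much faster when the same effort is divided among more independent loci.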
CONCLUSIONS/SIGNIFICANCE: Accurate estimation of theta required data from at least 25 independently evolving loci. Beyond this number, sampling additional loci reduced the squared coefficient of variation of the coalescent estimates only slightly relative to the extra effort required.
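The diminishing return from extra loci, and the added importance of per-locus length when theta is low, can both be seen in a closed-form calculation. The sketch below is an illustration only: it uses the classical variance of the number of segregating sites at a non-recombining locus, Var(S) = a1·theta + a2·theta², and therefore describes Watterson's estimator rather than the maximum likelihood estimates reported in the study. The function names and the sample size of 10 sequences are assumptions.

```python
def harmonic_sums(n_samples):
    """Return a1 = sum(1/i) and a2 = sum(1/i^2) for i = 1 .. n_samples - 1."""
    a1 = sum(1.0 / i for i in range(1, n_samples))
    a2 = sum(1.0 / i ** 2 for i in range(1, n_samples))
    return a1, a2


def expected_squared_cv(theta_site, locus_bp, n_loci, n_samples=10):
    """Expected squared CV of Watterson's estimator averaged over independent loci."""
    a1, a2 = harmonic_sums(n_samples)
    theta_locus = theta_site * locus_bp
    # First term: mutational (Poisson) noise, which shrinks with locus length.
    # Second term: genealogical noise, which locus length cannot reduce.
    per_locus = 1.0 / (a1 * theta_locus) + a2 / a1 ** 2
    # Averaging over independent loci divides the variance by the number of loci.
    return per_locus / n_loci


if __name__ == "__main__":
    for theta_site in (0.1, 0.01, 0.001):
        for n_loci in (1, 5, 25, 50, 100):
            cv2 = expected_squared_cv(theta_site, locus_bp=500, n_loci=n_loci)
            print(f"theta={theta_site}  loci={n_loci:3d}  squared CV={cv2:.4f}")
```

Because the squared coefficient of variation falls as 1/(number of loci), each additional locus buys less than the previous one. The mutational term scales as 1/(theta × locus length), so it dominates only when theta is small, which is consistent with the finding that per-locus length matters most at theta = 0.001.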