Department of Integrative Biology, University of California, Berkeley, California 94720.
Genetics. 2013 Nov;195(3):979-92. doi: 10.1534/genetics.113.154740. Epub 2013 Aug 26.
Over the past few years, new high-throughput DNA sequencing technologies have dramatically increased sequencing speed and reduced sequencing costs. However, the use of these technologies is often challenged by errors and biases associated with the bioinformatic methods used to analyze the data. In particular, naïve methods for identifying polymorphic sites and inferring genotypes can bias downstream analyses. Recently, explicit modeling of genotype probability distributions has been proposed as a way to take genotype-call uncertainty into account. Based on this idea, we propose a novel method for quantifying population genetic differentiation from next-generation sequencing data. In addition, we present a strategy for investigating population structure via principal components analysis. Through extensive simulations, we compare the proposed method to approaches based on genotype calling and demonstrate a marked improvement in estimation accuracy under a wide range of conditions. We apply the method to a large-scale genomic data set of domesticated and wild silkworms sequenced at low coverage. We find that we can infer the fine-scale genetic structure of the sampled individuals, suggesting that the method is useful for investigating the genetic relationships of populations sampled at low coverage.
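To make the idea of modeling genotype probability distributions concrete, the following is a minimal Python sketch. It is not the method proposed in the paper. It assumes a simple binomial read-sampling model with a fixed per-read error rate, a Hardy-Weinberg prior based on a known alternate-allele frequency, and biallelic diploid sites. The function names (`genotype_likelihoods`, `genotype_posteriors`, `expected_genotype`) and the default error rate are hypothetical, chosen for illustration only.

```python
from math import comb


def genotype_likelihoods(n_alt, depth, error=0.01):
    """Return P(reads | G) for G = 0, 1, 2 copies of the alternate allele.

    Assumption: each read is an independent draw from one of the two
    chromosomes and is miscalled with probability `error`.
    """
    liks = []
    for g in (0, 1, 2):
        # Probability that a single read shows the alternate allele given G = g
        p_alt = (g / 2) * (1 - error) + (1 - g / 2) * error
        liks.append(comb(depth, n_alt) * p_alt**n_alt * (1 - p_alt) ** (depth - n_alt))
    return liks


def genotype_posteriors(n_alt, depth, alt_freq, error=0.01):
    """Return P(G | reads) using a Hardy-Weinberg prior (an assumption)."""
    f = alt_freq
    prior = [(1 - f) ** 2, 2 * f * (1 - f), f**2]
    liks = genotype_likelihoods(n_alt, depth, error)
    joint = [lik * p for lik, p in zip(liks, prior)]
    total = sum(joint)
    return [j / total for j in joint]


def expected_genotype(posteriors):
    """Posterior mean number of alternate alleles (between 0 and 2)."""
    return posteriors[1] + 2 * posteriors[2]


if __name__ == "__main__":
    # One read, and it carries the alternate allele: a hard call would pick
    # a single genotype, while the posterior keeps the uncertainty.
    post = genotype_posteriors(n_alt=1, depth=1, alt_freq=0.2)
    print("posteriors:", post)
    print("expected genotype:", expected_genotype(post))
```

At low coverage the posterior is typically spread over more than one genotype, so carrying the full distribution (or its mean) into downstream statistics avoids treating an uncertain call as if it were observed data. With zero reads the posterior reduces to the prior, which is the behavior one would want for missing data.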