State Key Laboratory of Biocontrol and Guangdong Key Laboratory of Plant Resources, Sun Yat-sen University, 135 Xingang West Road, Guangzhou 510275, China.
BMC Genomics. 2013 Aug 7;14:535. doi: 10.1186/1471-2164-14-535.
Next generation sequencing (NGS) data have a high error rate, and the errors are distributed non-uniformly across sites, so estimating DNA polymorphism (θ) accurately from NGS data has been a challenge.
Using computer simulations, we compare two methods of data acquisition: sequencing each diploid individual separately and sequencing the pooled sample. At current NGS error rates, sequencing each individual separately offers little advantage unless the coverage per individual is high (>20X). We therefore propose a new method for estimating θ from pooled samples that have been subjected to two separate rounds of DNA sequencing. Since errors from the two sequencing applications are usually non-overlapping, it is possible to separate low frequency polymorphisms from sequencing errors. Simulation results show that the dual applications method is reliable even when the error rate is high and θ is low.
In studies of natural populations where the sequencing coverage is usually modest (~2X per individual), the dual applications method on pooled samples should be a reasonable choice.
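The reasoning behind the dual applications idea can be illustrated with a small simulation. The sketch below is not the estimator from the paper, which the abstract does not specify. It is a minimal toy model under stated assumptions: two alleles per site, errors that flip a read between them, error "hotspots" drawn independently in each sequencing run, and a hypothetical threshold `min_alt` for calling a site polymorphic. All function and parameter names are illustrative.

```python
import random


def alt_read_count(rng, depth, alt_freq, err):
    """Count reads showing the alternate allele at one site in one run."""
    # A read shows the alternate allele if it carries it and is read correctly,
    # or carries the reference allele and is misread (two-allele assumption).
    p = alt_freq * (1 - err) + (1 - alt_freq) * err
    return sum(rng.random() < p for _ in range(depth))


def simulate(n_sites=2000, n_snps=20, depth=100, maf=0.05,
             base_err=0.005, hot_err=0.05, hot_frac=0.02,
             min_alt=3, seed=1):
    """Return (true SNP sites, single-run calls, dual-run calls) as sets."""
    rng = random.Random(seed)
    snp_sites = set(rng.sample(range(n_sites), n_snps))
    single_calls, dual_calls = set(), set()
    for site in range(n_sites):
        freq = maf if site in snp_sites else 0.0
        counts = []
        for _run in range(2):
            # Assumption: whether a site is an error hotspot is decided
            # independently in each run, so errors rarely coincide.
            err = hot_err if rng.random() < hot_frac else base_err
            counts.append(alt_read_count(rng, depth, freq, err))
        # Single-run calling uses the first run only.
        if counts[0] >= min_alt:
            single_calls.add(site)
        # Dual-run calling requires support in both runs.
        if counts[0] >= min_alt and counts[1] >= min_alt:
            dual_calls.add(site)
    return snp_sites, single_calls, dual_calls


if __name__ == "__main__":
    snps, single, dual = simulate()
    print("true SNP sites:          ", len(snps))
    print("single-run false calls:  ", len(single - snps))
    print("dual-run false calls:    ", len(dual - snps))
    print("dual-run true SNPs found:", len(dual & snps))
```

Requiring support in both runs can only remove calls, so the dual-run false positives are a subset of the single-run ones. The cost is that a true low frequency variant missed by chance in either run is also dropped, which a real estimator of θ would need to correct for.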