Wu Steven H, Rodrigo Allen G
Biodesign Institute, Arizona State University, Tempe, AZ, 85287, USA.
Department of Biology, Duke University, Box 90338, Durham, NC, 27708, USA.
BMC Bioinformatics. 2015 Nov 4;16:357. doi: 10.1186/s12859-015-0810-y.
Over the last decade, next-generation sequencing (NGS) has become widely available and is now the sequencing technology of choice for most researchers. Nonetheless, NGS presents a challenge for evolutionary biologists who wish to estimate evolutionary genetic parameters from a mixed sample of unlabelled or untagged individuals, especially when the reconstruction of full-length haplotypes can be unreliable. We propose two novel approaches, least squares estimation (LS) and Approximate Bayesian Computation Markov chain Monte Carlo estimation (ABC-MCMC), to infer evolutionary genetic parameters from a collection of short-read sequences obtained from a mixed sample of anonymous DNA. Both approaches use only the frequencies of nucleotides at each site, without reconstructing either the full-length alignment or the phylogeny.
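As an illustration of the input these approaches rely on, the sketch below tallies per-site nucleotide frequencies from a set of short reads. It is not taken from the authors' code: the function name `sitewise_frequencies`, the `(start, sequence)` read representation, and the assumption that reads are already placed on a shared 0-based coordinate system are all hypothetical choices made for the example.

```python
from collections import Counter

BASES = "ACGT"

def sitewise_frequencies(reads, length):
    """Return one {base: frequency} dict per site.

    reads  : iterable of (start, sequence) pairs; start is a 0-based
             coordinate on a shared reference (an assumption of this sketch).
    length : number of sites in the region of interest.
    """
    counts = [Counter() for _ in range(length)]
    for start, seq in reads:
        for offset, base in enumerate(seq.upper()):
            pos = start + offset
            # Ignore positions outside the region and ambiguous bases.
            if 0 <= pos < length and base in BASES:
                counts[pos][base] += 1

    freqs = []
    for site_counts in counts:
        total = sum(site_counts.values())
        # Sites with no coverage get all-zero frequencies.
        freqs.append({b: (site_counts[b] / total if total else 0.0) for b in BASES})
    return freqs

# Three overlapping reads; no read needs to span the full region.
reads = [(0, "ACGT"), (2, "GTTA"), (1, "CCT")]
freqs = sitewise_frequencies(reads, 6)
```

Each read contributes only to the sites it covers, so no full-length haplotype has to be assembled to obtain these summaries.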
We used simulations to evaluate the performance of these algorithms. Our results demonstrate that LS performs poorly because its bootstrap 95% confidence intervals (CIs) tend to under- or over-estimate the true values of the parameters. In contrast, the 95% Highest Posterior Density (HPD) intervals recovered from ABC-MCMC enclosed the true parameter values at a rate approximately equivalent to that obtained using BEAST, a program that implements Bayesian MCMC estimation of evolutionary parameters using full-length sequences. Because information is lost when sitewise nucleotide frequencies are used alone, the ABC-MCMC 95% HPDs are larger than those obtained by BEAST.
We propose two novel algorithms to estimate evolutionary genetic parameters based on the proportion of each nucleotide at each site. The LS method cannot be recommended as a standalone method for evolutionary parameter estimation. On the other hand, parameters recovered by ABC-MCMC are comparable to those obtained using BEAST, but with larger 95% HPDs. One major advantage of ABC-MCMC is that computational time scales linearly with the number of short-read sequences and is independent of the number of full-length sequences in the original data. This allows us to perform the analysis on NGS datasets with large numbers of short-read fragments. The source code for ABC-MCMC is available at https://github.com/stevenhwu/SF-ABC.
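To make the general idea of ABC-MCMC concrete, here is a minimal likelihood-free MCMC sketch on a toy problem. It is an assumption-laden illustration, not the SF-ABC implementation: the simulator (`simulate_summary`), the single parameter `theta` (a per-read minor-allele probability), the uniform prior on (0, 0.5), the tolerance `eps`, and the choice of summary statistic (mean minor-allele frequency across sites) are all hypothetical. The actual method's model, priors, and summary statistics may differ.

```python
import random

def simulate_summary(theta, n_sites, n_reads, rng):
    """Toy simulator: mean minor-allele frequency over n_sites,
    where each read at a site carries the minor allele with probability theta."""
    total = 0.0
    for _ in range(n_sites):
        minor = sum(1 for _ in range(n_reads) if rng.random() < theta)
        total += minor / n_reads
    return total / n_sites

def abc_mcmc(observed, n_sites, n_reads, n_iter=2000, eps=0.02, step=0.02, seed=1):
    """Likelihood-free MCMC with a uniform prior on (0, 0.5) and a symmetric
    Gaussian proposal, so a proposal is accepted whenever it lies inside the
    prior support and its simulated summary is within eps of the observed one."""
    rng = random.Random(seed)
    # Start near the observed summary so the chain does not begin in a
    # region where no proposal can be accepted (a choice made for this toy).
    theta = min(max(observed, 1e-3), 0.499)
    chain = []
    for _ in range(n_iter):
        proposal = theta + rng.gauss(0.0, step)
        if 0.0 < proposal < 0.5:
            simulated = simulate_summary(proposal, n_sites, n_reads, rng)
            if abs(simulated - observed) < eps:
                theta = proposal
        chain.append(theta)
    return chain

chain = abc_mcmc(observed=0.10, n_sites=50, n_reads=20)
posterior_mean = sum(chain) / len(chain)
```

In this toy, the cost of each iteration is set by the amount of data the simulator generates (`n_sites * n_reads`), not by any full-length alignment. A smaller `eps` gives a closer approximation to the posterior at the price of a lower acceptance rate.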