Phillips Jarrett D, French Steven H, Hanner Robert H, Gillis Daniel J
School of Computer Science, University of Guelph, Guelph, Ontario, Canada.
Department of Integrative Biology, Biodiversity Institute of Ontario, University of Guelph, Guelph, Ontario, Canada.
PeerJ Comput Sci. 2020 Jan 6;6:e243. doi: 10.7717/peerj-cs.243. eCollection 2020.
Assessing levels of standing genetic variation within species requires robust sampling if specimens are to be identified accurately with molecular techniques such as DNA barcoding; however, statistical estimators of what constitutes a robust sample are currently lacking. Such estimates are needed because most species are represented by only one or a few sequences in existing databases and can safely be assumed to be undersampled. Unfortunately, the sample sizes of 5 to 10 specimens per species typically seen in DNA barcoding studies are often insufficient to capture within-species genetic diversity. Here, we introduce a novel iterative extrapolation simulation algorithm of haplotype accumulation curves, called HACSim (Haplotype Accumulation Curve Simulator), that can be employed to calculate the likely sample sizes needed to observe the full range of DNA barcode haplotype variation that exists for a species. Using uniform and non-uniform haplotype frequency distributions, the notion of sampling sufficiency (the sample size at which sampling accuracy is maximized and above which no new sampling information is likely to be gained) can be gleaned. HACSim can be employed in two primary ways to estimate specimen sample sizes: (1) to simulate haplotype sampling in hypothetical species, and (2) to simulate haplotype sampling in real species mined from public reference sequence databases such as the Barcode of Life Data Systems (BOLD) or GenBank, for any genomic marker of interest. While our algorithm is globally convergent, its runtime depends heavily on the initial sample size and on the skewness of the corresponding haplotype frequency distribution.
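The idea of iteratively extrapolating a haplotype accumulation curve can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the update rule, so the linear rescaling step used here (multiply the current sample size by the ratio of total to observed haplotypes) is an assumption chosen for illustration, as are the function names, the 95% recovery target, and the number of permutations.

```python
import math
import random


def mean_haplotypes_observed(n, probs, perms, rng):
    """Average count of distinct haplotypes seen when n specimens are drawn.

    Each permutation draws n specimens with replacement according to the
    haplotype frequencies in `probs` (hypothetical input for this sketch).
    """
    labels = range(len(probs))
    total = 0
    for _ in range(perms):
        draws = rng.choices(labels, weights=probs, k=n)
        total += len(set(draws))
    return total / perms


def estimate_sample_size(n_start, probs, target=0.95, perms=1000,
                         max_iter=100, seed=1):
    """Grow the sample size until the simulated curve recovers `target`
    of all haplotypes on average. Returns (sample size, mean haplotypes)."""
    rng = random.Random(seed)
    h_star = len(probs)  # total number of haplotypes assumed to exist
    n = n_start
    for _ in range(max_iter):
        h_mean = mean_haplotypes_observed(n, probs, perms, rng)
        if h_mean / h_star >= target:
            return n, h_mean
        # Assumed extrapolation rule: rescale n by total/observed haplotypes,
        # always moving forward by at least one specimen.
        n = max(n + 1, math.ceil(n * h_star / h_mean))
    raise RuntimeError("no convergence within max_iter iterations")


if __name__ == "__main__":
    uniform = [0.1] * 10
    skewed = [0.5, 0.2, 0.1, 0.05, 0.05, 0.04, 0.03, 0.01, 0.01, 0.01]
    print("uniform:", estimate_sample_size(10, uniform))
    print("skewed: ", estimate_sample_size(10, skewed))
```

Under these assumptions, a skewed frequency distribution needs a markedly larger sample and more iterations than a uniform one, because rare haplotypes are seldom drawn. This is consistent with the abstract's remark that runtime depends on the initial sample size and on the skewness of the distribution.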