Soni Vivak, Johri Parul, Jensen Jeffrey D
School of Life Sciences, Arizona State University, Tempe, AZ, USA.
Present address: Department of Biology, Department of Genetics, University of North Carolina, Chapel Hill, NC, USA.
bioRxiv. 2023 Jun 15:2023.06.15.545166. doi: 10.1101/2023.06.15.545166.
The detection of selective sweeps from population genomic data often relies on the premise that the beneficial mutations in question fixed very near the sampling time. Because the power to detect a selective sweep has previously been shown to depend strongly on both the time since fixation and the strength of selection, strong, recent sweeps naturally leave the strongest signatures. However, the biological reality is that beneficial mutations enter populations at some rate, which partially determines the mean waiting time between sweep events and hence their age distribution. An important question thus remains about the power to detect recurrent selective sweeps when they are modelled with a realistic mutation rate and as part of a realistic distribution of fitness effects (DFE), as opposed to the more commonly modelled single, recent, isolated event on a purely neutral background. Here we use forward-in-time simulations to study the performance of commonly used sweep statistics within more realistic evolutionary baseline models incorporating purifying and background selection, population size change, and mutation and recombination rate heterogeneity. The results demonstrate the important interplay of these processes and necessitate caution when interpreting selection scans; specifically, false-positive rates exceed true-positive rates across much of the evaluated parameter space, and selective sweeps are often undetectable unless selection is exceptionally strong.
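To make the link between the beneficial mutation rate and the mean waiting time between sweeps concrete, the following is a minimal back-of-the-envelope sketch and not the simulation framework used in the study. It assumes a diploid Wright-Fisher population of constant size, a single selection coefficient rather than a full DFE, no interference between sweeps, and Haldane's approximation that a new semidominant beneficial mutation with heterozygote advantage s fixes with probability of about 2s. The function and parameter names are illustrative.

```python
def expected_generations_between_sweeps(pop_size, beneficial_rate, s):
    """Approximate mean wait, in generations, between successful sweeps.

    Assumptions (illustrative, not taken from the paper):
      pop_size        -- diploid population size N, held constant
      beneficial_rate -- beneficial mutations per haploid copy of the
                         region per generation
      s               -- selective advantage of the heterozygote
    """
    if pop_size <= 0 or beneficial_rate <= 0 or s <= 0:
        raise ValueError("all parameters must be positive")
    # New beneficial mutations entering the population each generation
    # (2N haploid copies, each mutating at beneficial_rate).
    new_mutations_per_generation = 2 * pop_size * beneficial_rate
    # Haldane's approximation; reasonable for small s and large N*s.
    fixation_probability = 2 * s
    sweeps_per_generation = new_mutations_per_generation * fixation_probability
    return 1.0 / sweeps_per_generation


# Hypothetical values chosen only for illustration.
wait = expected_generations_between_sweeps(
    pop_size=10_000, beneficial_rate=1e-6, s=0.01
)
print(f"approximate generations between sweeps: {wait:.0f}")
```

With these hypothetical values the approximation gives a wait of about 2,500 generations, so under such a model most sweeps in a sample would be considerably older than the "very recent" events that sweep statistics are best able to detect.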
Outlier-based genomic scans have proven to be a popular approach for identifying loci that have potentially experienced recent positive selection. However, it has previously been shown that an evolutionarily appropriate baseline model, one that incorporates non-equilibrium population histories, purifying and background selection, and variation in mutation and recombination rates, is necessary to reduce the often extreme false-positive rates of genomic scans. Here we evaluate the power to detect recurrent selective sweeps using common site frequency spectrum (SFS)-based and haplotype-based methods under these increasingly realistic models. We find that while these appropriate evolutionary baselines are essential to reduce false-positive rates, the power to accurately detect recurrent selective sweeps is generally low across much of the biologically relevant parameter space.
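As an illustration of how true- and false-positive rates can be tallied in an outlier-based scan, here is a minimal sketch. It assumes that each genomic window has a single summary statistic for which larger values are more sweep-like, that the true status of each window is known (as it would be in simulated data), and that the top fraction of windows is flagged as candidates. The function name, the tail cutoff, and the toy data are assumptions made for illustration and do not reproduce the authors' pipeline.

```python
import random


def outlier_scan_rates(stats, is_sweep, tail=0.01):
    """Flag the top `tail` fraction of windows; return (TPR, FPR).

    stats    -- one statistic per window (larger = more sweep-like)
    is_sweep -- True where the window truly contains a sweep
    """
    if len(stats) != len(is_sweep):
        raise ValueError("stats and is_sweep must have the same length")
    n_flag = max(1, int(round(tail * len(stats))))
    # Window indices ordered by statistic, largest first.
    order = sorted(range(len(stats)), key=lambda i: stats[i], reverse=True)
    flagged = set(order[:n_flag])

    n_sweep = sum(1 for flag in is_sweep if flag)
    n_neutral = len(is_sweep) - n_sweep
    true_pos = sum(1 for i in flagged if is_sweep[i])
    false_pos = len(flagged) - true_pos

    tpr = true_pos / n_sweep if n_sweep else float("nan")
    fpr = false_pos / n_neutral if n_neutral else float("nan")
    return tpr, fpr


# Toy data: 20 sweep windows with a weakly shifted statistic among 1,000.
rng = random.Random(1)
is_sweep = [i < 20 for i in range(1000)]
stats = [rng.gauss(0.5 if flag else 0.0, 1.0) for flag in is_sweep]
tpr, fpr = outlier_scan_rates(stats, is_sweep, tail=0.05)
print(f"TPR={tpr:.3f}  FPR={fpr:.3f}")
```

Because an outlier scan always flags a fixed fraction of windows, it reports candidates even when no sweeps are present, which is one reason the comparison against an appropriate baseline model matters.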