Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA, USA.
Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, Arizona.
Mol Biol Evol. 2023 Jul 5;40(7). doi: 10.1093/molbev/msad150.
Inference and interpretation of evolutionary processes, in particular of the types and targets of natural selection affecting coding sequences, are critically influenced by the assumptions built into statistical models and tests. If certain aspects of the substitution process (even when they are not of direct interest) are presumed absent or are modeled with too crude a simplification, estimates of key model parameters can become biased, often systematically, and lead to poor statistical performance. Previous work established that failing to accommodate multinucleotide (or multihit, MH) substitutions strongly biases dN/dS-based inference toward false-positive detection of episodic diversifying selection, as does failing to model variation in the rate of synonymous substitution (SRV) among sites. Here, we develop an integrated analytical framework and software tools to simultaneously incorporate these sources of evolutionary complexity into selection analyses. We found that both MH and SRV are ubiquitous in empirical alignments, and incorporating them has a strong effect on whether positive selection is detected (1.4-fold reduction) and on the distributions of inferred evolutionary rates. With simulation studies, we show that this effect is not attributable to reduced statistical power caused by using a more complex model. After a detailed examination of 21 benchmark alignments and a new high-resolution analysis showing which parts of the alignment provide support for positive selection, we show that MH substitutions occurring along shorter branches in the tree explain a significant fraction of discrepant results in selection detection. Our results add to the growing body of literature that examines decades-old modeling assumptions (including the absence of MH substitutions) and finds them to be problematic for comparative genomic data analysis.
Because multinucleotide substitutions have a significant impact on natural selection detection even at the level of an entire gene, we recommend that selection analyses of this type routinely consider including them. To facilitate this, we developed, implemented, and benchmarked a simple, well-performing model-testing framework for selection detection that screens an alignment for positive selection while accounting for two biologically important confounding processes: site-to-site synonymous rate variation and instantaneous multinucleotide substitutions.
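To make the term concrete, the following minimal Python sketch classifies a change between two codons by how many nucleotide positions differ (one, two, or three "hits") and whether the encoded amino acid changes. It illustrates only the definition of a multinucleotide substitution; it is not the model or software described above. The function name `classify_substitution` is hypothetical, and the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered with
# bases T, C, A, G and the third position varying fastest.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(codon_from: str, codon_to: str) -> tuple[int, bool]:
    """Return (number of differing nucleotide positions, is_synonymous).

    A result with 2 or 3 differing positions corresponds to what the text
    calls a multinucleotide (multihit, MH) change if it happens in one step.
    Hypothetical helper for illustration only.
    """
    codon_from, codon_to = codon_from.upper(), codon_to.upper()
    if codon_from not in CODON_TABLE or codon_to not in CODON_TABLE:
        raise ValueError("codons must be three letters drawn from T, C, A, G")
    hits = sum(a != b for a, b in zip(codon_from, codon_to))
    synonymous = CODON_TABLE[codon_from] == CODON_TABLE[codon_to]
    return hits, synonymous


if __name__ == "__main__":
    for pair in [("AAA", "AAG"), ("AAA", "GAA"), ("AGC", "TCC"), ("AAA", "CCC")]:
        hits, syn = classify_substitution(*pair)
        kind = "synonymous" if syn else "nonsynonymous"
        print(f"{pair[0]} -> {pair[1]}: {hits}-nucleotide change, {kind}")
```

Codon models that permit only single-nucleotide instantaneous changes must explain a pair such as AGC and TCC (both serine) through a chain of intermediate single-hit steps, which is the kind of assumption the abstract reports as a source of bias.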