Selberg Avery, Clark Nathan L, Sackton Timothy B, Muse Spencer V, Lucaci Alexander G, Weaver Steven, Nekrutenko Anton, Chikina Maria, Kosakovsky Pond Sergei L
Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA, USA.
Department of Biology, Temple University, Philadelphia, PA, USA.
bioRxiv. 2025 Mar 21:2024.11.13.620707. doi: 10.1101/2024.11.13.620707.
Positive selection is an evolutionary process that increases the frequency of advantageous mutations because they confer a fitness benefit. Inferring the past action of positive selection on protein-coding sequences is fundamental for deciphering phenotypic diversity and the emergence of novel traits. With the advent of genome-wide comparative genomic datasets, researchers can analyze selection not only at the level of individual genes but also globally, delivering systems-level insights into evolutionary dynamics. However, genome-scale datasets are generated with automated pipelines and imperfect curation, which do not eliminate all sequencing, annotation, and alignment errors. Positive selection inference methods are highly sensitive to such errors. We present BUSTED-E: a method designed to detect positive selection for amino acid diversification while concurrently identifying some alignment errors. This method builds on the flexible branch-site random effects model (BUSTED) for fitting distributions of dN/dS, with a critical modification: it incorporates an "error-sink" component to represent an abiological evolutionary regime. Using several genome-scale biological datasets that were extensively filtered with state-of-the-art automated alignment tools, we show that BUSTED-E identifies pervasive residual alignment errors, produces more realistic estimates of positive selection, reduces bias, and improves biological interpretation. The BUSTED-E model promises to be a more stringent filter for identifying positive selection in genome-wide contexts, thus enabling further characterization and validation of the most biologically relevant cases.
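The "error-sink" idea can be illustrated with a toy mixture model. The following is a minimal sketch, not the BUSTED-E implementation: it assumes a three-class dN/dS distribution, appends one extra class with a very large dN/dS and a small weight, and computes empirical Bayes posteriors for a single site. The class values, the names `OmegaClass`, `add_error_sink`, and `class_posteriors`, the default error-class parameters (dN/dS of 100, weight of 0.01), and the per-class likelihoods are all illustrative assumptions. In a real analysis the per-class likelihoods would come from a codon substitution model evaluated on a phylogeny.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OmegaClass:
    omega: float   # dN/dS ratio for this rate class
    weight: float  # mixture proportion of this class
    label: str


def add_error_sink(classes, error_omega=100.0, error_weight=0.01):
    """Append a hypothetical error-sink class and rescale the other weights.

    error_omega and error_weight are assumed values for illustration only.
    """
    if not 0.0 <= error_weight < 1.0:
        raise ValueError("error_weight must be in [0, 1)")
    total = sum(c.weight for c in classes)
    scaled = [
        OmegaClass(c.omega, c.weight / total * (1.0 - error_weight), c.label)
        for c in classes
    ]
    return scaled + [OmegaClass(error_omega, error_weight, "error")]


def class_posteriors(classes, likelihoods):
    """Posterior probability of each class given per-class likelihoods."""
    if len(classes) != len(likelihoods):
        raise ValueError("need one likelihood per class")
    joint = [c.weight * lik for c, lik in zip(classes, likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]


# Assumed baseline distribution: purifying, neutral, and positive classes.
base = [
    OmegaClass(0.1, 0.70, "purifying"),
    OmegaClass(1.0, 0.25, "neutral"),
    OmegaClass(3.0, 0.05, "positive"),
]
mixture = add_error_sink(base)

# Made-up likelihoods for a badly misaligned site: only the error class
# explains the data well, so it absorbs nearly all posterior mass.
post = class_posteriors(mixture, [1e-12, 1e-10, 1e-8, 1e-4])
for c, p in zip(mixture, post):
    print(f"{c.label:>9}: omega={c.omega:<6} weight={c.weight:.4f} posterior={p:.4f}")
```

In this toy setting, a site whose data are implausible under every biological class is assigned to the error class instead of inflating the weight or dN/dS of the positive-selection class, which is the intuition behind treating the extra component as a sink for alignment errors.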