Forni Giobbe, Ruggieri Angelo Alberto, Piccinini Giovanni, Luchetti Andrea
BiGeA Department, University of Bologna, Bologna, Italy.
Department of Biology, University of Puerto Rico-Rio Piedras, San Juan, Puerto Rico.
Ecol Evol. 2021 Sep 3;11(19):13029-13035. doi: 10.1002/ece3.7959. eCollection 2021 Oct.
Inferring the selective forces that orthologous genes have experienced across different lineages can help us understand the evolutionary processes that shaped their extant diversity and the phenotypes they underlie. The most widespread metric for estimating the selection regimes of coding genes, across sites and phylogenies, is the ratio of nonsynonymous to synonymous substitutions (dN/dS, also known as ω). Modern sequencing technologies and the large amount of already available sequence data now allow the retrieval of thousands of orthologous genes across large numbers of species. Nonetheless, the tools available to explore selection regimes are not designed to automatically process all genes, and their practical usage is often restricted to the single-copy genes found across all species considered (i.e., ubiquitous genes). This approach limits the analysis to a fraction of single-copy genes, which can be as much as an order of magnitude fewer than those not consistently found in all species considered (i.e., nonubiquitous genes). Here, we present a workflow named BASE that, leveraging the CodeML framework, eases the inference and interpretation of gene selection regimes in the context of comparative genomics. Although a number of bioinformatics tools have already been developed to facilitate this kind of analysis, BASE is the first to be specifically designed to allow the integration of nonubiquitous genes in a straightforward and reproducible manner. The workflow, along with all relevant documentation, is available at github.com/for-giobbe/BASE.
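The distinction between ubiquitous and nonubiquitous single-copy genes can be illustrated with a minimal Python sketch. This is not BASE's implementation: the function name `partition_single_copy`, the `copy_counts` input layout (gene name mapped to per-species copy numbers), and the rule for discarding genes with paralogs are all assumptions made here for illustration only.

```python
from typing import Dict, List, Set, Tuple


def partition_single_copy(
    copy_counts: Dict[str, Dict[str, int]],
    species: Set[str],
) -> Tuple[List[str], List[str]]:
    """Split single-copy genes into ubiquitous and nonubiquitous sets.

    copy_counts maps a gene (orthogroup) name to the number of copies
    found in each species; a missing species counts as zero copies.
    This input layout is hypothetical, chosen only for the example.
    """
    ubiquitous: List[str] = []
    nonubiquitous: List[str] = []
    for gene, counts in sorted(copy_counts.items()):
        # Assumption: a gene with more than one copy in any species
        # is not single-copy and is left out of both sets.
        if any(n > 1 for n in counts.values()):
            continue
        present = {sp for sp, n in counts.items() if n > 0}
        if not present:
            continue
        if present >= species:
            # Found exactly once in every species considered.
            ubiquitous.append(gene)
        else:
            # Single-copy, but absent from at least one species.
            nonubiquitous.append(gene)
    return ubiquitous, nonubiquitous
```

Under these assumptions, an analysis restricted to the first returned list ignores every gene in the second, which is the limitation the abstract describes.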