National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD, 20894, USA.
Biol Direct. 2022 Aug 30;17(1):22. doi: 10.1186/s13062-022-00337-7.
Evolutionary rate is a key characteristic of gene families that is linked to the functional importance of the respective genes as well as to the specific biological functions of the proteins they encode. Accurate estimation of evolutionary rates is a challenging task that requires precise phylogenetic analysis. Here we present an easy-to-estimate, protein family-level measure of sequence variability based on alignment column homogeneity in multiple alignments of protein sequences from Clade-Specific Clusters of Orthologous Genes (csCOGs).
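To make the idea of a column-homogeneity-based measure concrete, here is a minimal sketch in Python. It is not the authors' method. It assumes, purely for illustration, that the homogeneity of a column is the frequency of its most common residue (gaps ignored) and that variability is one minus the mean homogeneity. The abstract does not give the actual homogeneity definition, and this sketch omits the normalization against overall diversity within the group of organisms. The function names `column_homogeneity` and `alignment_variability` are hypothetical.

```python
from collections import Counter

GAP = "-"  # assumed gap character in the aligned sequences


def column_homogeneity(column):
    """Frequency of the most common residue in one column, ignoring gaps.

    Illustrative assumption only; the paper's definition may differ.
    Returns None for a column that contains nothing but gaps.
    """
    residues = [c for c in column if c != GAP]
    if not residues:
        return None
    top_count = Counter(residues).most_common(1)[0][1]
    return top_count / len(residues)


def alignment_variability(aligned_seqs):
    """One minus the mean column homogeneity over all non-gap-only columns."""
    if not aligned_seqs:
        raise ValueError("alignment is empty")
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("aligned sequences must have equal length")
    # zip(*aligned_seqs) iterates over alignment columns
    scores = []
    for column in zip(*aligned_seqs):
        h = column_homogeneity(column)
        if h is not None:
            scores.append(h)
    if not scores:
        raise ValueError("alignment contains only gaps")
    return 1.0 - sum(scores) / len(scores)


if __name__ == "__main__":
    toy_alignment = ["ACD", "ACE", "A-E"]  # toy data, not from the paper
    print(round(alignment_variability(toy_alignment), 4))
```

Under these assumptions a fully conserved alignment scores 0, and the score rises as columns become more mixed. A measure of this kind needs only one pass over each alignment and no tree, which is consistent with the low computational cost the abstract describes.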
We report genome-wide estimates of variability for eight diverse groups of bacteria and archaea and investigate the connection between variability and various genomic and biological features. The variability estimates are based on homogeneity distributions across amino acid sequence alignments and can be obtained for multiple groups of genomes at minimal computational expense. About half of the variance in variability values can be explained by the analyzed features, with the greatest contribution coming from the extent of gene paralogy in the given csCOG. The correlation between variability and paralogy appears to originate primarily not from gene duplication but from the acquisition of distant paralogs and xenologs, which introduces sequence variants that are more divergent than those that could have evolved in situ during the lifetime of the given group of organisms. Both high-variability and low-variability csCOGs were identified in all functional categories, but, as expected, proteins encoded by integrated mobile elements, as well as proteins involved in defense functions and cell motility, are on average more variable than proteins with housekeeping functions. Additionally, using linear discriminant analysis, we found that variability and the fraction of genomes carrying a given gene are the two variables that best predict gene essentiality, as judged against the results of transposon mutagenesis in Sulfolobus islandicus.
Variability, a measure of sequence diversity within an alignment relative to the overall diversity within a group of organisms, offers a convenient proxy for evolutionary rate estimates and is informative with respect to prediction of functional properties of proteins. In particular, variability is a strong predictor of gene essentiality for the respective organisms and indicative of sub- or neofunctionalization of paralogs.