Centro de Biologia Molecular "Severo Ochoa" (CBMSO), CSIC-UAM Cantoblanco, 28049 Madrid, Spain.
Bioinformatics Facility CBMSO, CSIC-UAM Cantoblanco, 28049 Madrid, Spain.
Bioinformatics. 2023 Nov 1;39(11). doi: 10.1093/bioinformatics/btad630.
Evolutionary inference depends crucially on the quality of multiple sequence alignments (MSAs), and this quality is difficult to achieve for distantly related proteins. Since protein structure is more conserved than sequence, it seems natural to use structure alignments for distant homologs. However, structure alignments may not be suitable for inferring evolutionary relationships.
Here we examined four protein similarity measures that depend on sequence and structure (fraction of aligned residues, sequence identity, fraction of superimposed residues, and contact overlap). We found that they are closely correlated, but none of them provides a complete and unbiased picture of conservation in proteins. We therefore propose a new hybrid protein sequence and structure similarity score, PC_sim, based on their main (first) principal component. The corresponding divergence measure, PC_div, shows the strongest correlation with the divergences obtained from the individual similarities, suggesting that it infers accurate evolutionary divergences. We developed the program PC_ali, which constructs protein MSAs either de novo or by modifying an input MSA, using a similarity matrix based on PC_sim. The program builds a starting MSA from the maximal cliques of the graph of pairwise alignments (PAs) and refines it through progressive alignments along the tree reconstructed with PC_div. Compared with eight state-of-the-art multiple structure or sequence alignment tools, PC_ali achieves a higher or equal aligned fraction and structural scores, a sequence identity higher than that of structure aligners although lower than that of sequence aligners, the highest PC_sim score, and the highest similarity with the MSAs produced by other tools and with the BAliBASE reference alignments.
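The idea of combining several correlated similarity measures through their main principal component can be illustrated with a minimal sketch. This is not the PC_ali implementation: the function name, the column order, the standardization step, and the toy values are all assumptions made for illustration, and the abstract does not specify how PC_sim is normalized or how PC_div is derived from it.

```python
import numpy as np


def first_pc_score(similarities):
    """Combine similarity measures into one score via their first principal component.

    similarities: array of shape (n_pairs, n_measures). The column order used
    below (aligned fraction, sequence identity, superimposed fraction,
    contact overlap) is a hypothetical convention, not taken from PC_ali.
    Returns (scores, weights, explained_fraction).
    """
    x = np.asarray(similarities, dtype=float)
    # Standardize each measure so that no column dominates through its scale.
    z = (x - x.mean(axis=0)) / x.std(axis=0)
    # Eigen-decomposition of the correlation matrix of the measures.
    corr = np.corrcoef(z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)  # eigenvalues in ascending order
    weights = eigvecs[:, -1]                 # eigenvector of the largest eigenvalue
    # The sign of an eigenvector is arbitrary; orient it so that a higher
    # score means a more similar protein pair.
    if weights.sum() < 0:
        weights = -weights
    scores = z @ weights
    explained_fraction = eigvals[-1] / eigvals.sum()
    return scores, weights, explained_fraction


# Toy values invented for illustration only (one row per protein pair).
toy = np.array([
    [0.95, 0.80, 0.90, 0.85],
    [0.80, 0.45, 0.75, 0.70],
    [0.70, 0.30, 0.60, 0.55],
    [0.55, 0.15, 0.40, 0.35],
    [0.90, 0.60, 0.85, 0.80],
])
scores, weights, explained_fraction = first_pc_score(toy)
print("weights:", weights)
print("scores:", scores)
print("variance explained by first PC:", explained_fraction)
```

When the measures are strongly and positively correlated, as the abstract reports for the four similarities, the first component carries most of the shared variance, which is the motivation for using it as a single hybrid score.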