Department of Computer Science and Engineering, University of Bologna, Bologna 40126, Italy.
Department of Computer Science, University of California, Irvine, CA 92697, USA.
Bioinformatics. 2021 May 1;37(4):506-513. doi: 10.1093/bioinformatics/btaa833.
Protein fold recognition is a key step for template-based modeling approaches to protein structure prediction. Although closely related folds can be easily identified by sequence homology search in sequence databases, fold recognition is notoriously more difficult when it involves the identification of distantly related homologs. Recent progress in residue-residue contact and distance prediction opens up the possibility of improving fold recognition by using structural information contained in predicted distance and contact maps.
Here we propose to use the congruence coefficient as a metric of similarity between maps. We prove that this metric has several interesting mathematical properties which allow one to compute in polynomial time its exact mean and variance over all possible (exponentially many) alignments between two symmetric matrices, and to assess the statistical significance of similarity between aligned maps. We perform fold recognition tests by recovering predicted target contact/distance maps from the two most recent Critical Assessment of Structure Prediction (CASP) editions and over 27 000 non-homologous structural templates from the ECOD database. On this large benchmark, we compare the fold recognition performance of different alignment tools using their own similarity scores against that obtained using the congruence coefficient. We show that the congruence coefficient overall improves fold recognition over other methods, proving its effectiveness as a general similarity metric for protein map comparison.
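As an illustration of the metric itself, the following is a minimal sketch of the standard (Tucker) congruence coefficient between two equal-sized maps, that is, the sum of element-wise products divided by the product of the two Frobenius norms. It is not the CCpro implementation: the function names `congruence_coefficient` and `aligned_submatrix`, the representation of an alignment as two lists of matched residue indices, and the toy maps are all assumptions made for this example. The polynomial-time mean and variance over all alignments described in the abstract are not reproduced here.

```python
import math


def aligned_submatrix(matrix, indices):
    """Return the square sub-map restricted to the given residue indices.

    The index-list representation of an alignment is an assumption of this
    sketch, not something specified in the abstract.
    """
    return [[matrix[i][j] for j in indices] for i in indices]


def congruence_coefficient(a, b):
    """Congruence coefficient of two equal-sized square maps.

    Computed as sum(a_ij * b_ij) / (||a||_F * ||b||_F).
    """
    n = len(a)
    if n != len(b) or any(len(row) != n for row in a) or any(len(row) != n for row in b):
        raise ValueError("maps must be square and of the same size")
    numerator = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n))
    norm_a = math.sqrt(sum(x * x for row in a for x in row))
    norm_b = math.sqrt(sum(x * x for row in b for x in row))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("the coefficient is undefined for an all-zero map")
    return numerator / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical toy contact maps (1 = contact, 0 = no contact).
    target = [
        [0, 1, 0],
        [1, 0, 1],
        [0, 1, 0],
    ]
    template = [
        [0, 1, 0, 0],
        [1, 0, 0, 1],
        [0, 0, 0, 1],
        [0, 1, 1, 0],
    ]
    # Assumed alignment: target residues 0, 1, 2 matched to template residues 0, 1, 3.
    target_idx = [0, 1, 2]
    template_idx = [0, 1, 3]
    score = congruence_coefficient(
        aligned_submatrix(target, target_idx),
        aligned_submatrix(template, template_idx),
    )
    print(f"congruence coefficient of the aligned maps: {score:.3f}")
```

For non-negative maps such as contact or distance maps the coefficient lies between 0 and 1, and it is unchanged when either map is multiplied by a positive constant.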
The congruence coefficient software CCpro is available as part of the SCRATCH suite at: http://scratch.proteomics.ics.uci.edu/.
Supplementary data are available at Bioinformatics online.