Postic Guillaume, Janel Nathalie, Moroy Gautier
Université de Paris, BFA, UMR 8251, CNRS, ERL U1133, Inserm, F-75013 Paris, France.
Université de Paris, BFA, UMR 8251, CNRS, F-75013 Paris, France.
Comput Struct Biotechnol J. 2021 Apr 28;19:2618-2625. doi: 10.1016/j.csbj.2021.04.049. eCollection 2021.
The recent breakthrough in the field of protein structure prediction shows the relevance of using knowledge-based scoring functions in combination with a low-resolution 3D representation of protein macromolecules. The choice of not using all atoms is supported by hardly any data in the literature, and is mostly motivated by empirical and practical reasons, such as the computational cost of assessing the numerous folds of the protein conformational space. Here, we present a comprehensive study, carried out on a large and balanced benchmark of predicted protein structures, to see how different types of structural representations rank in either accuracy or calculation speed, and which ones offer the best compromise between these two criteria. We tested ten representations, including low-resolution, high-resolution, and coarse-grained approaches. We also investigated whether the findings generalize to formalisms other than the widely used "potential of mean force" (PMF) method. We observed that representing protein structures by their β carbons (combined or not with Cα) provides the best speed-accuracy trade-off when using a "total information gain" scoring function. For statistical PMFs, using MARTINI backbone and side-chain beads is the best option. Finally, we also demonstrated the necessity of training the reference state on all atom types, and of including the Cα atoms of glycine residues, in a Cβ-based representation.
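As an illustration of the Cβ-based representation mentioned above, in which glycine residues (which have no Cβ) contribute their Cα instead, the following is a minimal sketch and not the authors' implementation. It assumes fixed-column PDB `ATOM` records as input; the function names `cbeta_representation` and `make_atom_line` are hypothetical and were chosen for this example only.

```python
def make_atom_line(serial, name, res, chain, seq, x, y, z):
    """Build a fixed-column PDB ATOM record (hypothetical helper for the demo)."""
    return (f"ATOM  {serial:5d} {name:<4s} {res:3s} {chain}{seq:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


def cbeta_representation(pdb_lines):
    """Keep one point per residue: CB, or CA when the residue is glycine.

    Returns a list of (res_name, chain, res_seq, (x, y, z)) tuples.
    Assumption: input lines follow the fixed-column PDB ATOM layout.
    """
    points = []
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        atom_name = line[12:16].strip()
        res_name = line[17:20].strip()
        # Glycine has no beta carbon, so its alpha carbon stands in for it.
        wanted = "CA" if res_name == "GLY" else "CB"
        if atom_name != wanted:
            continue
        chain = line[21]
        res_seq = int(line[22:26])
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        points.append((res_name, chain, res_seq, xyz))
    return points


# Toy two-residue input (coordinates are made up for the demo).
demo_lines = [
    make_atom_line(1, " N", "ALA", "A", 1, 0.0, 0.0, 0.0),
    make_atom_line(2, " CA", "ALA", "A", 1, 1.5, 0.0, 0.0),
    make_atom_line(3, " CB", "ALA", "A", 1, 2.0, 1.5, 0.0),
    make_atom_line(4, " N", "GLY", "A", 2, 3.0, 0.0, 0.0),
    make_atom_line(5, " CA", "GLY", "A", 2, 4.5, 0.0, 0.0),
]
demo_points = cbeta_representation(demo_lines)
```

With this toy input, the alanine is represented by its CB and the glycine by its CA, so every residue keeps exactly one interaction centre.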