Influence of protein structure databases on the predictive power of statistical pair potentials.

Author information

Furuichi E, Koehl P

Affiliation

CNRS, Illkirch Graffenstaden, France.

Publication information

Proteins. 1998 May 1;31(2):139-49. doi: 10.1002/(sici)1097-0134(19980501)31:2<139::aid-prot4>3.0.co;2-h.

PMID:9593188

Abstract

A long-standing goal in protein structure studies is the development of reliable energy functions that can be used both to verify protein models derived from experimental constraints and to run theoretical protein folding and inverse folding computer experiments. In that respect, knowledge-based statistical pair potentials have recently attracted considerable interest, mainly because they include the essential features of protein structures, as well as solvent effects, at a low computing cost. However, the basis on which statistical potentials are derived has been questioned. In this paper, we investigate statistical pair potentials derived from protein three-dimensional structures, addressing in particular questions related to the form of these potentials and to the content of the database from which they are derived. We have shown that statistical pair potentials depend on the size of the proteins included in the database, and that this dependence can be reduced by considering only pairs of residues close in space (i.e., with a cutoff of 8 Å). We have also shown that statistical potentials carry a memory of the quality of the database in terms of the amount and diversity of secondary structure it contains. We find, for example, that potentials derived from a database containing alpha-proteins perform best only on alpha-proteins in fold recognition computer experiments. We believe that this is an overall weakness of these potentials, which must be kept in mind when constructing a database.
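
To make the idea of a short-range statistical pair potential concrete, here is a minimal Python sketch of one common way such a potential can be built: count residue-type pairs closer than 8 Å in a set of structures, then convert observed versus expected counts into energies with an inverse-Boltzmann formula. This is an illustration only, not the authors' procedure; the abstract does not give their functional form. The function names, the one-point-per-residue input format, the minimum sequence separation `MIN_SEP`, the random-mixing reference state, and the pseudocount are all assumptions introduced here.

```python
import math
from collections import Counter
from itertools import combinations

CUTOFF = 8.0  # Å, the cutoff quoted in the abstract
MIN_SEP = 3   # assumed minimum sequence separation (not from the source)


def count_contacts(structures, cutoff=CUTOFF, min_sep=MIN_SEP):
    """Count residue-type pairs whose representative points lie within `cutoff`.

    `structures` is a list of chains; each chain is a list of
    (residue_type, (x, y, z)) tuples in sequence order (hypothetical format).
    """
    pair_counts = Counter()
    for residues in structures:
        for (i, (a, pa)), (j, (b, pb)) in combinations(enumerate(residues), 2):
            if j - i < min_sep:
                continue  # skip pairs that are close only because of chain connectivity
            if math.dist(pa, pb) < cutoff:
                pair_counts[tuple(sorted((a, b)))] += 1
    return pair_counts


def contact_potential(pair_counts, pseudocount=1.0):
    """Inverse-Boltzmann energies (in units of kT) from contact counts.

    The expected count assumes random mixing of the residue types that
    take part in contacts; this reference state is an assumption.
    """
    total = sum(pair_counts.values())
    type_counts = Counter()
    for (a, b), n in pair_counts.items():
        type_counts[a] += n
        type_counts[b] += n
    n_ends = 2 * total  # every contact has two partners
    types = sorted(type_counts)
    energies = {}
    for idx, a in enumerate(types):
        for b in types[idx:]:
            fa = type_counts[a] / n_ends
            fb = type_counts[b] / n_ends
            expected = total * fa * fb * (1 if a == b else 2)
            observed = pair_counts.get((a, b), 0)
            energies[(a, b)] = -math.log(
                (observed + pseudocount) / (expected + pseudocount)
            )
    return energies


# Toy chain with made-up coordinates, only to show the call pattern.
toy = [[
    ("A", (0.0, 0.0, 0.0)),
    ("L", (3.8, 0.0, 0.0)),
    ("G", (7.6, 0.0, 0.0)),
    ("L", (7.6, 3.8, 0.0)),
    ("A", (3.8, 3.8, 0.0)),
]]
toy_counts = count_contacts(toy)
toy_energies = contact_potential(toy_counts)
```

Under these assumptions, one way to probe the database dependence described above would be to call `count_contacts` separately on different subsets of structures (for example, alpha-rich versus mixed sets) and compare the resulting energy tables.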
