Rykunov Dmitry, Fiser András
Department of Biochemistry, Seaver Center for Bioinformatics, Albert Einstein College of Medicine, Bronx, New York 10461, USA.
Proteins. 2007 May 15;67(3):559-68. doi: 10.1002/prot.21279.
Statistical distance-dependent pair potentials are frequently used in folding, threading, and modeling studies of proteins. The applicability of these potentials is tightly connected to the reliability of the underlying statistical observations. We explored the possible origin and extent of false-positive signals in statistical potentials by analyzing their distance dependence in a variety of randomized protein-like models. While potentials derived from such models are expected to equal zero on average at any distance, we demonstrate that systematic and significant distortions exist. These distortions originate from the limited statistical counts in local environments of proteins and, at large distances, from the limited size of protein structures. We suggest that these systematic errors in statistical potentials are connected to the dependence of amino acid composition on protein size and to variation in protein sizes. Additionally, atom-based potentials are dominated by a false-positive signal that is due to correlation among distances measured from atoms of one residue to atoms of another residue. We assessed the significance of residue-based pairwise potentials at various spatial pair separations and found that as few as approximately 50% of potential values were statistically significant at distances below 4 Å, and at most approximately 80% were significant at larger pair separations. A new definition of the reference state, free of the observed systematic errors, is suggested. It is shown to generate statistical potentials that compare favorably with other publicly available ones.
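For readers unfamiliar with the construction, the following is a minimal sketch of how a distance-dependent pair potential is commonly obtained from observed counts by the inverse Boltzmann relation, E(a, b, r) = -kT ln(N_obs / N_exp). It is not the reference state proposed in this paper, which the abstract does not specify. The function name `pair_potential`, the input layout, and the composition-based expectation used for N_exp are assumptions made for illustration only.

```python
import math
from collections import Counter


def pair_potential(pairs, kT=1.0):
    """Inverse-Boltzmann pair potential from observed contacts.

    pairs: iterable of (type_a, type_b, distance_bin) observations.
    Returns {(type_a, type_b, distance_bin): energy} with type_a <= type_b.

    Assumption (illustrative, not the paper's method): the expected count
    in each distance bin is the total count in that bin multiplied by the
    product of the overall fractions of the two residue types.
    """
    observed = Counter()      # counts per (type pair, distance bin)
    per_bin = Counter()       # total pair count per distance bin
    type_counts = Counter()   # how often each type takes part in a pair

    for a, b, k in pairs:
        lo, hi = (a, b) if a <= b else (b, a)  # treat pairs as unordered
        observed[(lo, hi, k)] += 1
        per_bin[k] += 1
        type_counts[a] += 1
        type_counts[b] += 1

    total = sum(type_counts.values())
    potential = {}
    for (a, b, k), n_obs in observed.items():
        x_a = type_counts[a] / total
        x_b = type_counts[b] / total
        multiplicity = 1 if a == b else 2  # a-b and b-a are the same pair
        n_exp = multiplicity * x_a * x_b * per_bin[k]
        potential[(a, b, k)] = -kT * math.log(n_obs / n_exp)
    return potential


# Counts that match the composition-based expectation give zero energy.
neutral = [("A", "A", 0), ("A", "B", 0), ("B", "A", 0), ("B", "B", 0)]
print(pair_potential(neutral))
```

In an idealized randomized model with unlimited data, every value returned by such a function would be zero. The abstract's point is that with finite counts and finite protein sizes, systematic deviations from zero appear even when no real interaction signal is present.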