Dias Raquel, Kolaczkowski Bryan
Department of Microbiology and Cell Science, University of Florida, Gainesville, FL, USA.
BMC Bioinformatics. 2017 Mar 23;18(Suppl 5):102. doi: 10.1186/s12859-017-1533-z.
One goal of structural biology is to understand how a protein's three-dimensional conformation determines its capacity to interact with potential ligands. In the case of small chemical ligands, deconstructing a static protein-ligand complex into its constituent atom-atom interactions is typically sufficient to rapidly predict ligand affinity with high accuracy (>70% correlation between predicted and experimentally determined affinity), a fact that is exploited to support structure-based drug design. We recently found that protein-DNA/RNA affinity can also be predicted with high accuracy using extensions of existing techniques, but protein-protein affinity could not be predicted with >60% correlation, even when the structure of the protein-protein complex was available.
X-ray and NMR structures of protein-protein complexes, together with their associated binding affinities and experimental conditions, were obtained from several binding affinity and structural databases. Statistical models were implemented using a generalized linear model framework, with the experimental conditions included as new model features. We evaluated the potential for new features to improve affinity prediction models by calculating the Pearson correlation between predicted and experimental binding affinities on the training and test data, both after model fitting and after cross-validation. Differences in accuracy were assessed using a two-sample t test and the nonparametric Mann-Whitney U test.
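The evaluation described above (fit a model, then compare the Pearson correlation after fitting and after cross-validation) can be sketched as follows. This is a minimal illustration and not the authors' pipeline: it assumes a single hypothetical feature (`score`), uses ordinary least squares as the simplest Gaussian, identity-link case of a generalized linear model, and runs on synthetic data generated only for the example. The fold count `k=5` is also an assumption.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x (one-feature Gaussian GLM, identity link)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    a = my - b * mx
    return a, b


def cross_validated_r(xs, ys, k=5, seed=0):
    """Pearson r between held-out predictions and observations over k folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    preds, obs = [], []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in fold:
            preds.append(a + b * xs[i])
            obs.append(ys[i])
    return pearson(preds, obs)


# Synthetic data (assumption): a hypothetical interaction score and a noisy
# "experimental" affinity that depends on it linearly.
rng = random.Random(1)
score = [rng.uniform(-12.0, -4.0) for _ in range(100)]
affinity = [0.8 * s + rng.gauss(0.0, 1.0) for s in score]

a, b = fit_line(score, affinity)
r_fit = pearson([a + b * s for s in score], affinity)  # after model fitting
r_cv = cross_validated_r(score, affinity)              # after cross-validation
print(f"r (fit) = {r_fit:.2f}, r (cross-validated) = {r_cv:.2f}")
```

Reporting both numbers is useful because a gap between the fitted and cross-validated correlation suggests that an added feature improves the fit to the training data more than it improves prediction on unseen complexes.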
Here we evaluate a range of potential factors that may interfere with accurate protein-protein affinity prediction. We find that X-ray crystal resolution has the strongest single effect on protein-protein affinity prediction. Limiting our analyses to only high-resolution complexes (≤2.5 Å) increased the correlation between predicted and experimental affinity from 54 to 68% (p = 4.32x10). In addition, incorporating information on the experimental conditions under which affinities were measured (pH, temperature and binding assay) had significant effects on prediction accuracy. We also highlight a number of potential errors in large structure-affinity databases, which could affect both model training and accuracy assessment.
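The resolution filter and the nonparametric comparison of accuracies can be sketched as below. This is an illustrative sketch under stated assumptions, not the authors' code: the record layout (`id`, `resolution`), the complex identifiers, and the per-replicate correlation values are hypothetical placeholders, and the Mann-Whitney U p-value uses the normal approximation without a tie correction. Only the 2.5 Å cutoff comes from the text.

```python
import math
from statistics import NormalDist

RESOLUTION_CUTOFF = 2.5  # angstroms; the high-resolution threshold named in the text


def high_resolution(complexes, cutoff=RESOLUTION_CUTOFF):
    """Keep only complexes whose crystal resolution is at or below the cutoff."""
    return [c for c in complexes if c["resolution"] <= cutoff]


def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test (normal approximation, no tie correction).

    Returns (U statistic for sample a, approximate p-value).
    """
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # average of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = midrank
        i = j
    n1, n2 = len(a), len(b)
    rank_sum_a = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_a = rank_sum_a - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u_a - mean_u) / sd_u
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return u_a, p


# Hypothetical records (placeholders, not entries from any real database).
complexes = [
    {"id": "complex_a", "resolution": 1.8},
    {"id": "complex_b", "resolution": 2.5},
    {"id": "complex_c", "resolution": 3.2},
]
kept = high_resolution(complexes)
print([c["id"] for c in kept])

# Made-up correlations from repeated cross-validation runs, for illustration only.
r_all = [0.52, 0.55, 0.53, 0.56, 0.54, 0.51]
r_high_res = [0.66, 0.69, 0.67, 0.70, 0.68, 0.65]
u, p = mann_whitney_u(r_all, r_high_res)
print(f"U = {u}, p = {p:.4f}")
```

A rank-based test is a reasonable companion to the t test here because correlation coefficients from a small number of replicates need not be normally distributed.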
The results suggest that the accuracy of statistical models for protein-protein affinity prediction may be limited by the information present in databases used to train new models. Improving our capacity to integrate large-scale structural and functional information may be required to substantively advance our understanding of the general principles by which a protein's structure determines its function.