Zheng Fan, Grigoryan Gevorg
Department of Biological Sciences, Dartmouth College, Hanover, NH, United States of America.
Department of Computer Science, Dartmouth College, Hanover, NH, United States of America.
PLoS One. 2017 May 26;12(5):e0178272. doi: 10.1371/journal.pone.0178272. eCollection 2017.
The Protein Data Bank (PDB) has been a key resource for learning general rules of sequence-structure relationships in proteins. Quantitative insights have been gained by defining geometric descriptors of structure (e.g., distances, dihedral angles, solvent exposure) and observing their distributions and sequence preferences. Here we argue that as the PDB continues to grow, it may become unnecessary to reduce structure into a set of elementary descriptors. Instead, it could be possible to deduce quantitative sequence-structure relationships in the context of precisely defined complex structural motifs by mining the PDB for closely matching backbone geometries. To validate this idea, we turned to the task of predicting changes in protein stability upon amino-acid substitution, a difficult problem of broad significance. We defined non-contiguous tertiary motifs (TERMs) around a protein site of interest and extracted sequence preferences from ensembles of closely matching substructures in the PDB to predict mutational stability changes at the site, ΔΔGm. We demonstrate that these ensemble statistics predict ΔΔGm on par with state-of-the-art statistical and machine-learning methods on large thermodynamic datasets, and outperform these, along with a leading structure-based modeling approach, when tested in the context of unbiased diverse mutations. Further, we show that the performance of the TERM-based method is directly related to the amount of available relevant structural data, so that it improves automatically as the PDB grows. This relationship also provides a means of estimating prediction accuracy. Our results clearly demonstrate that: 1) statistics of non-contiguous structural motifs in the PDB encode fundamental sequence-structure relationships related to protein thermodynamic stability, and 2) the PDB is now large enough that such statistics are already useful in practice, with their accuracy expected to continue increasing as the database grows. These observations suggest new ways of using structural data to address problems in computational structural biology.
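As a rough illustration of how sequence preferences drawn from an ensemble of structural matches could be turned into a stability-like score, the sketch below counts the amino acids observed at the site of interest across the matches and compares the mutant and wild-type frequencies on a log scale. This is a minimal sketch under stated assumptions, not the scoring function used in the paper: the function names, the uniform pseudocount, the log-ratio form, and the sign convention (positive meaning the mutant is less favored) are all hypothetical, and the score is in arbitrary units rather than kcal/mol.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def site_preferences(observed, pseudocount=1.0):
    """Estimate amino-acid frequencies at one site from an ensemble of matches.

    `observed` is an iterable of one-letter codes, one per matching
    substructure (hypothetical input; the search for matches is not shown).
    A uniform pseudocount keeps unseen residues from having zero frequency.
    """
    counts = Counter(aa for aa in observed if aa in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


def pseudo_ddg(observed, wild_type, mutant, pseudocount=1.0):
    """Log-ratio score for a substitution (assumed form, arbitrary units).

    Positive values mean the mutant residue is seen less often than the
    wild-type residue in the ensemble (an assumed sign convention).
    """
    prefs = site_preferences(observed, pseudocount)
    return -math.log(prefs[mutant] / prefs[wild_type])


if __name__ == "__main__":
    # Toy ensemble: residues found at the aligned position in ten matches.
    ensemble = "LLLLLLIIVA"
    print(pseudo_ddg(ensemble, wild_type="L", mutant="A"))
```

With more matches in the ensemble, the frequency estimates depend less on the pseudocount, which is one simple way to picture why such statistics would sharpen as the database grows.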