Zhao Huiying, Wang Jihua, Zhou Yaoqi, Yang Yuedong
School of Informatics, Indiana University Purdue University, Indianapolis, Indiana, United States of America; Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, Indiana, United States of America; QIMR Berghofer Medical Research Institute, Brisbane, Queensland, Australia.
Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, Indiana, United States of America; Shandong Provincial Key Laboratory of Functional Macromolecular Biophysics, Dezhou University, Dezhou, Shandong, China.
PLoS One. 2014 May 2;9(5):e96694. doi: 10.1371/journal.pone.0096694. eCollection 2014.
As increasingly inexpensive sequencing techniques uncover more and more protein sequences, determining their functions has become an urgent task. This work presents a highly reliable computational technique that predicts DNA-binding function at the level of protein-DNA complex structures, rather than making the low-resolution two-state (binding or non-binding) prediction that most existing techniques provide. The method first predicts the protein-DNA complex structure with the template-based structure prediction technique HHblits, then predicts binding affinity with a knowledge-based energy function (distance-scaled finite ideal-gas reference state for protein-DNA interactions). Leave-one-out cross-validation on 179 DNA-binding and 3797 non-binding protein domains yields a Matthews correlation coefficient (MCC) of 0.77, with a precision of 94% and a sensitivity of 65%. We further found 51% sensitivity for 82 newly determined structures of DNA-binding proteins and 56% sensitivity for the human proteome. In addition, the method provides a reasonably accurate prediction of DNA-binding residues in proteins, based on the predicted DNA-binding complex structures. Applying it to the human proteome identifies more than 300 novel DNA-binding proteins; some of these predicted structures were validated against known structures of homologous proteins in apo form. The method [SPOT-Seq (DNA)] is available as an online server at http://sparks-lab.org.
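To make the reported metrics concrete, the following is a minimal sketch of how MCC, precision, and sensitivity relate to a confusion matrix. The abstract reports only the percentages and the dataset sizes (179 binding, 3797 non-binding domains); the individual counts used below (116 true positives, 7 false positives) are back-calculated from those figures by rounding and are an assumption for illustration, not numbers taken from the paper. The function name `binary_metrics` is likewise hypothetical.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Return (precision, sensitivity, mcc) for a binary confusion matrix."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return precision, sensitivity, mcc


# Assumed counts, reconstructed from the reported percentages:
# 179 binders at ~65% sensitivity -> about 116 true positives, 63 false negatives;
# ~94% precision -> about 7 false positives among 3797 non-binders.
tp, fn = 116, 63
fp, tn = 7, 3797 - 7

precision, sensitivity, mcc = binary_metrics(tp, fp, fn, tn)
print(f"precision={precision:.2f} sensitivity={sensitivity:.2f} mcc={mcc:.2f}")
```

Under these assumed counts the three values agree with the reported 94%, 65%, and 0.77 after rounding. Because non-binders outnumber binders by roughly 21 to 1, MCC is a more informative summary here than plain accuracy, which would stay high even for a predictor that labels every domain as non-binding.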