ProNA2020 predicts protein-DNA, protein-RNA, and protein-protein binding proteins and residues from sequence.

Author information

Department of Informatics, I12-Chair of Bioinformatics and Computational Biology, Technical University of Munich (TUM), Boltzmannstrasse 3, 85748, Garching, Munich, Germany; TUM Graduate School, Center of Doctoral Studies in Informatics and Its Applications (CeDoSIA), Garching, 85748, Germany.

Publication information

J Mol Biol. 2020 Mar 27;432(7):2428-2443. doi: 10.1016/j.jmb.2020.02.026. Epub 2020 Mar 4.

DOI:10.1016/j.jmb.2020.02.026

PMID:32142788

Abstract

The intricate details of how proteins bind to proteins, DNA, and RNA are crucial for the understanding of almost all biological processes. Disease-causing sequence variants often affect binding residues. Here, we describe a new, comprehensive system of in silico methods that take only protein sequence as input to predict binding of proteins to DNA, RNA, and other proteins. First, we developed several new methods to predict whether or not proteins bind (per-protein prediction). Second, we developed independent methods that predict which residues bind (per-residue prediction). Without requiring three-dimensional information, the system can predict the actual binding residues. For protein-level predictions, the system combined homology-based inference with machine learning, and motif-based profile-kernel approaches with word-based (ProtVec) solutions. This achieved an overall non-exclusive three-state accuracy of 77% ± 1% (± one standard error), corresponding to a 1.8-fold improvement over random (best classification for protein-protein binding with F1 = 91 ± 0.8%). For per-residue predictions of binding residues, standard neural networks performed best for DNA-binding (Q2 = 81 ± 0.9%), followed by RNA-binding (Q2 = 80 ± 1%), and worst for protein-protein binding (Q2 = 69 ± 0.8%). The new method, dubbed ProNA2020, is available as code through GitHub (https://github.com/Rostlab/ProNA2020.git) and through PredictProtein (www.predictprotein.org).
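
The abstract reports per-residue performance as Q2 and per-protein performance as F1 without defining either. The sketch below shows how such scores are commonly computed. It assumes that Q2 is the percentage of residues assigned the correct one of two states (binding or non-binding) and that F1 is the harmonic mean of precision and recall for the binding class; the paper may define them differently. The function names and the toy labels are hypothetical and are not taken from the ProNA2020 code.

```python
from typing import Sequence


def q2(observed: Sequence[int], predicted: Sequence[int]) -> float:
    """Two-state accuracy in percent (assumed definition of Q2).

    Labels are 1 for binding and 0 for non-binding.
    """
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("label sequences must be non-empty and equally long")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)


def f1(observed: Sequence[int], predicted: Sequence[int]) -> float:
    """F1 of the binding class (label 1) in percent."""
    if len(observed) != len(predicted):
        raise ValueError("label sequences must be equally long")
    pairs = list(zip(observed, predicted))
    tp = sum(o == 1 and p == 1 for o, p in pairs)  # true positives
    fp = sum(o == 0 and p == 1 for o, p in pairs)  # false positives
    fn = sum(o == 1 and p == 0 for o, p in pairs)  # false negatives
    if tp == 0:
        return 0.0  # no correct binding predictions
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 100.0 * 2 * precision * recall / (precision + recall)


# Toy example (made-up labels for a 10-residue protein, not real data).
observed = [0, 1, 1, 0, 0, 1, 0, 0, 0, 1]
predicted = [0, 1, 0, 0, 0, 1, 1, 0, 0, 1]

print(f"Q2 = {q2(observed, predicted):.1f}%")
print(f"F1 = {f1(observed, predicted):.1f}%")
```

Under these assumptions, Q2 rewards correct non-binding predictions as much as correct binding ones, whereas F1 ignores true negatives. The two scores are therefore not directly comparable when binding residues are rare.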
