Roche Rahmatullah, Moussad Bernard, Shuvo Md Hossain, Tarafder Sumit, Bhattacharya Debswapna
Department of Computer Science, Virginia Tech, Blacksburg, VA 24061, United States of America.
bioRxiv. 2023 Sep 16:2023.09.14.557719. doi: 10.1101/2023.09.14.557719.
Protein language models (pLMs) trained on large corpora of protein sequences have shown unprecedented scalability and broad generalizability across a wide range of predictive modeling tasks, but their power has not yet been harnessed for predicting protein-nucleic acid binding sites, which are critical for characterizing the interactions between proteins and nucleic acids. Here we present EquiPNAS, a new pLM-informed E(3)-equivariant deep graph neural network framework for improved protein-nucleic acid binding site prediction. By combining the strengths of pLMs and symmetry-aware deep graph learning, EquiPNAS consistently outperforms state-of-the-art methods for both protein-DNA and protein-RNA binding site prediction on multiple datasets and across a diverse set of predictive modeling scenarios, ranging from experimental input to AlphaFold2 predictions. Our ablation study reveals that the pLM embeddings used in EquiPNAS are sufficiently powerful to dramatically reduce the dependence on the availability of evolutionary information without compromising accuracy, and that the symmetry-aware nature of the E(3)-equivariant graph-based neural architecture offers remarkable robustness and performance resilience. EquiPNAS is freely available at https://github.com/Bhattacharya-Lab/EquiPNAS.
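The E(3) equivariance mentioned in the abstract can be illustrated with a minimal sketch. The code below is not the EquiPNAS implementation; it is a simplified, generic equivariant message-passing step written with NumPy. The function name `egnn_layer`, the weight matrices, the toy dimensions, and the random features standing in for pLM embeddings are all assumptions made for illustration only.

```python
import numpy as np

def egnn_layer(h, x, edges, W_e, W_x, W_h):
    """One simplified E(3)-equivariant message-passing step (illustrative only).

    h: (n, d) invariant per-residue features (hypothetically, pLM embeddings)
    x: (n, 3) residue coordinates
    edges: list of directed (i, j) pairs
    """
    n, _ = h.shape
    m_agg = np.zeros((n, W_e.shape[1]))
    x_new = x.copy()
    for i, j in edges:
        diff = x[i] - x[j]
        dist2 = np.dot(diff, diff)  # unchanged by rotation, reflection, translation
        # Messages depend only on invariant quantities.
        m_ij = np.tanh(np.concatenate([h[i], h[j], [dist2]]) @ W_e)
        m_agg[i] += m_ij
        # Coordinates move along relative vectors, scaled by an invariant scalar.
        x_new[i] += diff * np.tanh(m_ij @ W_x)
    h_new = np.tanh(np.concatenate([h, m_agg], axis=1) @ W_h)
    return h_new, x_new

# Toy setup: 5 nodes, fully connected graph, random weights (all hypothetical).
rng = np.random.default_rng(0)
n, d, k = 5, 4, 8
h = rng.normal(size=(n, d))
x = rng.normal(size=(n, 3))
edges = [(i, j) for i in range(n) for j in range(n) if i != j]
W_e = rng.normal(size=(2 * d + 1, k)) * 0.1
W_x = rng.normal(size=k) * 0.1
W_h = rng.normal(size=(d + k, d)) * 0.1

# Random orthogonal matrix (rotation or reflection) and random translation.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
t = rng.normal(size=3)

h1, x1 = egnn_layer(h, x, edges, W_e, W_x, W_h)
h2, x2 = egnn_layer(h, x @ Q.T + t, edges, W_e, W_x, W_h)

features_invariant = np.allclose(h1, h2)
coords_equivariant = np.allclose(x1 @ Q.T + t, x2)
print(features_invariant, coords_equivariant)
```

The two flags check the defining property: transforming the input coordinates by a rotation, reflection, or translation leaves the per-node features unchanged and moves the updated coordinates by the same transform. A model built from such layers does not depend on how the input structure happens to be oriented in space, which is one plausible reading of the "symmetry-aware" robustness the abstract describes.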