School of Artificial Intelligence, Jilin University, Changchun, China.
Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Canada.
Commun Biol. 2024 Jun 3;7(1):679. doi: 10.1038/s42003-024-06332-0.
Proteins and nucleic acids are essential components of living organisms that interact in critical cellular processes. Accurate prediction of nucleic acid-binding residues in proteins can contribute to a better understanding of protein function. However, the gap between available protein sequence information and experimentally obtained structural and functional data renders most current computational models ineffective. It is therefore vital to design computational models that identify nucleic acid-binding sites in proteins from sequence information alone. Here, we present SOFB, an ensemble deep learning method for identifying nucleic acid-binding residues in proteins. SOFB characterizes protein sequences by learning the semantics of biological dynamic contexts, and then uses an ensemble deep learning-based sequence network to learn feature representations and perform classification by explicitly modeling dynamic semantic information. Within this framework, the language learning model, which is carried over from natural language to biological language, captures the underlying relationships in protein sequences, while the sequence network, which combines different convolutional layers with a Bi-LSTM, refines the resulting features for optimal performance. To address data imbalance, we adopt ensemble learning: multiple models are trained and then combined. Experimental results on several DNA- and RNA-binding residue datasets demonstrate that the proposed model outperforms other state-of-the-art methods. In addition, we conduct an interpretability analysis of the identified nucleic acid-binding residues based on the attention weights of the language learning model, revealing novel insights into the dynamic semantic information that supports these predictions.
SOFB is available at https://github.com/Encryptional/SOFB and https://figshare.com/articles/online_resource/SOFB_figshare_rar/25499452.
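The abstract states that imbalance is handled by training multiple models and combining them, but it does not say how the training data are divided or how the outputs are merged. The following is a minimal sketch of one common scheme, given purely as an illustration and not as the SOFB implementation: non-binding residues are split into chunks about the size of the binding set, each chunk is paired with all binding residues to form one balanced training subset, and the per-residue probabilities of the resulting models are averaged. The function names `balanced_subsets` and `average_probabilities`, the round-robin split, and the averaging rule are all assumptions.

```python
import random
from typing import List, Sequence


def balanced_subsets(labels: Sequence[int], seed: int = 0) -> List[List[int]]:
    """Build roughly class-balanced training subsets (hypothetical helper).

    Each subset holds the indices of all binding residues (label 1) plus one
    chunk of non-binding residues (label 0) of about the same size.
    """
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")

    rng = random.Random(seed)
    rng.shuffle(negatives)

    # One model per chunk of negatives; at least one model overall.
    n_models = max(1, len(negatives) // len(positives))

    # Deal the shuffled negatives round-robin so each is used exactly once.
    return [sorted(positives + negatives[k::n_models]) for k in range(n_models)]


def average_probabilities(per_model: Sequence[Sequence[float]]) -> List[float]:
    """Combine models by averaging their per-residue binding probabilities.

    Averaging is an assumed combination rule, chosen here for simplicity.
    """
    if not per_model:
        raise ValueError("at least one model output is required")
    n_models = len(per_model)
    return [sum(column) / n_models for column in zip(*per_model)]


if __name__ == "__main__":
    # Toy per-residue labels: 2 binding residues among 10.
    toy_labels = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
    for subset in balanced_subsets(toy_labels):
        print(subset)
```

In this sketch each model would be trained on one subset, so every model sees all binding residues while the non-binding residues are shared out among the models without being discarded.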