UniDL4BioPep: a universal deep learning architecture for binary classification in peptide bioactivity.

Author information

Department of Grain Science and Industry, Kansas State University, Manhattan, KS 66506, USA.

Department of Computer Science, Kansas State University, Manhattan, KS 66506, USA.

Publication information

Brief Bioinform. 2023 May 19;24(3). doi: 10.1093/bib/bbad135.

DOI:10.1093/bib/bbad135

PMID:37020337

Abstract

Identification of potent peptides through model prediction can reduce benchwork in wet experiments. However, the conventional process of model building can be complex and time-consuming due to challenges such as peptide representation, feature selection, model selection and hyperparameter tuning. Recently, advanced pretrained deep learning-based language models (LMs) have been released for protein sequence embedding and applied to structure and function prediction. Based on these developments, we have developed UniDL4BioPep, a universal deep-learning model architecture for transfer learning in bioactive peptide binary classification modeling. It can directly assist users in training a high-performance deep-learning model with a fixed architecture and achieves cutting-edge performance to meet the demand for efficient discovery of novel bioactive peptides. To the best of our knowledge, this is the first time that a pretrained biological language model has been used for peptide embeddings and has successfully predicted peptide bioactivities in large-scale evaluations of those embeddings. The model was also validated through uniform manifold approximation and projection (UMAP) analysis. By combining the LM with a convolutional neural network (CNN), UniDL4BioPep outperformed the respective state-of-the-art models on 15 of 20 bioactivity dataset prediction tasks. The accuracy, Matthews correlation coefficient (MCC) and area under the curve (AUC) were 0.7-7%, 1.23-26.7% and 0.3-25.6% higher, respectively. A user-friendly web server of UniDL4BioPep for the tested bioactivities has been established and is freely accessible at https://nepc2pvmzy.us-east-1.awsapprunner.com. The source code, datasets and templates of UniDL4BioPep for other bioactivity fitting and prediction tasks are available at https://github.com/dzjxzyd/UniDL4BioPep.
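For readers unfamiliar with the three metrics reported above, the following is a minimal sketch of how accuracy, the Matthews correlation coefficient and the area under the ROC curve can be computed for a binary classifier. It is not taken from the UniDL4BioPep repository; the function names, the 0.5 decision threshold and the toy labels and scores are assumptions made purely for illustration.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels encoded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    tp, tn, _, _ = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def matthews_corrcoef(y_true, y_pred):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is taken as 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random
    negative, counting ties as one half (pairwise formulation)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical, not from the paper): 1 = bioactive, 0 = inactive.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]       # predicted probabilities
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

acc = accuracy(y_true, y_pred)
mcc = matthews_corrcoef(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(f"accuracy={acc:.3f}  MCC={mcc:.3f}  AUC={auc:.3f}")
```

Accuracy and MCC depend on the chosen decision threshold, whereas AUC is computed from the raw scores, which is one reason the three are usually reported together for imbalanced peptide datasets.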
