IEEE/ACM Trans Comput Biol Bioinform. 2023 Nov-Dec;20(6):3809-3819. doi: 10.1109/TCBB.2023.3323295. Epub 2023 Dec 25.
Bioactive peptides are peptide sequences within a protein that can regulate important bodily functions through a wide range of activities. With the development of machine learning, more computational methods have been proposed for bioactive peptide recognition, so that the task no longer relies solely on tedious and time-consuming wet-lab experiments. However, the training and testing of existing models are limited to small datasets, which affects model performance. Inspired by the success of pre-training on unlabeled data for sequence classification in natural language processing, we propose a pre-training method for bioactive peptide recognition. Pre-trained on large-scale protein sequences, our method achieved the best performance in multi-functional peptide identification, covering anti-cancer, anti-diabetic, anti-hypertensive, anti-inflammatory and anti-microbial peptides. Compared with an advanced existing model, our model improves precision, coverage, accuracy and absolute true by 7.2%, 6.9%, 6.1% and 4.2%, respectively, in 5-fold cross-validation. In addition, the results indicate that the model has superior prediction performance in single-functional peptide recognition, especially for anti-cancer and anti-microbial peptides, which have longer sequences.
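The abstract reports four multi-label metrics (precision, coverage, accuracy and absolute true) without defining them. The sketch below assumes the example-based definitions commonly used in multi-functional peptide prediction, where each metric is averaged over peptides and computed from the overlap between the true and predicted label sets. The function name `multilabel_metrics` and the label abbreviations (ACP, AMP, ADP, AHP, AIP) are hypothetical and chosen only for illustration; the paper's exact formulas may differ.

```python
from typing import Dict, Sequence, Set


def multilabel_metrics(
    true_sets: Sequence[Set[str]], pred_sets: Sequence[Set[str]]
) -> Dict[str, float]:
    """Example-based multi-label metrics, averaged over all samples.

    Assumed definitions (not taken from the paper itself):
      precision     = mean(|Y & P| / |P|)
      coverage      = mean(|Y & P| / |Y|)
      accuracy      = mean(|Y & P| / |Y | P|)
      absolute_true = mean(1 if Y == P else 0)
    """
    if len(true_sets) != len(pred_sets) or len(true_sets) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(true_sets)
    precision = coverage = accuracy = absolute_true = 0.0
    for y, p in zip(true_sets, pred_sets):
        inter = len(y & p)
        union = len(y | p)
        # Empty denominators contribute 0, except that two empty sets
        # are treated as a perfect match for accuracy (a convention chosen here).
        precision += inter / len(p) if p else 0.0
        coverage += inter / len(y) if y else 0.0
        accuracy += inter / union if union else 1.0
        absolute_true += 1.0 if y == p else 0.0

    return {
        "precision": precision / n,
        "coverage": coverage / n,
        "accuracy": accuracy / n,
        "absolute_true": absolute_true / n,
    }


# Toy example with made-up labels, not data from the paper.
y_true = [{"ACP", "AMP"}, {"ADP"}, {"AHP"}]
y_pred = [{"ACP"}, {"ADP"}, {"AHP", "AIP"}]
scores = multilabel_metrics(y_true, y_pred)
```

Under these assumed definitions, absolute true is the strictest of the four because a peptide counts only when its predicted label set matches the true set exactly. This would be consistent with absolute true showing the smallest of the four reported improvements, although the abstract does not say so.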