Zhu Lun, Chen Zehua, Yang Sen
School of Computer Science and Artificial Intelligence, Aliyun School of Big Data, School of Software, Changzhou University, Changzhou, 213164, China.
The Affiliated Changzhou No. 2 People's Hospital of Nanjing Medical University, Changzhou, 213164, China.
Interdiscip Sci. 2024 Dec 23. doi: 10.1007/s12539-024-00673-4.
Cell-penetrating peptides (CPPs) are crucial carriers for drug delivery. Because synthesizing new CPPs in the laboratory is both time- and resource-consuming, computational methods that predict potential CPPs can help identify candidates and advance the use of CPPs in therapy. This study proposes EnDM-CPP, which combines machine learning algorithms (SVM and CatBoost) with convolutional neural networks (CNN and TextCNN). For dataset construction, three previous CPP benchmark datasets (CPPsite 2.0, MLCPP 2.0, and CPP924) are merged to improve diversity and reduce homology. For feature generation, two language model-based features derived from the Transformer architecture, ProtT5 and ESM-2, are used as input to CNN and TextCNN. In addition, sequence features such as CPRS, Hybrid PseAAC, and KSC are input to SVM and CatBoost. A Logistic Regression (LR) model is then built on the outputs of the individual predictors to make the final decision. The experimental results indicate that the fused ProtT5 and ESM-2 features contribute significantly to CPP prediction and that the employed features and models work better in combination. On an independent test dataset, EnDM-CPP achieved an accuracy of 0.9495 and a Matthews correlation coefficient of 0.9008, improvements of 2.23%-9.48% and 4.32%-19.02%, respectively, over other state-of-the-art methods. Code and data are available at https://github.com/tudou1231/EnDM-CPP.git.
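The stacking step described in the abstract, in which an LR model combines the outputs of the four base predictors and the result is scored by accuracy and Matthews correlation coefficient, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, the base-model probabilities are invented toy values rather than results from the paper, and a plain gradient-descent logistic regression in pure Python stands in for whatever LR implementation the released code uses.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_regression(X, y, lr=0.5, epochs=5000):
    """Fit a meta-level LR by batch gradient descent on the mean log loss."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi)))
            err = p - yi  # derivative of the log loss w.r.t. the logit
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(X, w, b, threshold=0.5):
    """Return 1 (CPP) when the stacked probability reaches the threshold."""
    return [
        1 if sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) >= threshold else 0
        for xi in X
    ]


def accuracy_and_mcc(y_true, y_pred):
    """Accuracy and Matthews correlation coefficient from the confusion matrix."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    acc = (tp + tn) / len(y_true)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, mcc


# Toy meta-features (assumed values): each row holds the CPP probabilities
# from four base predictors, in the order SVM, CatBoost, CNN, TextCNN.
base_probs = [
    [0.90, 0.80, 0.95, 0.85],
    [0.70, 0.60, 0.80, 0.90],
    [0.60, 0.90, 0.70, 0.65],
    [0.80, 0.40, 0.90, 0.70],
    [0.20, 0.30, 0.10, 0.15],
    [0.40, 0.20, 0.30, 0.10],
    [0.10, 0.50, 0.20, 0.30],
    [0.30, 0.10, 0.35, 0.25],
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = CPP, 0 = non-CPP

weights, bias = fit_logistic_regression(base_probs, labels)
stacked_pred = predict(base_probs, weights, bias)
acc, mcc = accuracy_and_mcc(labels, stacked_pred)
print(f"accuracy={acc:.4f}  MCC={mcc:.4f}")
```

In a realistic setting the meta-features would come from out-of-fold predictions of the base models, and the metrics would be computed on a held-out test set rather than on the data used to fit the meta-learner, as is done here only for brevity.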