School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, 230036, China.
Comput Biol Med. 2024 Oct;181:109048. doi: 10.1016/j.compbiomed.2024.109048. Epub 2024 Aug 24.
Neuropeptides are the most ubiquitous neurotransmitters in the immune system and regulate various biological processes. They play a significant role in the discovery of new drugs and targets for nervous system disorders. Traditional experimental methods for identifying neuropeptides are time-consuming and costly. Although several computational methods have been developed to predict neuropeptides, their accuracy remains unsatisfactory because the extracted features have limited representational power. In this work, we propose an efficient and interpretable model, NeuroPpred-SHE, which predicts neuropeptides by selecting the optimal feature subset from both hand-crafted features and embeddings of a protein language model. Specifically, we first employed a pre-trained T5 protein language model to extract embedding features and twelve other encoding methods to extract hand-crafted features from peptide sequences. Second, we fused the embedding features and hand-crafted features to enhance feature representability. Third, we used random forest (RF), Max-Relevance and Min-Redundancy (mRMR), and eXtreme Gradient Boosting (XGBoost) methods to select the optimal feature subset from the fused features. Finally, we employed five machine learning methods (GBDT, XGBoost, SVM, MLP, and LightGBM) to build the models. Our results show that the GBDT-based model achieves the best performance. Furthermore, when compared with other state-of-the-art methods on an independent test set, our final model achieves an AUROC of 97.8%, higher than all the other state-of-the-art predictors. Our model is available at: https://github.com/wenjean/NeuroPpred-SHE.
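The feature-fusion and evaluation steps described above can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not name the twelve hand-crafted encodings, so amino acid composition is used here purely as an assumed example of one such encoding. The 4-dimensional `fake_embedding` is a hypothetical stand-in for a real T5 embedding, fusion by plain concatenation is an assumption, and the labels and scores are made-up toy values. The feature-selection and GBDT training steps are omitted.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so every vector is comparable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the fraction of each standard residue (a 20-dimensional vector).

    Assumed here as one example of a hand-crafted encoding.
    """
    counts = Counter(sequence.upper())
    length = max(len(sequence), 1)  # avoid division by zero on empty input
    return [counts[aa] / length for aa in AMINO_ACIDS]


def fuse_features(embedding, handcrafted):
    """Fuse two feature vectors by concatenation (an assumed fusion scheme)."""
    return list(embedding) + list(handcrafted)


def auroc(labels, scores):
    """AUROC as the probability that a positive outscores a negative.

    Ties between a positive and a negative count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Toy inputs: a short example peptide and a hypothetical embedding vector.
peptide = "YGGFM"
fake_embedding = [0.1, -0.2, 0.3, 0.0]
fused = fuse_features(fake_embedding, amino_acid_composition(peptide))
print(len(fused))  # 24 = 4 embedding dimensions + 20 composition values

# Toy labels (1 = neuropeptide) and made-up classifier scores.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(auroc(labels, scores))  # 0.75
```

In a fuller pipeline, the fused vectors would pass through feature selection before a classifier such as GBDT is trained, and the classifier's predicted probabilities would replace the toy `scores`.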