School of Information Science and Engineering, University of Jinan, 250022 Jinan, Shandong, China.
Laboratory of Zoology, Graduate School of Bioresource and Bioenvironmental Sciences, Kyushu University, Fukuoka-shi, 819-0395 Fukuoka, Japan.
Front Biosci (Landmark Ed). 2023 Dec 1;28(12):322. doi: 10.31083/j.fbl2812322.
Peroxisomes are membrane-bound organelles that contain one or more types of oxidative enzymes. Aberrant localization of peroxisomal proteins can contribute to the development of various diseases. To more accurately identify and locate peroxisomal proteins, we developed the ProSE-Pero model.
We extracted features of peroxisomal proteins with three deep representation learning models and compared their performance. We then balanced the dataset with SVMSMOTE and used SHAP model interpretation, analysis of variance (ANOVA), and the light gradient boosting machine (LightGBM) to select and compare the extracted features. Finally, we trained and evaluated several traditional machine learning classifiers and four deep learning models on a dataset of 160 peroxisomal proteins using tenfold cross-validation.
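To illustrate the ANOVA step of the feature selection described above, the following is a minimal sketch of ranking features by their one-way ANOVA F statistic. It is not the authors' implementation: the function names `anova_f_score` and `rank_features` and the toy data are hypothetical, and the real pipeline would apply this to embedding vectors produced by the representation learning models.

```python
from statistics import mean


def anova_f_score(values, labels):
    """One-way ANOVA F statistic of a single feature across class labels."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    n, k = len(values), len(groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two classes and more samples than classes")
    grand_mean = mean(values)
    # Variation of the class means around the overall mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    # Variation of samples around their own class mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def rank_features(X, y):
    """Return feature indices sorted by descending F score, plus the scores."""
    n_features = len(X[0])
    scores = [anova_f_score([row[j] for row in X], y) for j in range(n_features)]
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return order, scores


# Hypothetical toy data: feature 0 separates the classes, feature 1 does not.
X = [[1.0, 5.0], [1.2, 3.0], [0.9, 4.0], [3.0, 4.5], [3.1, 3.5], [2.9, 4.0]]
y = [0, 0, 0, 1, 1, 1]
order, scores = rank_features(X, y)
```

Features with a high F score differ strongly between classes relative to their within-class spread, so keeping only the top-ranked ones is one simple way to reduce a high-dimensional embedding before classification.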
Our proposed ProSE-Pero model achieves high performance with a specificity (Sp) of 93.37%, a sensitivity (Sn) of 82.41%, an accuracy (Acc) of 95.77%, a Matthews correlation coefficient (MCC) of 0.8241, an F1 score of 0.8996, and an area under the curve (AUC) of 0.9818. Additionally, we extended our method to identify plant vacuole proteins and achieved an accuracy of 91.90% on the independent test set, which is approximately 5% higher than the latest iPVP-DRLF model.
Our model surpasses the existing In-Pero model in the localization and identification of peroxisomal proteins. Our study also shows that the pre-trained multitask language model ProSE extracts features from protein sequences effectively. Given its performance and its generalization to plant vacuole proteins, the approach could be extended in future work to the localization and identification of proteins in other organelles, such as mitochondria and the Golgi apparatus.