Ahmed Fee Faysal, Podder Anamika, Bulbul Md Farhad, Hossain Md Amzad, Hasan Mahedi, Sarkar Md Abdur Rauf, Kim Daijin
Department of Mathematics, Jashore University of Science and Technology, Jashore, 7408, Bangladesh.
Department of Computer Science & Engineering, Pohang University of Science and Technology (POSTECH), 77 Cheongam, Pohang 37673, Korea.
Comb Chem High Throughput Screen. 2024;27(9):1381-1393. doi: 10.2174/1386207326666230912151932.
Predicting protein citrullination sites (PCSs) is essential for elucidating the detailed mechanisms of citrullination at the molecular level and for designing drugs applicable to major human diseases. Identifying PCSs experimentally is time-consuming and costly. Current PCS predictors, however, are limited in scope; in particular, the most commonly used predictors achieve limited performance scores.
This work aims to provide an improved, more sophisticated predictor of citrullination sites, built on a benchmark dataset in a machine learning platform.
This study presents a reliable citrullination site predictor based on a benchmark dataset containing a 1:1 ratio of positive and negative samples. We classified citrullination sites using Composition of K-Spaced Amino Acid Pairs (CKSAAP) features and a Support Vector Machine (SVM).
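As an illustration of the CKSAAP encoding named above, the following is a minimal sketch and not the authors' implementation. It assumes the usual formulation in which, for each gap k, the 400 ordered amino acid pairs separated by k residues are counted within a sequence window and divided by the number of pair positions. The function name `cksaap`, the `k_max` default, and the example window are hypothetical; the abstract does not state the window length or the range of k used.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered residue pairs, in a fixed order so feature indices are stable.
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def cksaap(window, k_max=2):
    """Return normalized k-spaced pair frequencies for k = 0..k_max.

    `window` is a peptide fragment (hypothetical input); `k_max` is an
    assumed setting, not a value taken from the paper.
    """
    features = []
    for k in range(k_max + 1):
        counts = dict.fromkeys(PAIRS, 0)
        # Number of (i, i + k + 1) positions available in the window.
        n_positions = max(len(window) - k - 1, 0)
        for i in range(n_positions):
            pair = window[i] + window[i + k + 1]
            if pair in counts:  # skip gaps or non-standard residues
                counts[pair] += 1
        denom = n_positions if n_positions > 0 else 1
        features.extend(counts[p] / denom for p in PAIRS)
    return features


vec = cksaap("ACDRACD", k_max=1)
```

Each window yields a vector of 400 × (k_max + 1) values, which could then be passed to an SVM classifier as the feature representation.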
We developed PCS predictors using integrated machine-learning methods that produced the highest average scores. Using 10-fold cross-validation on test datasets, the True Positive Rate (TPR) was 98.34%, the True Negative Rate (TNR) was 99.44%, the accuracy was 98.89%, the Matthews Correlation Coefficient (MCC) was 98.21%, the Area Under the ROC Curve (AUC) was 0.999, and the partial Area Under the ROC Curve (pAUC) was 0.1968.
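For reference, the threshold-based metrics reported above follow from the confusion matrix by their standard definitions. The sketch below is only illustrative: the function name `binary_metrics` and the example counts are hypothetical and are not taken from the study. MCC is computed on its native scale of -1 to 1, whereas the abstract reports it as a percentage.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Compute TPR, TNR, accuracy and MCC from confusion-matrix counts."""
    tpr = tp / (tp + fn)                    # sensitivity / recall
    tnr = tn / (tn + fp)                    # specificity
    acc = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when any marginal is zero; return 0.0 by convention.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"TPR": tpr, "TNR": tnr, "ACC": acc, "MCC": mcc}


# Hypothetical counts, chosen only to exercise the formulas.
metrics = binary_metrics(tp=90, fn=10, tn=80, fp=20)
```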
In overall performance, our predictor significantly outperforms current tools on the same benchmark dataset. Moreover, it showed better performance metrics on both test and training datasets. Our predictor is promising and can serve as a complementary technique for the fast and precise identification of citrullination sites.