Department of Information Technology, National Institute of Technology, Raipur, G.E.Road Raipur, C.G. -492010, India.
Vaccine. 2024 Jul 11;42(18):3874-3882. doi: 10.1016/j.vaccine.2024.04.078. Epub 2024 May 3.
Reverse vaccinology (RV) is a significant step in rational vaccine design. In recent years, many machine learning (ML) methods have been used to improve RV prediction accuracy. However, ML-based RV still has issues with prediction accuracy and programme accessibility. This paper presents a supervised ML-based method to classify bacterial protective antigens (BPAgs) and to identify the model(s) that consistently perform well on the training dataset. Six ML classifiers are tested with physicochemical features extracted from a comprehensive training dataset. Selecting the best-performing model across different performance metrics (accuracy, precision, recall, F1-score, and AUC-ROC) is not easy, because all the metrics have the same importance for predicting BPAgs. To address this issue, we propose a soft and hard ranking model based on a multi-criteria decision-making (MCDM) approach for selecting the best-performing ML method for classifying BPAgs. First, our proposed model uses homologous proteins (positive and negative samples) from the Protegen and UniProt databases. Second, we apply four strategies of the Synthetic Minority Oversampling Technique and Edited Nearest Neighbour (SMOTE-ENN) to handle the data imbalance problem and train the model using ML methods. Third, we use the MCDM-based Technique for Order Preference by Similarity to the Ideal Solution (TOPSIS) method, integrated with the soft and hard ranking model. Entropy is used to obtain the weights of the evaluation criteria for ranking the models. Our experimental evaluations on benchmark datasets show that the proposed method, with the best-performing models (Random Forest and Extreme Gradient Boosting), outperforms existing open-source RV methods.
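The entropy-weighted TOPSIS step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it covers only generic entropy weighting followed by standard TOPSIS, it does not reproduce the soft and hard ranking model, and it assumes that all five metrics are benefit criteria (higher is better). The model names and metric values in `scores` are invented purely for illustration and are not results from the paper.

```python
import math

# Hypothetical metric values (NOT from the paper); one row per model, columns are
# accuracy, precision, recall, F1-score, AUC-ROC. All are treated as benefit criteria.
scores = {
    "model_a": [0.95, 0.94, 0.96, 0.95, 0.98],
    "model_b": [0.90, 0.92, 0.88, 0.90, 0.95],
    "model_c": [0.85, 0.80, 0.86, 0.83, 0.90],
}


def entropy_weights(matrix):
    """Return one weight per column; columns that vary more across rows weigh more."""
    m, n = len(matrix), len(matrix[0])
    k = 1.0 / math.log(m)  # scales the entropy into [0, 1]
    divergences = []
    for j in range(n):
        column = [row[j] for row in matrix]
        total = sum(column)
        proportions = [value / total for value in column]
        entropy = -k * sum(p * math.log(p) for p in proportions if p > 0)
        divergences.append(max(0.0, 1.0 - entropy))
    spread = sum(divergences)
    if spread == 0:  # every criterion is constant: fall back to equal weights
        return [1.0 / n] * n
    return [d / spread for d in divergences]


def topsis(matrix, weights):
    """Return the relative closeness to the ideal solution for each row."""
    n = len(matrix[0])
    # Vector-normalise each column, then apply the criterion weights.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(n)] for row in matrix]
    ideal = [max(row[j] for row in weighted) for j in range(n)]
    anti_ideal = [min(row[j] for row in weighted) for j in range(n)]
    closeness = []
    for row in weighted:
        d_ideal = math.sqrt(sum((row[j] - ideal[j]) ** 2 for j in range(n)))
        d_anti = math.sqrt(sum((row[j] - anti_ideal[j]) ** 2 for j in range(n)))
        denom = d_ideal + d_anti
        closeness.append(d_anti / denom if denom > 0 else 0.5)
    return closeness


names = list(scores)
matrix = [scores[name] for name in names]
weights = entropy_weights(matrix)
closeness = dict(zip(names, topsis(matrix, weights)))
ranking = sorted(names, key=lambda name: closeness[name], reverse=True)
print(ranking)
```

In this toy input one model is best on every metric and another is worst on every metric, so the ranking is unambiguous; the weighting only matters when models trade off against each other across metrics.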