Linear B-cell epitope prediction for SARS and COVID-19 vaccine design: Integrating balanced ensemble learning models and resampling strategies.

Author information

Gurcan Fatih

Affiliation

Department of Management Information Systems, Faculty of Economics and Administrative Sciences, Karadeniz Technical University, Trabzon, Turkey.

Publication information

PeerJ Comput Sci. 2025 Jun 18;11:e2970. doi: 10.7717/peerj-cs.2970. eCollection 2025.

DOI:10.7717/peerj-cs.2970

PMID:40567760

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12193457/

Abstract

This study presents a machine learning framework for improving the prediction of B-cell epitopes and antibody recognition related to Severe Acute Respiratory Syndrome (SARS) and Coronavirus Disease 2019 (COVID-19). To address class imbalance and improve model training, resampling techniques from three families were applied: over-sampling, under-sampling, and hybrid sampling. The rebalanced datasets were then used to train ensemble classifiers based on Boosting, Bagging, and Balancing strategies. Classifier hyperparameters were tuned with GridSearchCV, and features were selected with the recursive feature elimination (RFE) algorithm. Model performance was evaluated with seven metrics: Accuracy, Precision, Recall, F1-score, area under the receiver operating characteristic curve (ROC AUC), area under the precision-recall curve (PR AUC), and Matthews correlation coefficient (MCC). Statistical significance analyses, including the paired t-test, the Wilcoxon test, and permutation tests, confirmed the reliability of the model improvements. To interpret the model's predictive behavior, the peptides predicted with the highest confidence among correctly classified instances were identified as potential epitope candidates. The combination of Synthetic Minority Over-Sampling Technique-Edited Nearest Neighbors (SMOTE-ENN) and ExtraTrees performed best, achieving an ROC AUC score of 0.9899, followed closely by the combination of Instance Hardness Threshold (IHT) and ExtraTrees with a score of 0.9799. These findings highlight the effectiveness of combining resampling methods with balanced ensemble classifiers for B-cell epitope prediction and antibody recognition in SARS and COVID-19 infections. By identifying promising epitope candidates, the study contributes to vaccine development and to immunoinformatics research.
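To illustrate why the abstract reports metrics such as MCC alongside Accuracy on imbalanced data, the following is a minimal sketch in plain Python. It is not the study's code or data: the function names (`confusion_counts`, `threshold_metrics`) and the toy labels (90 non-epitopes, 10 epitopes) are assumptions invented for illustration. It covers only the five threshold-based metrics; ROC AUC and PR AUC are omitted because they require predicted scores rather than hard labels.

```python
from math import sqrt


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = epitope, 0 = non-epitope)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def threshold_metrics(y_true, y_pred):
    """Accuracy, Precision, Recall, F1 and MCC; undefined ratios fall back to 0.0."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical toy labels: 90 non-epitopes followed by 10 epitopes.
y_true = [0] * 90 + [1] * 10

# A degenerate model that always predicts the majority class.
majority_only = [0] * 100

# A hypothetical model with 5 false alarms that recovers 8 of the 10 epitopes.
better = [0] * 85 + [1] * 5 + [1] * 8 + [0] * 2

baseline = threshold_metrics(y_true, majority_only)
improved = threshold_metrics(y_true, better)

for name, scores in (("majority-only", baseline), ("better", improved)):
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

On these toy labels the majority-only predictor reaches 0.90 accuracy while its recall and MCC are both 0, which is the kind of misleading result that resampling and imbalance-aware metrics are meant to expose.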
