Tian Ye, Dai Xin, Li Zhijun, Guo Hong, Mao Xiao
Academy of Forensic Science, Shanghai, China.
Shanghai Forensic Service Platform, Key Laboratory of Forensic Science, Ministry of Justice, Shanghai, China.
PLoS One. 2025 Sep 3;20(9):e0331574. doi: 10.1371/journal.pone.0331574. eCollection 2025.
With the widespread adoption of internet technologies and email, the exponential growth in email usage has brought a corresponding surge in spam. These unsolicited messages waste users' time through information overload and pose significant cybersecurity threats through malware distribution and phishing, jeopardizing both digital security and user experience. Effective spam detection is therefore a cornerstone of modern cybersecurity infrastructure. Through empirical analysis of machine learning (ML) performance on publicly available spam datasets, we found that ensemble methods consistently outperform individual models in detection accuracy. We propose an optimized stacking ensemble framework that combines the predictions of four heterogeneous base models, namely a naive Bayes classifier (NBC), k-nearest neighbors (k-NN), logistic regression (LR), and extreme gradient boosting (XGBoost), through a meta-learner. Our methodology uses grid search with cross-validation over the hyperparameter space to systematically identify the parameter configurations that maximize detection performance. The enhanced model was evaluated on accuracy (99.79%), precision, recall, and F1-score, demonstrating statistically significant improvements over both baseline models and existing solutions documented in the literature.
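The stacking-plus-grid-search idea described in the abstract can be illustrated with a minimal scikit-learn sketch. This is not the authors' implementation. Several choices are assumptions made only for illustration: the data are synthetic (`make_classification`) rather than a spam corpus, `GaussianNB` stands in for the unspecified naive Bayes variant, scikit-learn's `GradientBoostingClassifier` is swapped in for XGBoost to avoid an extra dependency, the meta-learner is a logistic regression (the abstract does not name one), and the small parameter grid is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Assumption: synthetic features stand in for vectorized emails (label 1 = spam).
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Four heterogeneous base models, mirroring the families named in the abstract.
# Assumption: GradientBoostingClassifier is a stand-in for XGBoost.
base_models = [
    ("nbc", GaussianNB()),
    ("knn", KNeighborsClassifier()),
    ("lr", LogisticRegression(max_iter=1000)),
    ("gb", GradientBoostingClassifier(n_estimators=50, random_state=0)),
]

# The meta-learner is trained on out-of-fold predictions of the base models.
# Assumption: logistic regression as meta-learner.
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=3,
)

# Hypothetical, deliberately tiny hyperparameter grid; nested parameters are
# addressed as "<estimator name>__<parameter>".
param_grid = {
    "knn__n_neighbors": [3, 7],
    "final_estimator__C": [0.1, 1.0],
}
search = GridSearchCV(stack, param_grid, cv=3, scoring="f1")
search.fit(X_train, y_train)

# Evaluate the best configuration on held-out data with the four metrics
# listed in the abstract.
y_pred = search.predict(X_test)
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
}
print("best parameters:", search.best_params_)
print("held-out metrics:", metrics)
```

On real email data, the feature matrix would typically come from a text vectorizer, and the scores from this toy setup say nothing about the 99.79% accuracy reported above.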