Hasan Rakibul, Biswas Barna, Samiun Md, Saleh Mohammad Abu, Prabha Mani, Akter Jahanara, Joya Fatema Haque, Abdullah Masuk
Department of Business Administration, Westcliff University, 17877 Von Karman Ave 4th Floor, Irvine, CA, 92614, USA.
Department of Business Administration, International American University, 3440 Wilshire Blvd STE 1000, Los Angeles, CA, 90010, USA.
Sci Rep. 2025 Mar 17;15(1):9122. doi: 10.1038/s41598-025-93447-x.
The increasing prevalence of malware presents a critical challenge to cybersecurity, emphasizing the need for robust detection methods. This study uses a binary tabular classification dataset to evaluate the impact of feature selection, feature scaling, and machine learning (ML) models on malware detection. The methodology involves experimenting with three feature scaling techniques (no scaling, normalization, and min-max scaling), three feature selection methods (no selection, Linear Discriminant Analysis (LDA), and Principal Component Analysis (PCA)), and twelve ML models, including traditional algorithms and ensemble methods. A publicly available dataset with 11,598 samples and 139 features is utilized, and model performance is assessed using metrics such as accuracy, precision, recall, F1-score, and AUC-ROC. Results reveal that the Light Gradient Boosting Machine (LGBM) achieves the highest accuracy of 97.16% when PCA and either min-max scaling or normalization are applied. Additionally, ensemble models consistently outperform traditional ML models, demonstrating their effectiveness in enhancing malware detection. These findings offer valuable insights into optimizing preprocessing and model selection strategies for developing reliable and efficient malware detection systems.
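The experimental design described above (three scaling options, three feature selection options, twelve models, scored with standard binary classification metrics) can be sketched in a few lines. This is only an illustrative outline, not the authors' code: the abstract names only LGBM, so the other eleven model labels are placeholders, and the helper names `experiment_grid`, `min_max_scale`, and `binary_metrics` are hypothetical. It uses only the Python standard library and does not train any model.

```python
from itertools import product

# Options named in the abstract.
SCALERS = ["none", "normalization", "min-max"]
REDUCERS = ["none", "LDA", "PCA"]
# Assumption: only LGBM is named in the abstract; the remaining
# eleven labels are placeholders, not the models used in the study.
MODELS = ["LGBM"] + [f"model_{i}" for i in range(2, 13)]


def experiment_grid():
    """Enumerate every (scaler, reducer, model) combination."""
    return list(product(SCALERS, REDUCERS, MODELS))


def min_max_scale(column):
    """Rescale one feature column linearly onto [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


if __name__ == "__main__":
    print(len(experiment_grid()), "configurations")
    print(min_max_scale([2, 4, 6]))
    print(binary_metrics([1, 1, 0, 0], [1, 0, 0, 1]))
```

In a real run, each configuration would fit the chosen scaler and reducer on the training split only and then score the model on held-out data. AUC-ROC is left out of this sketch because it needs predicted scores rather than hard labels.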