Panda Binayak, Bisoyi Sudhanshu Shekhar, Panigrahy Sidhanta, Mohanty Prithviraj
Department of Computer Science and Engineering, Institute of Technical Education and Research, Siksha 'O' Anusandhan (Deemed to be) University, Bhubaneswar, Odisha, India.
Department of Computer Science and Information Technology, Institute of Technical Education and Research, Siksha 'O' Anusandhan (Deemed to be) University, Bhubaneswar, Odisha, India.
PeerJ Comput Sci. 2025 Mar 25;11:e2752. doi: 10.7717/peerj-cs.2752. eCollection 2025.
Detecting polymorphic or metamorphic variants of known malware is a growing challenge, as is detecting entirely new malware. As the number of malware variants proliferates, artificial intelligence techniques are increasingly preferred over conventional signature-based detection. This article proposes an Adaptive Multiclass Malware Classification (AMMC) framework that trains base machine learning models to detect malware using fewer computational resources. It also proposes a novel adaptive feature selection (AFS) technique that applies a greedy strategy to term frequency-inverse document frequency (TF-IDF) feature weights in order to select influential features and improve performance metrics in imbalanced multiclass malware classification problems. To assess the efficacy of AMMC with AFS, three open, imbalanced multiclass malware datasets built on Windows API sequence features were used: VirusShare (eight classes), VirusSample (six classes), and MAL-API-2019 (eight classes). Experimental results demonstrate the effectiveness of AMMC with AFS, which achieved state-of-the-art performance on VirusShare, VirusSample, and MAL-API-2019, with macro F1-scores of 0.92, 0.94, and 0.84 and macro areas under the curve (AUC) of 0.99, 0.99, and 0.98, respectively.
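The abstract does not spell out how AFS works, so the following is only a minimal sketch of one way a greedy selection over TF-IDF feature weights could look, not the authors' algorithm. It treats each malware sample as a sequence of Windows API call names, sums a smoothed TF-IDF weight per API call across samples, and then greedily keeps the heaviest calls until they account for a chosen share of the total weight. The function names `tfidf_weights` and `greedy_select`, the `coverage` stopping rule, the smoothed IDF formula, and the toy API sequences are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return {term: TF-IDF weight summed over all documents}.

    Each document is a list of tokens (here, API call names).
    Aggregating by sum is an assumption; other aggregates are possible.
    """
    docs = [doc for doc in docs if doc]  # skip empty sequences
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    totals = Counter()
    for doc in docs:
        for term, count in Counter(doc).items():
            tf = count / len(doc)
            # Smoothed IDF, so terms present in every document keep weight 1.
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1
            totals[term] += tf * idf
    return dict(totals)


def greedy_select(weights, coverage=0.9):
    """Greedily keep the heaviest terms until they reach `coverage`
    (a hypothetical threshold) of the total weight."""
    threshold = coverage * sum(weights.values())
    selected, running = [], 0.0
    # Heaviest first; ties broken alphabetically for determinism.
    for term, weight in sorted(weights.items(), key=lambda kv: (-kv[1], kv[0])):
        if running >= threshold:
            break
        selected.append(term)
        running += weight
    return selected


# Toy API call sequences, invented for illustration only.
samples = [
    ["CreateFileW", "WriteFile", "CloseHandle"],
    ["CreateFileW", "ReadFile", "CloseHandle"],
    ["RegOpenKeyExW", "RegSetValueExW", "CloseHandle"],
]
features = greedy_select(tfidf_weights(samples), coverage=0.8)
```

The selected names in `features` would then define the columns of the feature matrix passed to the base classifiers. Making the stopping rule depend on the weight distribution, rather than on a fixed feature count, is one plausible reading of "adaptive"; the paper itself should be consulted for the actual criterion.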