Machine learning models and dimensionality reduction for improving the Android malware detection.

Author information

Morán Pablo, Robles-Gómez Antonio, Duque Andres, Tobarra Llanos, Pastor-Vargas Rafael

Affiliations

Departamento de Sistemas de Comunicación y Control, Universidad Nacional de Educación a Distancia, Madrid, Spain.

Departamento de Lenguajes y Sistemas Informáticos, Universidad Nacional de Educación a Distancia, Madrid, Spain.

Publication information

PeerJ Comput Sci. 2024 Dec 23;10:e2616. doi: 10.7717/peerj-cs.2616. eCollection 2024.

DOI:10.7717/peerj-cs.2616

PMID:39896377

Original link: https://pmc.ncbi.nlm.nih.gov/articles/PMC11784760/

Abstract

Android is one of the most widely used operating systems for mobile applications, which gives cybercriminals a great number of attack opportunities. It is therefore important to anticipate these attacks. To mitigate the problem, the search for malware in applications is based on machine learning algorithms. Our work takes as its starting point the features proposed by the DREBIN project, which today is a key reference in the literature as the largest public Android malware dataset with labeled families. The DREBIN authors employ only a support vector machine to determine whether a sample is malware. This work first proposes a new, efficient dimensionality reduction of the features and then applies several supervised machine learning algorithms for prediction. Predictive models based on Random Forest achieve the most promising results: they detect an average of 91.72% of malware samples with a very low false positive rate of 0.13%, using only 5,000 features, which is just over 9% of the total number of DREBIN features. The resulting model achieves an accuracy of 99.52%, a total precision of 96.91%, and a macro-average F1-score of 96.99%.
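The metrics quoted in the abstract (detection rate, false positive rate, accuracy, precision, macro-average F1) can all be derived from a binary confusion matrix. The following is a minimal sketch of those standard definitions, not the authors' evaluation code. The names `BinaryCounts` and `summarize` are hypothetical, the counts in the example are made up for illustration and are not the paper's results, and treating "precision" as the precision of the malware class is an assumption.

```python
from dataclasses import dataclass


@dataclass
class BinaryCounts:
    """Confusion-matrix counts with malware as the positive class."""
    tp: int  # malware samples flagged as malware
    fn: int  # malware samples missed
    fp: int  # benign samples flagged as malware
    tn: int  # benign samples correctly passed


def _ratio(num: float, den: float) -> float:
    # Return 0.0 instead of raising when a class has no samples.
    return num / den if den else 0.0


def _f1(precision: float, recall: float) -> float:
    # Harmonic mean of precision and recall.
    return _ratio(2 * precision * recall, precision + recall)


def summarize(c: BinaryCounts) -> dict:
    """Compute the metrics named in the abstract from raw counts."""
    detection_rate = _ratio(c.tp, c.tp + c.fn)   # recall on malware
    false_positive_rate = _ratio(c.fp, c.fp + c.tn)
    accuracy = _ratio(c.tp + c.tn, c.tp + c.fn + c.fp + c.tn)
    precision = _ratio(c.tp, c.tp + c.fp)        # assumption: malware class

    # Per-class values for the benign class, needed for the macro average.
    benign_precision = _ratio(c.tn, c.tn + c.fn)
    benign_recall = _ratio(c.tn, c.tn + c.fp)

    # Macro average: unweighted mean of the per-class F1 scores.
    macro_f1 = (_f1(precision, detection_rate)
                + _f1(benign_precision, benign_recall)) / 2

    return {
        "detection_rate": detection_rate,
        "false_positive_rate": false_positive_rate,
        "accuracy": accuracy,
        "precision": precision,
        "macro_f1": macro_f1,
    }


# Illustrative counts only (hypothetical, not taken from the paper).
example = summarize(BinaryCounts(tp=90, fn=10, fp=5, tn=895))
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

Because benign samples usually far outnumber malware samples in such datasets, accuracy alone can look high even when many malware samples are missed, which is why the detection rate and the macro-average F1-score are reported alongside it.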
