Bragança Hendrio, Rocha Vanderson, Barcellos Lucas, Souto Eduardo, Kreutz Diego, Feitosa Eduardo
Institute of Computing, Federal University of Amazonas, Amazonas, Brazil.
Federal University of Pampa, Rio Grande do Sul, Brazil.
Data Brief. 2023 Nov 2;51:109750. doi: 10.1016/j.dib.2023.109750. eCollection 2023 Dec.
High-quality datasets are crucial for building realistic, high-performance supervised malware detection models. One of the major challenges currently facing machine learning-based solutions is the scarcity of datasets that are both representative and of high quality. To foster future research and provide up-to-date public data for comprehensive evaluation and comparison of existing classifiers, we introduce the MH-100K dataset [1], an extensive collection of Android malware information comprising 101,975 samples. It includes a main CSV file with valuable metadata: the SHA256 hash (the APK's signature), file name, package name, Android's official compilation API, 166 permissions, 24,417 API calls, and 250 intents. In addition, the MH-100K dataset provides an extensive collection of files containing useful metadata from the VirusTotal analysis of the samples. This repository of information can serve future research by enabling the analysis of antivirus scan result patterns to discern the prevalence and behaviour of various malware families. Such analysis can help extend existing malware taxonomies, identify novel variants, and explore how malware evolves over time.
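As a rough illustration of how a file laid out like the main CSV might be consumed, the sketch below separates per-sample metadata from binary feature columns and counts the active features in each group. The column names (`sha256`, `file_name`, `package_name`, `api_level`), the `permission::`, `apicall::`, and `intent::` prefixes, and the "0"/"1" encoding are assumptions made for this example, not the dataset's documented schema, and the rows are made up; check the header of the real file before adapting it.

```python
import csv
import io
from collections import Counter

# Hypothetical column names; check the header of the real CSV before use.
METADATA_COLUMNS = ("sha256", "file_name", "package_name", "api_level")
# Hypothetical prefixes marking the three feature groups.
FEATURE_PREFIXES = ("permission::", "apicall::", "intent::")


def read_samples(handle):
    """Yield (metadata, active_features) for each row of a feature CSV."""
    reader = csv.DictReader(handle)
    for row in reader:
        metadata = {name: row[name] for name in METADATA_COLUMNS}
        # Binary feature columns are assumed to hold "0" or "1".
        active = {
            name for name, value in row.items()
            if name.startswith(FEATURE_PREFIXES) and value == "1"
        }
        yield metadata, active


def count_by_group(samples):
    """Count active features per group (permission, apicall, intent)."""
    counts = Counter()
    for _, active in samples:
        for name in active:
            counts[name.split("::", 1)[0]] += 1
    return counts


# Tiny in-memory stand-in for the main CSV file (made-up rows).
SAMPLE_CSV = """sha256,file_name,package_name,api_level,permission::INTERNET,apicall::getDeviceId,intent::BOOT_COMPLETED
aaa,one.apk,com.example.one,28,1,0,1
bbb,two.apk,com.example.two,30,1,1,0
"""

samples = list(read_samples(io.StringIO(SAMPLE_CSV)))
group_counts = count_by_group(samples)
print(group_counts)
```

Reading row by row with `csv.DictReader` keeps memory use modest, which matters for a table with roughly 25,000 feature columns; a columnar library would be a reasonable alternative when the whole matrix is needed at once.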