Marques Cláudio, Malta Silvestre, Magalhães João Paulo
Escola Superior de Tecnologia e Gestão, Politécnico de Viana do Castelo, Viana do Castelo 4900-348, Portugal.
ADiT-Lab, Escola Superior de Tecnologia e Gestão, Politécnico de Viana do Castelo, Viana do Castelo 4900-348, Portugal.
Data Brief. 2021 Sep 4;38:107342. doi: 10.1016/j.dib.2021.107342. eCollection 2021 Oct.
The Domain Name System (DNS) is central to the functioning of the internet. Just as organizations use domain names to provide access to their computational services, malicious actors use domain names to point to services under their control. Distinguishing between non-malicious and malicious domain names is extremely important, as it allows access to external services to be granted or blocked, maximizing the security of the organization and its users. Many DNS firewall solutions exist today. Most are based on lists of known malicious domains that are constantly updated. However, this approach can only block known malicious communications, leaving out many others that may be malicious but are not yet known. Adopting machine learning to classify domains contributes to the detection of domains that are not yet on the block list. The dataset described in this manuscript is meant for supervised machine learning-based analysis of malicious and non-malicious domain names. The dataset was created from scratch, using publicly available DNS logs of both malicious and non-malicious domain names. Using the domain name as input, 34 features were obtained. Features such as the domain name entropy, number of strange characters, and domain name length were obtained directly from the domain name. Other features, such as the domain name creation date, Internet Protocol (IP) address, open ports, and geolocation, were obtained from data enrichment processes (e.g. Open Source Intelligence (OSINT)). The class was determined from the data source (malicious DNS log files and non-malicious DNS log files). The dataset consists of data from approximately 90,000 domain names and is balanced, with 50% non-malicious and 50% malicious domain names.
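The features taken directly from the domain name can be illustrated with a minimal Python sketch. It is not the authors' extraction code: the function names, the dictionary keys, and the definition of a "strange" character (anything other than a lowercase letter, digit, hyphen, or dot) are assumptions made for illustration, and the entropy shown is the standard Shannon entropy over the characters of the name.

```python
import math
from collections import Counter

# Assumption: characters outside this set count as "strange".
# The abstract does not define the term, so this is illustrative only.
ALLOWED_CHARS = set("abcdefghijklmnopqrstuvwxyz0123456789-.")


def shannon_entropy(text: str) -> float:
    """Shannon entropy (bits per character) of the character distribution."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def lexical_features(domain: str) -> dict:
    """Compute three lexical features from a domain name (hypothetical helper)."""
    name = domain.strip().lower()
    return {
        "length": len(name),
        "entropy": shannon_entropy(name),
        "strange_chars": sum(1 for ch in name if ch not in ALLOWED_CHARS),
    }


if __name__ == "__main__":
    for d in ("example.com", "xk2j9qv7zt1.biz"):
        print(d, lexical_features(d))
```

Features such as the creation date, IP address, open ports, and geolocation depend on external lookups and are therefore not covered by this sketch.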