Su Ming-Yang, Su Kuan-Lin
Department of Computer Science and Information Engineering, Ming Chuan University, Taoyuan City 333, Taiwan.
Sensors (Basel). 2023 Oct 16;23(20):8499. doi: 10.3390/s23208499.
Malicious uniform resource locators (URLs) are prevalent in cyberattacks, particularly in phishing attempts aimed at stealing sensitive information or distributing malware, so accurate detection of malicious URLs is essential. Prior research has explored deep-learning models for this task: URL strings are segmented into character-level or word-level tokens, the tokens are embedded, and a trained model then distinguishes malicious from benign URLs. In this study, a model based on Bidirectional Encoder Representations from Transformers (BERT) was used to tokenize URL strings, and its self-attention mechanism was employed to better capture correlations among tokens. A classifier then determined whether a given URL was malicious. The proposed method was evaluated on three types of public datasets: one consisting solely of URL strings (from Kaggle), one containing only URL features (from GitHub), and one including both types of data (ISCX 2016, from the University of New Brunswick). The proposed system achieved accuracies of 98.78%, 96.71%, and 99.98% on these three datasets, respectively. Additional experiments were conducted on two datasets from other domains, the Internet of Things (IoT) and Domain Name System over HTTPS (DoH), to demonstrate the versatility of the proposed model.
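To make the tokenization granularities mentioned above concrete, the following minimal Python sketch contrasts character-level and word-level splitting of a URL with a greedy longest-match subword split in the style of the WordPiece scheme used by standard BERT tokenizers. It is an illustration under stated assumptions, not the authors' implementation: the function names, the example URL, and the toy vocabulary are all hypothetical, and a real BERT tokenizer uses a learned vocabulary of many thousands of entries.

```python
import re


def char_tokens(url: str) -> list[str]:
    """Character-level tokenization: every character is one token."""
    return list(url)


def word_tokens(url: str) -> list[str]:
    """Word-level tokenization: split on non-alphanumeric characters
    and keep each delimiter as its own token."""
    return [t for t in re.split(r"([^A-Za-z0-9])", url) if t]


def wordpiece_tokens(word: str, vocab: set[str], unk: str = "[UNK]") -> list[str]:
    """Greedy longest-match-first subword split (WordPiece style).

    Pieces that do not start the word carry a '##' prefix. If some part
    of the word cannot be matched, the whole word maps to `unk`.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


if __name__ == "__main__":
    # Hypothetical URL and toy vocabulary, for illustration only.
    url = "http://login.example.com/a?id=1"
    toy_vocab = {"log", "##in", "pay", "##pal"}

    print(char_tokens(url)[:7])
    print(word_tokens(url))
    print(wordpiece_tokens("login", toy_vocab))
```

Subword splitting sits between the two earlier granularities: frequent fragments stay whole, while unfamiliar strings fall back to smaller pieces or an unknown token, and the resulting token sequence is what a self-attention model then relates position by position.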