School of Cyberspace Security, Beijing University of Posts and Telecommunications, Beijing 100876, China.
School of Computer Science, Beijing University of Technology, Beijing 100124, China.
Sensors (Basel). 2021 Dec 10;21(24):8281. doi: 10.3390/s21248281.
Phishing has become one of the largest and most effective cyber threats, causing hundreds of millions of dollars in losses and millions of data breaches every year. Current anti-phishing techniques require experts to extract features of phishing sites and rely on third-party services to detect them. These techniques have limitations. First, extracting phishing features requires expertise and is time-consuming. Second, relying on third-party services delays detection. This paper therefore proposes an integrated phishing website detection method based on convolutional neural networks (CNN) and random forests (RF). The method predicts the legitimacy of a URL without accessing the web content or using third-party services. It uses character embedding to convert URLs into fixed-size matrices, extracts features at different levels using CNN models, classifies the multi-level features using multiple RF classifiers, and finally outputs the prediction using a winner-take-all approach. The proposed model achieved an accuracy of 99.35% on our dataset and 99.26% on the benchmark data, much higher than that of the existing extreme model.
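The first and last stages of the pipeline described in the abstract (turning a URL into a fixed-size matrix, and combining several classifier outputs) can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation. The vocabulary, `MAX_LEN`, `EMBED_DIM`, the random embedding table, and all function names are assumptions made for illustration. The reading of "winner-take-all" as "take the label from the most confident classifier" is also an assumption, since the abstract does not define the rule. The CNN feature extractors and RF classifiers are omitted.

```python
import random
import string

# Assumed vocabulary: printable ASCII characters. Index 0 is reserved for
# padding and for characters outside the vocabulary.
VOCAB = {ch: i + 1 for i, ch in enumerate(string.printable)}
MAX_LEN = 200   # assumed fixed URL length (not stated in the abstract)
EMBED_DIM = 8   # assumed embedding width (not stated in the abstract)


def encode_url(url, max_len=MAX_LEN):
    """Map a URL to exactly max_len character indices (truncate, then zero-pad)."""
    ids = [VOCAB.get(ch, 0) for ch in url[:max_len]]
    return ids + [0] * (max_len - len(ids))


def make_embedding_table(vocab_size, dim, seed=0):
    """Build a random lookup table; row 0 (padding/unknown) is all zeros.

    In a trained model these rows would be learned; random values are a
    stand-in for illustration only.
    """
    rng = random.Random(seed)
    rows = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]
    return [[0.0] * dim] + rows


def embed_url(url, table, max_len=MAX_LEN):
    """Convert a URL into a fixed-size (max_len x dim) matrix of embeddings."""
    return [table[i] for i in encode_url(url, max_len)]


def winner_take_all(prob_vectors):
    """Return the class index chosen by the most confident classifier.

    prob_vectors holds one [p_class0, p_class1, ...] list per classifier.
    This interpretation of winner-take-all is an assumption.
    """
    best = max(prob_vectors, key=max)
    return best.index(max(best))
```

For example, `embed_url("http://example.com/login", make_embedding_table(len(VOCAB), EMBED_DIM))` yields a 200 x 8 matrix regardless of the URL's length, which is the property a CNN input layer needs. The hypothetical per-classifier probabilities `[[0.6, 0.4], [0.1, 0.9]]` passed to `winner_take_all` select class 1, because the second classifier is the most confident.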