Shaukat Muhammad Waqas, Amin Rashid, Muslam Muhana Magboul Ali, Alshehri Asma Hassan, Xie Jiang
Department of Computer Science, University of Engineering and Technology, Taxila 47050, Pakistan.
Department of Computer Science, University of Chakwal, Chakwal 48800, Pakistan.
Sensors (Basel). 2023 Sep 25;23(19):8070. doi: 10.3390/s23198070.
Phishing attacks are evolving with increasingly sophisticated techniques and pose significant threats. Building on the potential of machine-learning-based approaches, our research presents a web phishing detection approach that applies several machine learning algorithms. An efficient layered classification model is proposed to detect phishing websites based on their URL structure, text, and image features. Previous similar studies applied machine learning techniques to URL features using limited datasets. In our research, we used a large dataset of 20,000 website URLs and extracted 22 salient features from each URL to prepare a comprehensive dataset. In addition, a second dataset containing website text was prepared for NLP-based text evaluation. Many phishing websites present text as images; to handle this, the text is extracted from the images and classified as spam or legitimate. The experimental evaluation demonstrated efficient and accurate phishing detection. Our layered classification model uses support vector machine (SVM), XGBoost, random forest, multilayer perceptron, linear regression, decision tree, naïve Bayes, and SVC algorithms. The performance evaluation revealed that the XGBoost algorithm outperformed the other applied models, with a maximum accuracy and precision of 94% in the training phase and 91% in the testing phase. Multilayer perceptron also performed well, with an accuracy of 91% in the testing phase. The accuracy results for random forest and decision tree were 91% and 90%, respectively. Logistic regression and SVM algorithms were used in the text-based classification, with accuracies of 87% and 88%, respectively. With these results, the models classified phishing and legitimate websites well based on URL, text, and image features. This research contributes to the early detection of sophisticated phishing attacks, enhancing the security of internet users.
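The abstract states that 22 features are extracted from each URL but does not name them. As a rough illustration of what URL-structure feature extraction can look like, the following minimal sketch computes a handful of common lexical features. The function name `extract_url_features` and every feature in it are assumptions chosen for illustration; they are not the feature set used in the study.

```python
import re
from urllib.parse import urlparse

# Matches a dotted-quad host such as 192.168.0.1 (illustrative check only).
IPV4_RE = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def extract_url_features(url: str) -> dict:
    """Return a few lexical URL features (hypothetical, not the paper's 22)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    host_is_ipv4 = bool(IPV4_RE.match(host))
    # Count labels to the left of the registered name, e.g. "www" in
    # www.example.com. This naive count ignores multi-part suffixes
    # such as .co.uk, which is an accepted simplification here.
    num_subdomains = 0 if host_is_ipv4 else max(host.count(".") - 1, 0)
    return {
        "url_length": len(url),
        "num_dots": url.count("."),
        "num_hyphens": url.count("-"),
        "has_at_symbol": "@" in url,
        "uses_https": parsed.scheme == "https",
        "host_is_ipv4": host_is_ipv4,
        "num_subdomains": num_subdomains,
        "path_depth": len([part for part in parsed.path.split("/") if part]),
    }


if __name__ == "__main__":
    for example in ("https://www.example.com/a/b", "http://192.168.0.1/login"):
        print(example, extract_url_features(example))
```

A feature dictionary of this kind could be turned into a numeric vector and passed to any of the classifiers listed in the abstract; how the study itself encoded its features is not described here.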