Ghalechyan Hayk, Israyelyan Elina, Arakelyan Avag, Hovhannisyan Gerasim, Davtyan Arman
EasyDMARC, Data Science, 0014, Yerevan, Armenia.
Sci Rep. 2024 Oct 24;14(1):25134. doi: 10.1038/s41598-024-74725-6.
Cybercriminals create phishing websites that mimic legitimate websites to obtain sensitive information from companies, individuals, or governments. Therefore, using state-of-the-art artificial intelligence and machine learning technologies to correctly classify phishing and legitimate URLs is imperative. We report the results of applying deterministic and probabilistic neural network models to URL classification. The key achievements of this work are: (1) We develop a unique approach based on probabilistic neural networks that improves classification accuracy. (2) We show for the first time in URL phishing research that a machine learning model trained on a combination of open-source and private datasets is successful in production. The dataset is constructed from open sources such as Alexa, PhishTank, and OpenPhish and, most importantly, from real-world production data from EasyDMARC. Daily validation of the model, using daily reported URL data and the corresponding labels from both open-source platforms and private production, reaches 97% accuracy on average. The validation dataset is labeled by PhishTank, OpenPhish, and EasyDMARC; mislabeled entries cannot be excluded, and the large number of URLs made it infeasible to check for them. Feature engineering was done without third-party dependencies. Lastly, the evaluation of both deterministic and probabilistic models shows high accuracy on short and long URLs, where short URLs are defined as having fewer than 50 characters.
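The abstract does not list the features used, so the following is only a minimal sketch of what dependency-free feature engineering on a URL string could look like, using the Python standard library alone. The function names (`is_short_url`, `lexical_features`) and every feature in the dictionary are hypothetical illustrations, not the authors' feature set; the only detail taken from the abstract is the definition of a short URL as one with fewer than 50 characters.

```python
from urllib.parse import urlsplit

# From the abstract: short URLs have fewer than 50 characters.
SHORT_URL_MAX_LEN = 50


def is_short_url(url: str) -> bool:
    """Return True if the URL counts as short under the abstract's definition."""
    return len(url) < SHORT_URL_MAX_LEN


def lexical_features(url: str) -> dict:
    """Compute a few illustrative lexical features from the URL string alone.

    Assumption: these features are hypothetical examples chosen for this
    sketch; the paper's actual features are not given in the abstract.
    No network lookups or third-party services are involved.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    return {
        "length": len(url),
        "is_short": is_short_url(url),
        "num_dots": url.count("."),
        "num_hyphens": url.count("-"),
        "num_digits": sum(ch.isdigit() for ch in url),
        "host_length": len(host),
        # Number of non-empty path segments, e.g. "/a/b" has depth 2.
        "path_depth": sum(1 for seg in parts.path.split("/") if seg),
        "uses_https": parts.scheme == "https",
    }


if __name__ == "__main__":
    print(lexical_features("https://example.com/login"))
```

Because every feature is derived from the string itself, such a function can run at classification time without WHOIS, DNS, or page-content lookups. The `is_short` flag could also be used to split an evaluation set into short and long URLs when reporting accuracy separately.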