Life-long phishing attack detection using continual learning.

Affiliation

Department of Computer Science, Information Technology University, Lahore, 54000, Pakistan.

Publication information

Sci Rep. 2023 Jul 17;13(1):11488. doi: 10.1038/s41598-023-37552-9.

DOI:10.1038/s41598-023-37552-9

PMID:37460588

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10352299/

Abstract

Phishing is a form of identity theft that uses social engineering to obtain confidential data from unwary users. A phisher typically tries to trick the victim into clicking a URL that leads to a malicious website. Every day, many phishing victims lose their credentials and digital assets. This study demonstrates that the performance of traditional machine learning (ML)-based phishing detection models deteriorates over time. The deterioration stems from drastic changes in feature distributions caused by new phishing techniques and technological evolution. This paper explores continual learning (CL) techniques for sustaining phishing detection performance over time. To demonstrate this behavior, we collect phishing and benign samples over three consecutive years, 2018 to 2020, and divide them into six datasets to evaluate traditional ML and the proposed CL algorithms. We train a vanilla neural network (VNN) model in the CL fashion using deep feature embeddings of HTML content. We compare the proposed CL algorithms with the VNN model trained from scratch and with transfer learning (TL). We show that CL algorithms maintain accuracy over time with a tolerable deterioration of 2.45%. In contrast, the performance of the VNN and TL-based models deteriorates by over 20.65% and 8%, respectively.
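The evaluation protocol described in the abstract (train on time-ordered datasets in sequence, then check how well the model still handles the earliest data) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not name the CL algorithms it uses, so rehearsal (replaying a few stored samples from earlier periods) is swapped in as one common CL strategy; a single-layer logistic model stands in for the VNN; the data are synthetic Gaussian vectors rather than HTML embeddings; and the names `make_period`, `TinyClassifier`, `run_sequence`, and `DIM` are hypothetical. The numbers it prints are not comparable to the figures reported above.

```python
import numpy as np

DIM = 16  # assumed embedding size; the abstract does not state one


def make_period(rng, shift, n=400):
    """Synthetic stand-in for one time period of labelled page embeddings."""
    y = rng.integers(0, 2, size=n)                   # 1 = phishing, 0 = benign
    centers = 2.0 * y[:, None] - 1.0                 # class means at -1 and +1
    X = rng.normal(size=(n, DIM)) + centers + shift  # `shift` mimics drift
    half = n // 2
    return X[:half], y[:half], X[half:], y[half:]    # train / held-out split


class TinyClassifier:
    """Single-layer logistic model; a simplified stand-in for a VNN."""

    def __init__(self, lr=0.05):
        self.w = np.zeros(DIM)
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, X):
        z = np.clip(X @ self.w + self.b, -30.0, 30.0)
        return 1.0 / (1.0 + np.exp(-z))

    def fit(self, X, y, epochs=30):
        for _ in range(epochs):                      # full-batch gradient descent
            err = self.predict_proba(X) - y
            self.w -= self.lr * (X.T @ err) / len(y)
            self.b -= self.lr * err.mean()

    def accuracy(self, X, y):
        return float(np.mean((self.predict_proba(X) >= 0.5) == (y == 1)))


def run_sequence(replay_per_period, n_periods=6, seed=0):
    """Train on the periods in order. Return held-out accuracy on period 0
    right after learning it, and again after the last period."""
    rng = np.random.default_rng(seed)
    periods = [make_period(rng, shift=0.4 * t) for t in range(n_periods)]
    model = TinyClassifier()
    mem_X = np.empty((0, DIM))                       # rehearsal memory (features)
    mem_y = np.empty(0, dtype=int)                   # rehearsal memory (labels)
    acc_initial = None
    for X_tr, y_tr, _, _ in periods:
        # Rehearsal: mix stored samples from earlier periods into this batch.
        model.fit(np.vstack([X_tr, mem_X]), np.concatenate([y_tr, mem_y]))
        if acc_initial is None:
            acc_initial = model.accuracy(periods[0][2], periods[0][3])
        if replay_per_period > 0:
            keep = rng.choice(len(y_tr), size=replay_per_period, replace=False)
            mem_X = np.vstack([mem_X, X_tr[keep]])
            mem_y = np.concatenate([mem_y, y_tr[keep]])
    acc_final = model.accuracy(periods[0][2], periods[0][3])
    return acc_initial, acc_final


if __name__ == "__main__":
    for k in (0, 50):  # 0 = plain sequential fine-tuning, 50 = with rehearsal
        a0, a1 = run_sequence(replay_per_period=k)
        print(f"replay={k}: period-0 accuracy {a0:.3f} -> {a1:.3f}")
```

Setting `replay_per_period=0` reduces the loop to plain sequential fine-tuning, which gives a baseline to compare against the rehearsal run. The difference between the two returned accuracies is one simple way to express deterioration on the earliest period; the paper may define its metric differently.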
