Synthetic flow-based cryptomining attack generation through Generative Adversarial Networks.

Affiliations

Universidad Politécnica de Madrid, Madrid, Spain.

Universidad Complutense de Madrid, Madrid, Spain.

Publication information

Sci Rep. 2022 Feb 8;12(1):2091. doi: 10.1038/s41598-022-06057-2.

DOI:10.1038/s41598-022-06057-2

PMID:35136144

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8825844/

Abstract

Due to the growing number of cyber attacks on the Internet, the demand for accurate intrusion detection systems (IDS) to prevent these attacks is increasing. To this end, Machine Learning (ML) components have been proposed as an efficient and effective solution. However, their applicability is limited by two important issues: (i) the shortage of network traffic datasets for attack analysis, and (ii) the privacy constraints on the data to be used. To overcome these problems, Generative Adversarial Networks (GANs) have been proposed for synthetic flow-based network traffic generation. However, due to the ill-convergence of GAN training, none of the existing solutions can generate high-quality, fully synthetic data that can completely substitute for real data in the training of ML components. Instead, they mix real with synthetic data, so the synthetic data acts only as data augmentation, leading to privacy breaches because real data is used. In sharp contrast, in this work we propose a novel and deterministic way to measure the quality of the synthetic data produced by a GAN, both with respect to the real data and with respect to its performance when used for ML tasks. As a by-product, we present a heuristic that uses these metrics to select the best-performing generator during GAN training, leading to a novel stopping criterion that can be applied even when different types of synthetic data are to be used in the same ML task. We demonstrate the adequacy of our proposal by generating synthetic flow-based data for cryptomining attacks and normal traffic using an enhanced version of a Wasserstein GAN. The results show that the generated synthetic network traffic can completely replace real data when training an ML-based cryptomining detector, obtaining similar performance and avoiding privacy violations, since real data is not used in the training of the ML-based detector.
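The abstract does not spell out the metrics or the selection heuristic, so the following is only a minimal sketch of the general idea: score each generator checkpoint by (a) how close its synthetic samples are to real ones and (b) how well a detector trained on them performs, then keep the best checkpoint. Everything here is an assumption made for illustration, not the authors' method: the use of a per-feature 1-D Wasserstein distance, the `Checkpoint` record, the `select_best` rule, and the `min_task_score` threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


def wasserstein_1d(real: Sequence[float], synthetic: Sequence[float]) -> float:
    """Empirical 1-D Wasserstein-1 distance between two equal-sized samples.

    For equal-sized samples this is the mean absolute difference between
    the sorted values of the two samples.
    """
    if len(real) != len(synthetic) or not real:
        raise ValueError("samples must be non-empty and of equal size")
    paired = zip(sorted(real), sorted(synthetic))
    return sum(abs(r - s) for r, s in paired) / len(real)


@dataclass
class Checkpoint:
    """Hypothetical record of one generator snapshot taken during training."""
    epoch: int
    distance: float    # similarity to real data (lower is better)
    task_score: float  # e.g. F1 of a detector trained on synthetic data
                       # and tested on real data (higher is better)


def select_best(checkpoints: List[Checkpoint], min_task_score: float) -> Checkpoint:
    """Assumed selection rule (not taken from the paper).

    Among checkpoints whose task score reaches min_task_score, pick the one
    closest to the real data. If none qualifies, fall back to the checkpoint
    with the highest task score.
    """
    if not checkpoints:
        raise ValueError("no checkpoints to choose from")
    eligible = [c for c in checkpoints if c.task_score >= min_task_score]
    if eligible:
        return min(eligible, key=lambda c: c.distance)
    return max(checkpoints, key=lambda c: c.task_score)


# Made-up numbers, purely to exercise the functions.
history = [
    Checkpoint(epoch=10, distance=0.50, task_score=0.80),
    Checkpoint(epoch=20, distance=0.30, task_score=0.92),
    Checkpoint(epoch=30, distance=0.20, task_score=0.70),
    Checkpoint(epoch=40, distance=0.35, task_score=0.95),
]
best = select_best(history, min_task_score=0.90)
print(best.epoch)
```

In a sketch like this, the chosen checkpoint doubles as a stopping point: training can end once later snapshots stop improving on the selected one. Requiring a minimum task score before comparing distances is one possible way to combine the two kinds of metric; a weighted sum would be another.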
