Medina-Arco Joaquín Gaspar, Magán-Carrión Roberto, Rodríguez-Gómez Rafael Alejandro, García-Teodoro Pedro
Network Engineering & Security Group (NESG), University of Granada, 18012 Granada, Spain.
Sensors (Basel). 2024 Jan 12;24(2):479. doi: 10.3390/s24020479.
With the significant increase in cyber-attacks and attempts to gain unauthorised access to systems and information, Network Intrusion-Detection Systems (NIDSs) have become essential detection tools. Anomaly-based NIDSs use machine learning techniques to distinguish between normal and anomalous traffic. They are trained on previously gathered and labelled datasets, from which they learn to detect anomalies in future data. However, such datasets can be accidentally or deliberately contaminated, compromising the performance of the NIDS. This is the case with the UGR'16 dataset, in which botnet-type attacks in the subset intended for training were not identified during the labelling process. This paper addresses the mislabelling problem in real network traffic datasets by introducing a novel methodology that (i) allows the quality of a network traffic dataset to be analysed by identifying possible hidden or unidentified anomalies and (ii) selects the ideal subset of data to optimise the performance of the anomaly detection model, even in the presence of hidden attacks erroneously labelled as normal network traffic. To this end, a two-step process that makes incremental use of the training dataset is proposed. Experiments conducted on the contaminated UGR'16 dataset with the state-of-the-art NIDS Kitsune demonstrate the feasibility of the approach for revealing observations of hidden botnet-based attacks in this dataset.
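The idea of using a training set incrementally, first to surface hidden anomalies and then to choose a training subset, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not specify the two steps in detail, so the function names (`fit`, `score`, `flag_hidden_anomalies`, `select_prefix`), the single numeric feature, the threshold of 3.0, and the toy mean/deviation detector standing in for Kitsune are all hypothetical and are not the authors' implementation.

```python
from statistics import mean, pstdev


def fit(train):
    """Toy detector (assumption, not Kitsune): store mean and spread."""
    mu = mean(train)
    sigma = pstdev(train) or 1.0  # fall back to 1.0 on constant data
    return mu, sigma


def score(model, x):
    """Anomaly score: distance from the mean in standard deviations."""
    mu, sigma = model
    return abs(x - mu) / sigma


def flag_hidden_anomalies(train, prefix_len, threshold=3.0):
    """Assumed step 1: train on the first prefix_len observations and
    return indices of later 'normal'-labelled observations that score high."""
    model = fit(train[:prefix_len])
    return [i for i in range(prefix_len, len(train))
            if score(model, train[i]) > threshold]


def select_prefix(train, validation, labels, candidates, threshold=3.0):
    """Assumed step 2: pick the prefix length whose model classifies a
    labelled validation set most accurately (ties keep the earliest)."""
    best_len, best_acc = None, -1.0
    for n in candidates:
        model = fit(train[:n])
        preds = [score(model, x) > threshold for x in validation]
        acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        if acc > best_acc:
            best_len, best_acc = n, acc
    return best_len, best_acc


# Hypothetical data: all values are labelled "normal", but indices 8 and 9
# are contaminating observations that the labelling missed.
train = [10, 11, 9, 10, 12, 10, 9, 11, 50, 52, 10, 11]
validation = [10, 11, 48, 9, 51]
labels = [False, False, True, False, True]  # True marks an attack

hidden = flag_hidden_anomalies(train, prefix_len=8)
best_len, best_acc = select_prefix(train, validation, labels, [4, 8, 12])
```

In this toy setting, a model trained on the full contaminated set absorbs the mislabelled values into its notion of normal traffic and then fails to flag similar values, whereas a model trained on a clean prefix flags them. That is the intuition behind selecting a training subset rather than using all available data.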