Aragón Institute of Engineering Research (I3A), University of Zaragoza, 50018 Zaragoza, Spain.
Sensors (Basel). 2022 Nov 30;22(23):9326. doi: 10.3390/s22239326.
Cybersecurity is one of the great challenges of today's world. Rapid technological development has allowed society to prosper and has improved the quality of life, but it has also made the world more dependent on new technologies. Preventing, identifying, and mitigating security risks quickly and effectively is a major challenge. New attacks appear with increasing frequency, so threat detection methods must be updated constantly. Traditional signature-based techniques are effective against known attacks but cannot detect new ones. For this reason, intrusion detection systems (IDS) that apply machine learning (ML) techniques are an alternative of growing importance. In this work, we analyze different machine learning techniques to determine which ones yield the best traffic classification results, judged by both classification performance and execution time; the latter is decisive for later real-time deployment. The CICIDS2017 dataset was selected because it contains bidirectional traffic flows (derived from traffic captures) that include benign traffic and different types of up-to-date attacks. Each traffic flow is characterized by a set of connection-related attributes that can be used to model the traffic and to distinguish attacks from normal flows. CICIDS2017 also contains the raw, packet-based network traffic captures collected while the dataset was created, so the traffic flows can be extracted from them. Several classification techniques were evaluated using the Weka software: naive Bayes, logistic regression, multilayer perceptron, sequential minimal optimization, k-nearest neighbors, adaptive boosting, OneR, J48, PART, and random forest. As a general result, the methods based on decision trees (PART, J48, and random forest) performed best, with F1 values above 0.999 (average obtained on the complete dataset).
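For readers less familiar with the metric, the sketch below shows one way an averaged F1 value over several traffic classes can be computed from true and predicted flow labels. It is an illustration only: the function names and the toy labels are hypothetical, and the support-weighted averaging is an assumption, since the abstract does not state which averaging was used.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} computed from paired true/predicted labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1          # correct prediction for class t
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[t] += 1          # t was the true class and was missed
    scores = {}
    for label in set(y_true) | set(y_pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def weighted_f1(y_true, y_pred):
    """Average the per-class F1, weighting each class by its true support."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[c] * n for c, n in support.items()) / len(y_true)


# Toy example with hypothetical flow labels (not data from the study).
y_true = ["BENIGN", "BENIGN", "DoS", "PortScan"]
y_pred = ["BENIGN", "DoS", "DoS", "PortScan"]
print(weighted_f1(y_true, y_pred))
```

With heavily imbalanced traffic, a support-weighted average is dominated by the majority class, so per-class scores are worth inspecting alongside the average.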
Moreover, multiclass classification (distinguishing between different types of attack) and binary classification (distinguishing only between normal traffic and attack) were compared, and the effect of reducing the number of attributes with the correlation-based feature selection (CFS) technique was evaluated. Reducing the problem to binary classification yields better results, and selecting a reduced set of the most relevant attributes shortens execution (the time required to test the model decreases by more than 30%) at the cost of a small performance loss. The tree-based techniques with CFS attribute selection (six attributes selected) reached F1 values above 0.990 on the complete dataset. Finally, a conventional tool, Zeek, was used to process the raw traffic captures, identify the traffic flows, and obtain a reduced set of attributes from them. The classification results obtained using tree-based techniques with 14 Zeek-based attributes were also very high, with F1 above 0.997 (average obtained on the complete dataset) and low execution times (allowing several hundred thousand flows per second to be processed). These results on the CICIDS2017 dataset indicate that tree-based machine learning techniques may be appropriate for flow-based intrusion detection and that algorithms such as PART or J48 may offer a faster alternative to the random forest (RF) technique.
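As background on the attribute selection step, CFS scores a candidate subset of k attributes with the merit k·r_cf / sqrt(k + k(k−1)·r_ff), where r_cf is the mean attribute-class correlation and r_ff is the mean attribute-attribute correlation, so relevant but mutually redundant attributes are penalized. The following is a minimal sketch of that merit only. It assumes absolute Pearson correlation on numeric columns as the correlation measure, which may differ from the measure used by the Weka evaluator in the study; the function names and toy data are hypothetical, and the subset search itself is not shown.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0              # a constant column carries no information
    return cov / sqrt(vx * vy)


def cfs_merit(features, target):
    """CFS merit of a subset; `features` is a list of numeric columns."""
    k = len(features)
    if k == 0:
        return 0.0
    # Mean attribute-class correlation (relevance).
    r_cf = sum(abs(pearson(f, target)) for f in features) / k
    if k == 1:
        return r_cf
    # Mean attribute-attribute correlation (redundancy).
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    r_ff = sum(abs(pearson(features[i], features[j])) for i, j in pairs) / len(pairs)
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


# Toy data: 0 = normal flow, 1 = attack (hypothetical values).
target = [0, 0, 1, 1]
relevant = [0, 0, 1, 1]     # perfectly correlated with the class
noise = [0, 1, 0, 1]        # uncorrelated with the class
print(cfs_merit([relevant], target))
print(cfs_merit([relevant, noise], target))
```

In this toy case, adding the uncorrelated column lowers the merit, which is the behavior that lets CFS settle on a small attribute set.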