College of Cybersecurity, Sichuan University, Chengdu, Sichuan, P.R. China.
PLoS One. 2020 Feb 6;15(2):e0228439. doi: 10.1371/journal.pone.0228439. eCollection 2020.
In recent years, the number of vulnerabilities discovered and publicly disclosed has risen sharply. However, only a small fraction of vulnerabilities are ever exploited, so their value to attackers varies widely. Organizations with limited resources therefore need to rule out non-exploitable vulnerabilities quickly and prioritize patches accordingly. Recent work uses machine learning to predict exploited vulnerabilities from features extracted from open-source intelligence (OSINT). However, given the explosive growth of vulnerability information, existing methods leave room for improvement when applied across multiple threat intelligence sources, and a more general method is needed. Moreover, previous methods processed vulnerability-related descriptions with traditional text processing techniques, which capture only static statistical characteristics and ignore the context and meaning of the words. To address these challenges, we propose fastEmbed, an exploit prediction model that combines fastText with the LightGBM algorithm. We replicate key portions of state-of-the-art exploit prediction work and use them as baseline models. By learning embeddings of vulnerability-related text on extremely imbalanced data sets, our model outperforms the baseline in both generalization ability and prediction ability without temporal intermixing, with an average overall improvement of 6.283%. In predicting exploits in the wild, our model also outperforms the baseline, achieving an F1 measure of 0.586 on the minority class (a 33.577% improvement over the work that uses features from the darkweb/deepweb). The results demonstrate that the model can better characterize the exploitability of vulnerabilities and effectively predict exploits in the wild.
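To illustrate why an F1 measure on the minority class is a natural metric for extremely imbalanced data, here is a minimal sketch in Python. The labels and predictions are made-up toy values, not data from the study, and the helper names `minority_f1` and `accuracy` are hypothetical.

```python
def minority_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for the positive (minority) class only."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly, over both classes."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data (assumption): 100 vulnerabilities, of which 5 are exploited (label 1).
y_true = [1] * 5 + [0] * 95

# A trivial predictor that never flags anything as exploited.
always_negative = [0] * 100

# A hypothetical classifier: finds 3 of the 5 exploited, raises 2 false alarms.
some_model = [1, 1, 1, 0, 0] + [1, 1] + [0] * 93

for name, y_pred in [("always_negative", always_negative), ("some_model", some_model)]:
    _, _, f1 = minority_f1(y_true, y_pred)
    print(name, "accuracy:", round(accuracy(y_true, y_pred), 3), "minority F1:", round(f1, 3))
```

In this toy setting the trivial predictor reaches high accuracy while its minority-class F1 is zero, which is why accuracy alone says little when exploited vulnerabilities are rare.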