Zhang Ke, Wang Yunpeng, Li Ou, Hao Sirui, He Junjiang, Lan Xiaolong, Yang Jinneng, Ye Yang
Nuclear Power Institute of China, Chengdu, China.
Smart Rongcheng Operation Center in Xindu District, Chengdu, China.
PLoS One. 2024 Dec 17;19(12):e0315479. doi: 10.1371/journal.pone.0315479. eCollection 2024.
Named entity recognition (NER) plays a crucial role in extracting cybersecurity-related information. Because no cybersecurity-specific corpus is available, existing approaches to cybersecurity entity extraction rely predominantly on manually labelled data, which makes them labour-intensive. In this paper, we propose an improved self-training-based distant label denoising method for cybersecurity entity extraction. First, we create two cybersecurity domain dictionaries. Second, we propose an algorithm that combines reverse maximum matching with part-of-speech tagging restrictions to generate distant labels for the cybersecurity domain corpus. Finally, we propose a high-confidence text selection method and an improved self-training algorithm that incorporates a teacher-student model and weight update constraints. A model trained on the high-confidence text is used to explore the true labels of the low-confidence text, thereby reducing the noise in the distantly annotated data. Experimental results demonstrate that the resulting distantly labelled cybersecurity data is of high quality. Additionally, the proposed constrained self-training algorithm improves the F1 score of several state-of-the-art NER models on this dataset, yielding a 3.5% improvement for the Vendor class and a 3.35% improvement for the Product class.
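The distant labelling step described in the abstract, reverse maximum matching restricted by part-of-speech tags, can be illustrated with a minimal sketch. This is not the authors' implementation. The toy dictionaries, the function names `build_lexicon` and `reverse_max_match`, the BIO label scheme, and the specific restriction used here (the last token of a matched span must be tagged NOUN or PROPN) are all assumptions made for illustration; the abstract does not state the exact restriction rule or the contents of the two dictionaries.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy dictionaries; the paper's actual dictionaries are not shown here.
VENDORS = {("microsoft",), ("apache", "software", "foundation")}
PRODUCTS = {("windows", "server"), ("struts",)}

# Assumed restriction: a match is accepted only if its last token has one of these tags.
ALLOWED_POS = {"NOUN", "PROPN"}


def build_lexicon() -> Dict[Tuple[str, ...], str]:
    """Merge the two dictionaries into one lookup from token tuple to entity class."""
    lexicon: Dict[Tuple[str, ...], str] = {}
    for entry in VENDORS:
        lexicon[entry] = "VENDOR"
    for entry in PRODUCTS:
        lexicon[entry] = "PRODUCT"
    return lexicon


def reverse_max_match(
    tokens: List[str],
    pos_tags: List[str],
    lexicon: Dict[Tuple[str, ...], str],
    allowed_pos: Set[str] = ALLOWED_POS,
) -> List[str]:
    """Scan right to left, preferring the longest dictionary entry ending at each position."""
    labels = ["O"] * len(tokens)
    max_len = max(len(entry) for entry in lexicon)
    end = len(tokens)  # exclusive right boundary of the current window
    while end > 0:
        matched = False
        # Try the longest candidate span first, then shrink it from the left.
        for size in range(min(max_len, end), 0, -1):
            start = end - size
            key = tuple(tok.lower() for tok in tokens[start:end])
            label = lexicon.get(key)
            if label is None:
                continue
            # POS restriction (assumed form): reject matches whose last token
            # is not tagged as a noun or proper noun.
            if pos_tags[end - 1] not in allowed_pos:
                continue
            labels[start] = "B-" + label
            for i in range(start + 1, end):
                labels[i] = "I-" + label
            end = start  # continue scanning to the left of the match
            matched = True
            break
        if not matched:
            end -= 1  # no entry ends here; move the window one token left
    return labels


if __name__ == "__main__":
    lexicon = build_lexicon()
    tokens = ["Microsoft", "released", "a", "patch", "for", "Windows", "Server"]
    pos_tags = ["PROPN", "VERB", "DET", "NOUN", "ADP", "PROPN", "PROPN"]
    for token, label in zip(tokens, reverse_max_match(tokens, pos_tags, lexicon)):
        print(token, label)
```

In this sketch the part-of-speech check keeps a dictionary word such as "struts" from being labelled as a product when a tagger marks it as a verb. The tags are supplied by hand here; in practice they would come from whichever tagger is used on the corpus.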