Improved self-training-based distant label denoising method for cybersecurity entity extractions.

Authors

Zhang Ke, Wang Yunpeng, Li Ou, Hao Sirui, He Junjiang, Lan Xiaolong, Yang Jinneng, Ye Yang

Affiliations

Nuclear Power Institute of China, Chengdu, China.

Smart Rongcheng Operation Center in Xindu District, Chengdu, China.

Publication information

PLoS One. 2024 Dec 17;19(12):e0315479. doi: 10.1371/journal.pone.0315479. eCollection 2024.

DOI:10.1371/journal.pone.0315479
PMID:39689105
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11651617/

Abstract

The task of named entity recognition (NER) plays a crucial role in extracting cybersecurity-related information. Existing approaches for cybersecurity entity extraction predominantly rely on manually labelled data, which makes them labour-intensive given the lack of a cybersecurity-specific corpus. In this paper, we propose an improved self-training-based distant label denoising method for cybersecurity entity extraction. Firstly, we create two domain dictionaries of cybersecurity. Then, an algorithm that combines reverse maximum matching and part-of-speech tagging restrictions is proposed for generating distant labels for the cybersecurity domain corpus. Lastly, we propose a high-confidence text selection method and an improved self-training algorithm that incorporates a teacher-student model and weight update constraints. A model trained on high-confidence text is used to explore the true labels of low-confidence text, thereby reducing the noise in the distant annotation data. Experimental results demonstrate that the cybersecurity distantly-labelled data we obtained is of high quality. Additionally, the proposed constrained self-training algorithm effectively improves the F1 score of several state-of-the-art NER models on this dataset, yielding a 3.5% improvement for the Vendor class and a 3.35% improvement for the Product class.
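
The distant-labelling step described in the abstract (dictionary lookup by reverse maximum matching, filtered by part-of-speech restrictions) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function name `distant_label`, the `max_len` window, the BIO output scheme, the allowed tag set, and the rule that every token in a match must carry an allowed tag are all assumptions made for illustration; the abstract does not specify them. Only the Vendor and Product class names come from the source.

```python
from typing import Dict, List, Sequence, Set, Tuple

# Hypothetical tag set: the abstract does not say which part-of-speech
# restrictions the paper applies, so this is an illustrative assumption.
ALLOWED_POS: Set[str] = {"NOUN", "PROPN", "NUM"}


def distant_label(
    tokens: Sequence[str],
    pos_tags: Sequence[str],
    dictionary: Dict[Tuple[str, ...], str],
    max_len: int = 4,
    allowed_pos: Set[str] = ALLOWED_POS,
) -> List[str]:
    """Assign BIO labels by scanning the sentence from right to left.

    At each position the longest dictionary entry ending there is tried
    first (reverse maximum matching). A match is kept only if every token
    in it has an allowed part-of-speech tag (assumed restriction).
    """
    if len(tokens) != len(pos_tags):
        raise ValueError("tokens and pos_tags must have the same length")

    labels = ["O"] * len(tokens)
    end = len(tokens)
    while end > 0:
        matched = False
        # Try the longest window ending at `end`, then shrink it.
        for size in range(min(max_len, end), 0, -1):
            start = end - size
            key = tuple(t.lower() for t in tokens[start:end])
            entity_type = dictionary.get(key)
            if entity_type is None:
                continue
            if not all(tag in allowed_pos for tag in pos_tags[start:end]):
                continue  # rejected by the part-of-speech restriction
            labels[start] = f"B-{entity_type}"
            for i in range(start + 1, end):
                labels[i] = f"I-{entity_type}"
            end = start  # continue scanning to the left of the match
            matched = True
            break
        if not matched:
            end -= 1  # no entity ends here; move one token left
    return labels


# Toy dictionary and sentence, invented for this example.
toy_dictionary = {
    ("microsoft",): "Vendor",
    ("windows",): "Product",
    ("windows", "server"): "Product",
    ("can",): "Product",  # ambiguous entry that the POS filter should reject
}
toy_tokens = ["Microsoft", "Windows", "Server", "can", "crash"]
toy_pos = ["PROPN", "PROPN", "PROPN", "AUX", "VERB"]
toy_labels = distant_label(toy_tokens, toy_pos, toy_dictionary)
```

In this toy run, "Windows Server" is labelled as one Product span because the longer entry is tried before the single-token entry, and the auxiliary verb "can" stays unlabelled even though it appears in the dictionary. Labels produced this way are still noisy, which is the problem the paper's self-training stage is meant to address.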
