Optimization for threat classification of various data types-based on ML model and LLM.

Authors

Hong Chaerim, Oh Taeyeon

Affiliation

Seoul AI School, aSSIST University, Seoul, 03767, South Korea.

Publication

Sci Rep. 2025 Jul 2;15(1):22768. doi: 10.1038/s41598-025-05182-y.

DOI:10.1038/s41598-025-05182-y
PMID:40594396
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12216055/
Abstract

As AI technology develops, the number of cyber security threats that exploit it is rising rapidly, and an effective security threat detection system is urgently needed to respond to them. AI-based security tools that detect and respond to these threats are an active research area. This study explores how heterogeneous data, such as signs of security attacks in security threat news and weaknesses in source code, can be analyzed in an integrated way in an ML model and LLM environment. We applied scaling and normalization techniques to the Post News data to reduce bias, and we used syntax analysis, semantic analysis, and data flow information to perform an integrated analysis of the source code and improve detection performance. The approach is designed to apply to both ML models and LLMs by systematizing data labeling and data formats. The results showed that the constructed learning models performed well in both text analysis and source code analysis. On the Post News data, the ML-based models XGBoost, SVM, and Random Forest all showed F1-scores of 0.96 to 0.97, while the LLM-based models ST5-xxl, XLNet, BERT, CodeBERT, and GraphCodeBERT all showed a score of 0.97. On the C/C++ weakness code detection data, the LLM-based models ST5-xxl, XLNet, CodeBERT, and GraphCodeBERT each achieved 0.9999, while BERT achieved 0.9037. Among the ML-based models, XGBoost showed an accuracy of 0.9999, SVM 0.9699, and Random Forest 0.9493, all with the TF-IDF embedding method. The models performed better with the TF-IDF embedding method than with the Word2Vec embedding. This study proposed an integrated ML and LLM framework that could effectively detect source code vulnerabilities using abstract syntax trees (AST). The framework overcame the limitations of existing static analysis tools and improved detection accuracy by considering the structural characteristics and the semantic context of the code simultaneously. In particular, by combining AST-based feature extraction with the LLM's natural language understanding capabilities, it improved generalization performance for new types of vulnerabilities and significantly reduced false positives.
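
The abstract reports that the ML-based models performed best with TF-IDF features. As a rough illustration of what TF-IDF weighting over source code tokens can look like, here is a minimal, standard-library-only sketch. It is not the authors' pipeline: the regex tokenizer, the function names `tokenize` and `tfidf`, the two C snippets, and the smoothed IDF formula are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical tokenizer: identifiers, integer literals, and single
# punctuation characters. The paper does not specify its tokenization.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|[^\sA-Za-z_\d]")


def tokenize(code: str) -> list[str]:
    """Split a C/C++ snippet into coarse lexical tokens."""
    return TOKEN_RE.findall(code)


def tfidf(docs: list[str]) -> list[dict[str, float]]:
    """Return one L2-normalized sparse TF-IDF vector (token -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many snippets each token occurs.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    # Smoothed IDF (an assumed variant): ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}

    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


# Illustrative snippets only; they are not taken from the paper's dataset.
SNIPPETS = [
    "char buf[8]; strcpy(buf, src);",                    # unbounded copy
    "char buf[8]; strncpy(buf, src, sizeof(buf) - 1);",  # bounded copy
]

vectors = tfidf(SNIPPETS)
```

In this sketch, a token that appears in only one snippet, such as `strcpy`, receives a higher IDF than a token shared by both, such as `char`. That is the property that lets a downstream classifier pick up on calls that distinguish one class of code from another. The resulting sparse vectors would then be passed to a classifier; the models named in the abstract (XGBoost, SVM, Random Forest) are not reproduced here.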
