An Improved Software Source Code Vulnerability Detection Method: Combination of Multi-Feature Screening and Integrated Sampling Model.

Authors

He Xin, Han Daoqi, Zhou Shuncheng, Fu Xueliang, Li Honghui

Affiliation

College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot 010018, China.

Publication

Sensors (Basel). 2025 Mar 14;25(6):1816. doi: 10.3390/s25061816.

Abstract

Vulnerability detection in software source code is crucial to ensuring software security. Existing models struggle with class imbalance in datasets and long training times. To address these issues, this paper introduces a multi-feature screening and integrated sampling model (MFISM) that improves the efficiency and accuracy of vulnerability detection. The key innovations are (i) using the abstract syntax tree (AST) representation of source code to extract potentially vulnerability-related features through multiple feature screening techniques; (ii) applying analysis of variance (ANOVA) and evaluating feature selection techniques to identify representative and discriminative features; (iii) addressing class imbalance with an integrated over-sampling strategy that creates synthetic samples from vulnerable code to enlarge the minority class; and (iv) using outlier detection to filter out abnormal synthetic samples so that the synthesized samples are of high quality. The model uses a bidirectional long short-term memory network (Bi-LSTM) to accurately identify vulnerabilities in the source code. Experimental results show that MFISM improves the F1 score by approximately 10% compared with existing DeepBalance methods and reduces the training time to 2-3 h. These results confirm the effectiveness and superiority of MFISM in source code vulnerability detection tasks.
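
The abstract names the pipeline stages but not their exact algorithms, so the following is only a minimal sketch of steps (ii) to (iv) under stated assumptions, not the authors' implementation. It assumes the AST features are already numeric vectors, scores each feature with a plain one-way ANOVA F-statistic, creates synthetic minority samples by interpolating between random pairs of vulnerable samples (a simplification of SMOTE-style over-sampling, which would normally interpolate towards nearest neighbours), and drops synthetic samples by a simple z-score rule. All function names, the toy data, and the `max_z` threshold are hypothetical; the Bi-LSTM classifier is omitted.

```python
import random
import statistics


def anova_f(feature_values, labels):
    """One-way ANOVA F-statistic of a single feature across class labels."""
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(label, []).append(value)
    n, k = len(feature_values), len(groups)
    grand_mean = statistics.fmean(feature_values)
    # Between-group and within-group sums of squares.
    between = sum(len(g) * (statistics.fmean(g) - grand_mean) ** 2
                  for g in groups.values())
    within = sum((v - statistics.fmean(g)) ** 2
                 for g in groups.values() for v in g)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return (between / (k - 1)) / (within / (n - k))


def select_top_k(X, y, k):
    """Indices of the k features with the largest F-statistic."""
    n_features = len(X[0])
    scores = [anova_f([row[j] for row in X], y) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]


def oversample(minority, n_new, rng):
    """Assumed stand-in for over-sampling: interpolate random minority pairs."""
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        t = rng.random()
        synthetic.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
    return synthetic


def filter_outliers(synthetic, reference, max_z=3.0):
    """Keep synthetic samples whose every feature lies within max_z
    standard deviations of the reference (real minority) mean."""
    columns = list(zip(*reference))
    means = [statistics.fmean(c) for c in columns]
    stds = [statistics.pstdev(c) or 1.0 for c in columns]  # guard zero spread
    return [s for s in synthetic
            if all(abs(v - m) / sd <= max_z
                   for v, m, sd in zip(s, means, stds))]


# Toy data (hypothetical): feature 0 separates the classes, feature 1 does not.
X = [[0.0, 5.0], [0.1, 1.0], [0.2, 3.0], [1.0, 2.0], [1.1, 4.0], [0.9, 3.0]]
y = [0, 0, 0, 1, 1, 1]  # 1 = vulnerable (treated here as the minority class)

selected = select_top_k(X, y, k=1)
minority = [row for row, label in zip(X, y) if label == 1]
synthetic = oversample(minority, n_new=4, rng=random.Random(0))
kept = filter_outliers(synthetic + [[50.0, 3.0]], minority)
```

On the toy data, the F-test ranks feature 0 first, and the injected sample `[50.0, 3.0]` is removed by the filter while the interpolated samples are kept. In practice, library implementations such as scikit-learn's `f_classif` and imbalanced-learn's `SMOTE` would likely replace the hand-written helpers.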

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3132/11945435/1fafefdf062b/sensors-25-01816-g012.jpg
