Hussain Shumaila, Nadeem Muhammad, Baber Junaid, Hamdi Mohammed, Rajab Adel, Al Reshan Mana Saleh, Shaikh Asadullah
Department of Computer Science, Sardar Bahadur Khan Women's University, Quetta, Pakistan.
Department of Computer Science and IT, University of Balochistan, Quetta, Pakistan.
Sci Rep. 2024 Mar 28;14(1):7406. doi: 10.1038/s41598-024-56871-z.
Software vulnerabilities pose a significant threat to system security, necessitating effective automatic detection methods. Current techniques face challenges such as dependency issues, language bias, and coarse detection granularity. This study presents a novel deep learning-based vulnerability detection system for Java code. The system uses hybrid feature extraction, combining graph-based and sequence-based techniques, to improve semantic and syntactic understanding of the code. For graph representation, it draws on control flow graphs (CFG), abstract syntax trees (AST), and program dependencies (PD), vectorized with a greedy longest-match-first scheme. A hybrid neural network (GCN-RFEMLP) and the pre-trained CodeBERT model extract features, which are fed into a quantum convolutional neural network with self-attentive pooling. The system addresses long-term information dependency and coarse detection granularity by employing an intermediate code representation and inter-procedural code slices. To mitigate language bias, the benchmark Software Assurance Reference Dataset (SARD) is employed. In evaluations, the system achieves 99.2% accuracy in detecting vulnerabilities, outperforming benchmark methods. The proposed approach covers vulnerability classes listed in the Common Weakness Enumeration (CWE), including improper input validation, missing authorization, buffer overflow, cross-site scripting, and SQL injection.
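The greedy longest-match-first vectorization named above can be illustrated with a minimal sketch. It assumes a WordPiece-style scheme, the subword algorithm in which each token is split by repeatedly taking the longest vocabulary entry that matches at the current position. The abstract does not specify the authors' exact procedure, so the vocabulary `VOCAB`, the `##` continuation prefix, and the function names `greedy_longest_match` and `vectorize` are all hypothetical choices made for illustration.

```python
from typing import Dict, List

UNK = "[UNK]"  # placeholder for tokens that cannot be split (assumption)

# Toy vocabulary, invented for illustration; not taken from the paper.
VOCAB: Dict[str, int] = {
    UNK: 0, "get": 1, "##Parameter": 2, "##Param": 3,
    "execute": 4, "##Query": 5, "(": 6, ")": 7, "request": 8, ".": 9,
}


def greedy_longest_match(token: str, vocab: Dict[str, int]) -> List[str]:
    """Split one code token into vocabulary pieces, longest prefix first."""
    pieces: List[str] = []
    start = 0
    while start < len(token):
        end = len(token)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start:
            candidate = token[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [UNK]  # no prefix matched: the whole token is unknown
        pieces.append(match)
        start = end
    return pieces


def vectorize(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Map a sequence of code tokens to a flat list of vocabulary ids."""
    return [vocab[p] for t in tokens for p in greedy_longest_match(t, vocab)]
```

With this toy vocabulary, `getParameter` is split into `get` and `##Parameter` rather than `get` and `##Param`, because the longer continuation piece is tried first; a token with no matching prefix maps to `[UNK]`.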