College of Cybersecurity, Sichuan University, Chengdu 610065, China.
China Information Technology Security Evaluation Center, Beijing 100085, China.
PLoS One. 2019 Nov 18;14(11):e0225196. doi: 10.1371/journal.pone.0225196. eCollection 2019.
With the widespread use of Web applications, security issues in source code are increasing, and exposed vulnerabilities seriously endanger the interests of service providers and customers. Several models address this problem, but most of them rely on complex graphs generated from source code or on regex patterns based on expert experience. In this paper, we propose TAP, an analysis model based on the token mechanism and deep learning, to discover vulnerabilities in PHP: Hypertext Preprocessor (PHP) Web programs conveniently. Based on the token mechanism of the PHP language, we designed a custom tokenizer that unifies tokens, supports some PHP features, and optimizes parsing. The tokenizer also implements parameter iteration to achieve data flow analysis. We trained the deep learning model of TAP on the Software Assurance Reference Dataset (SARD) and the SQLI-LABS dataset by combining the word2vec model with a Long Short-Term Memory (LSTM) network. In the experiment on the CWE-89 dataset, TAP achieves an Area Under the Curve (AUC) of 0.9941, which is better than the other models, and the highest accuracy, 0.9787. Furthermore, compared with RIPS, TAP performs much better in multiclass classification, with a Kappa of 0.8319 and a Hamming distance of 0.0840.
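To illustrate what "unifying tokens" can mean in practice, the following is a minimal sketch in Python, not the tokenizer used in TAP. It assumes a simplified regex lexer rather than PHP's own token mechanism, and the token classes (`VAR`, `STR`, `NUM`), the `SUPERGLOBALS` set, and the function name `unify_tokens` are hypothetical choices made for this example. User-defined variables are renamed to positional placeholders and literals are collapsed, while taint sources such as `$_GET` are kept verbatim.

```python
import re

# Hypothetical, simplified lexer for a PHP fragment (assumption: the real
# tokenizer is built on PHP's token mechanism and is more complete).
TOKEN_RE = re.compile(r"""
    (?P<VAR>\$[A-Za-z_]\w*)
  | (?P<STR>'(?:\\.|[^'\\])*'|"(?:\\.|[^"\\])*")
  | (?P<NUM>\d+(?:\.\d+)?)
  | (?P<NAME>[A-Za-z_]\w*)
  | (?P<OP>[^\s\w])
""", re.VERBOSE)

# Assumed list of user-controlled inputs that should stay recognizable.
SUPERGLOBALS = {"$_GET", "$_POST", "$_COOKIE", "$_REQUEST"}


def unify_tokens(php_source):
    """Map PHP source to a normalized token sequence."""
    tokens = []
    var_ids = {}  # original variable name -> positional index
    for match in TOKEN_RE.finditer(php_source):
        kind, text = match.lastgroup, match.group()
        if kind == "VAR":
            if text in SUPERGLOBALS:
                tokens.append(text)  # keep taint sources verbatim
            else:
                index = var_ids.setdefault(text, len(var_ids))
                tokens.append(f"VAR{index}")  # same variable, same placeholder
        elif kind in ("STR", "NUM"):
            tokens.append(kind)  # collapse literals to their class
        else:
            tokens.append(text)  # function names and operators as-is
    return tokens


sample = '$id = $_GET["id"]; $q = "SELECT * FROM t WHERE id=" . $id;'
print(unify_tokens(sample))
```

For the sample line, the sketch yields `VAR0 = $_GET [ STR ] ; VAR1 = STR . VAR0 ;`, so the reuse of `$id` in the query remains visible as `VAR0`. A normalized sequence of this kind is the sort of input that a word2vec embedding followed by an LSTM could consume.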
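The two multiclass metrics reported above can be computed as in the following sketch. It assumes that "Kappa" refers to Cohen's kappa and that "Hamming distance" is used in the sense of Hamming loss (the fraction of mismatched labels); the class labels and predictions are invented for illustration and are not data from the experiments.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement between two labelings, corrected for chance."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: product of the marginal frequencies, summed per class.
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    # Undefined when expected == 1 (both labelings use a single shared class).
    return (observed - expected) / (1 - expected)


def hamming_loss(y_true, y_pred):
    """Fraction of samples whose predicted label differs from the true label."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels for six code samples.
y_true = ["sqli", "xss", "safe", "safe", "sqli", "xss"]
y_pred = ["sqli", "xss", "safe", "sqli", "sqli", "safe"]

print(cohen_kappa(y_true, y_pred))
print(hamming_loss(y_true, y_pred))
```

Higher kappa and lower Hamming loss both indicate better agreement with the ground truth, which is the direction of the comparison with RIPS stated in the abstract.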