Binary code similarity analysis based on naming function and common vector space.

Xia Bing, Pang Jianmin, Zhou Xin, Shan Zheng, Wang Junchao, Yue Feng

State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou, China.

Zhongyuan University of Technology, Zhengzhou, China.

Sci Rep. 2023 Sep 21;13(1):15676. doi: 10.1038/s41598-023-42769-9.

Binary code similarity analysis is widely used in vulnerability search, where source code may not be available, to detect whether two binary functions are similar. Building on deep learning and natural language processing techniques, several approaches have been proposed to perform cross-platform binary code similarity analysis using control flow graphs. However, existing schemes have three shortcomings: instruction syntax differs greatly across target platforms, control flow graph nodes cannot be aligned, and stable high-level semantics are rarely used. These shortcomings make it hard to identify similar computations between binary functions that were compiled from the same source code for different platforms. We argue that extracting stable, platform-independent semantics can improve model accuracy, and we propose N_Match, a cross-platform binary function similarity comparison model. The model lifts instructions from different platforms into the same semantic space to mask their underlying platform differences, uses graph embedding to learn the stable semantics of neighboring nodes, and extracts high-level knowledge from naming functions to reduce the differences introduced by different platforms and optimization levels. It then combines the stable graph structure with the stable, platform-independent API knowledge of naming functions to represent the final semantics of a function. The experimental results show that N_Match is more accurate than the baseline model in cross-platform, cross-optimization-level, and industrial scenarios. In the vulnerability search experiment, N_Match significantly improves hit@N, and its mAP exceeds that of the current graph embedding model by 66%. We also report several interesting observations from the experiments. The code and model are publicly available at https://www.github.com/CSecurityZhongYuan/Binary-Name_Match .
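The abstract reports results as hit@N and mAP over a vulnerability search, where a query function is compared against candidate functions in a shared vector space. The following is a minimal sketch of how such a ranking and its metrics are commonly computed; it is not the N_Match implementation. The embeddings, the candidate names, and the use of cosine similarity are assumptions made for illustration, and hit@N is taken here to mean "at least one relevant function appears in the top N", which may differ from the paper's exact definition.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_candidates(query, candidates):
    """Return candidate names sorted by descending similarity to the query."""
    scored = [(cosine(query, vec), name) for name, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored]

def hit_at_n(ranking, relevant, n):
    """1.0 if any relevant name is in the top n, else 0.0 (assumed definition)."""
    return 1.0 if any(name in relevant for name in ranking[:n]) else 0.0

def average_precision(ranking, relevant):
    """Average precision for one query; mAP is the mean of this over all queries."""
    hits, total = 0, 0.0
    for position, name in enumerate(ranking, start=1):
        if name in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0

# Hypothetical 2-dimensional embeddings of one function compiled three ways.
query = [1.0, 0.0]
candidates = {
    "arm_O2": [0.9, 0.1],
    "mips_O0": [0.0, 1.0],
    "x86_O3": [0.7, 0.7],
}
relevant = {"arm_O2", "mips_O0"}  # assumed ground truth for this toy query

ranking = rank_candidates(query, candidates)
hit1 = hit_at_n(ranking, relevant, 1)
ap = average_precision(ranking, relevant)
```

In this toy case the relevant functions land at ranks 1 and 3, so the average precision is (1/1 + 2/3) / 2. Real embeddings would come from a trained model rather than hand-written vectors.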

Similar articles

Cross-platform binary code similarity detection based on NMT and graph embedding.

Math Biosci Eng. 2021 May 25;18(4):4528-4551. doi: 10.3934/mbe.2021230.

Semantic aware-based instruction embedding for binary code similarity detection.

PLoS One. 2024 Jun 11;19(6):e0305299. doi: 10.1371/journal.pone.0305299. eCollection 2024.

Folic acid supplementation and malaria susceptibility and severity among people taking antifolate antimalarial drugs in endemic areas.

Cochrane Database Syst Rev. 2022 Feb 1;2(2022):CD014217. doi: 10.1002/14651858.CD014217.

LaGAT: link-aware graph attention network for drug-drug interaction prediction.

Bioinformatics. 2022 Dec 13;38(24):5406-5412. doi: 10.1093/bioinformatics/btac682.

IoTSim: Internet of Things-Oriented Binary Code Similarity Detection with Multiple Block Relations.

Sensors (Basel). 2023 Sep 11;23(18):7789. doi: 10.3390/s23187789.

A Knowledge Graph Entity Disambiguation Method Based on Entity-Relationship Embedding and Graph Structure Embedding.

Comput Intell Neurosci. 2021 Sep 23;2021:2878189. doi: 10.1155/2021/2878189. eCollection 2021.

Use of word and graph embedding to measure semantic relatedness between Unified Medical Language System concepts.

J Am Med Inform Assoc. 2020 Oct 1;27(10):1538-1546. doi: 10.1093/jamia/ocaa136.

An effective knowledge graph entity alignment model based on multiple information.

Neural Netw. 2023 May;162:83-98. doi: 10.1016/j.neunet.2023.02.029. Epub 2023 Feb 24.

Spherical hashing: binary code embedding with hyperspheres.

IEEE Trans Pattern Anal Mach Intell. 2015 Nov;37(11):2304-16. doi: 10.1109/TPAMI.2015.2408363.

Cited by

Identifying shader sub-patterns for GPU performance tuning and architecture design.

Sci Rep. 2024 Oct 14;14(1):24036. doi: 10.1038/s41598-024-68974-8.
