Wan Bangrui, Zhou Jianjun, Wang Ying, Chen Feng, Qian Ying
School of Software Engineering, Chongqing University of Posts and Telecommunications, Chongqing, China.
Chongqing Engineering Research Center of Software Quality Assurance, Testing and Assessment, Chongqing, China.
PeerJ Comput Sci. 2025 Jan 17;11:e2504. doi: 10.7717/peerj-cs.2504. eCollection 2025.
Binary code similarity detection (BCSD) aims to determine whether a pair of binary code snippets is similar and is widely used for tasks such as malware analysis, patch analysis, and clone detection. Current state-of-the-art approaches are based on the Transformer architecture, which requires substantial computational resources. Learning-based approaches also leave room for improvement in learning the deeper semantics of binary code. In this paper, we propose MSSA, a multi-stage semantic-aware neural network for BCSD at the function level. Using four semantic-aware neural networks, it integrates the semantic and structural information of assembly instructions within basic blocks, between basic blocks, and across the entire function, achieving a deep understanding of binary code semantics. MSSA is a lightweight model with only 0.38M parameters in its backbone network, making it suitable for deployment in CPU environments. Experimental results show that MSSA outperforms Gemini, Asm2Vec, SAFE, and jTrans in classification performance and ranks second only to the Transformer-based jTrans in retrieval performance.