Luo Zhenhao, Wang Pengfei, Xie Wei, Zhou Xu, Wang Baosheng
College of Computer, National University of Defense Technology, Changsha 410073, China.
Sensors (Basel). 2023 Sep 11;23(18):7789. doi: 10.3390/s23187789.
Binary code similarity detection (BCSD) plays a crucial role in various computer security applications, including vulnerability detection, malware detection, and software component analysis. With the development of the Internet of Things (IoT), binaries are built for many different instruction set architectures, which requires BCSD approaches that are robust across architectures. In this study, we propose IoTSim, a novel IoT-oriented binary code similarity detection approach. Our approach leverages a customized transformer-based language model with disentangled attention to capture relative position information. To mitigate out-of-vocabulary (OOV) challenges in the language model, we introduce a base-token prediction pre-training task aimed at capturing basic semantics for unseen tokens. During function embedding generation, we integrate directed jumps, data dependency, and address adjacency to capture multiple relations between basic blocks. We then assign different weights to the different relations and use a multi-layer Graph Convolutional Network (GCN) to generate function embeddings. We implemented a prototype of IoTSim. Our experimental results show that the proposed block relation matrix improves IoTSim's performance by a large margin. With a pool size of 10^3, IoTSim achieves a recall@1 of 0.903 across architectures, outperforming the state-of-the-art approaches Trex, SAFE, and PalmTree.
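The function embedding step described in the abstract (three block relations merged with per-relation weights, then passed through a multi-layer GCN) can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration only: the relation weights, the symmetric normalization with self-loops, the ReLU activation, the mean pooling, and all function names are hypothetical and are not taken from the paper. In IoTSim the block features would come from the language model; here they are small hand-written vectors.

```python
import math

def combine_relations(relations, weights):
    """Weighted sum of same-sized block relation matrices (hypothetical weights)."""
    n = len(relations[0])
    return [[sum(w * r[i][j] for r, w in zip(relations, weights))
             for j in range(n)] for i in range(n)]

def normalize(adj):
    """Add self-loops, then scale entry (i, j) by 1/sqrt(deg_i * deg_j).
    Assumed normalization; row sums are used as degrees."""
    n = len(adj)
    a = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def matmul(x, y):
    """Plain matrix product of two lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def gcn_layer(a_hat, h, w):
    """One GCN layer: ReLU(A_hat @ H @ W)."""
    z = matmul(matmul(a_hat, h), w)
    return [[max(0.0, v) for v in row] for row in z]

def function_embedding(relations, weights, block_feats, layer_weights):
    """Stack GCN layers over the merged relation matrix, then mean-pool the blocks."""
    a_hat = normalize(combine_relations(relations, weights))
    h = block_feats
    for w in layer_weights:
        h = gcn_layer(a_hat, h, w)
    return [sum(col) / len(h) for col in zip(*h)]

# Toy function with three basic blocks.
jumps    = [[0, 1, 1], [0, 0, 0], [0, 0, 0]]  # block 0 jumps to blocks 1 and 2
data_dep = [[0, 0, 0], [0, 0, 1], [0, 0, 0]]  # block 2 uses a value defined in block 1
adjacent = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]  # blocks laid out consecutively in memory
rel_weights = [1.0, 0.5, 0.25]                # illustrative values, not from the paper

combined = combine_relations([jumps, data_dep, adjacent], rel_weights)
block_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]           # stand-in block embeddings
layer_weights = [[[0.5, 0.1], [0.2, 0.4]], [[0.3, 0.0], [0.0, 0.3]]]
embedding = function_embedding([jumps, data_dep, adjacent], rel_weights,
                               block_feats, layer_weights)
```

Merging the relations into one matrix before the GCN keeps the network itself unchanged; a trained model would learn the layer weights (and possibly the relation weights) rather than fix them by hand as above.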
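For readers unfamiliar with the reported metric, the following sketch shows how recall@1 over a candidate pool is commonly computed: each query function is ranked against every function in the pool, and the query counts as a hit if its true match is ranked first. This is the usual definition and is assumed here rather than taken from the paper; the cosine similarity measure, the function names, and the toy vectors are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def recall_at_k(queries, pool, truth, k=1):
    """Fraction of queries whose true match (pool index truth[i]) is in the top k."""
    hits = 0
    for query, target in zip(queries, truth):
        ranked = sorted(range(len(pool)),
                        key=lambda j: cosine(query, pool[j]),
                        reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(queries)

# Toy pool of three function embeddings (a real evaluation would use 10^3).
pool = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Query embeddings, e.g. the same functions compiled for another architecture.
queries = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.2]]
truth = [0, 1, 2]

r1 = recall_at_k(queries, pool, truth, k=1)
r2 = recall_at_k(queries, pool, truth, k=2)
```

In this toy example the third query is closer to pool entry 0 than to its true match, so it only counts as a hit once k is raised to 2. Larger pools make recall@1 harder to achieve, which is why the pool size is reported alongside the score.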