Department of Computer Science, University of Peshawar, Peshawar, Pakistan.
Department of Computer Science, Aden Community College, Aden, Yemen.
PLoS One. 2024 Feb 2;19(2):e0296858. doi: 10.1371/journal.pone.0296858. eCollection 2024.
Code clones are similar or identical code fragments, typically produced by copying and pasting within a software system, and they degrade both software quality and maintainability. This work systematically reviews and analyzes the recurrent neural network (RNN) techniques used to detect code clones, in order to clarify the current state of the art and offer useful knowledge to the research community. Applying the review protocol, we identified 20 primary studies from a total of 2099 studies. A detailed investigation of these studies reveals that nine RNN techniques have been used for code clone detection, with a notable preference for LSTM techniques. These techniques have proven effective at detecting both syntactic and semantic clones, often using abstract syntax trees to represent source code. Most studies applied evaluation metrics such as F-score, precision, and recall, and frequently used datasets extracted from open-source systems written in Java and C. Notably, the Graph-LSTM technique exhibited superior performance. PyTorch and TensorFlow emerged as popular tools for implementing RNN models. To advance code clone detection research, further exploration of techniques such as parallel LSTM, sentence-level LSTM, and Tree-Structured GRU is needed. More research is also needed on the ability of RNN techniques to identify semantic clones across different programming languages and in binary code. Standardized benchmarks for languages such as Python, Scratch, and C#, along with cross-language comparisons, are essential. The use of RNN techniques for clone identification is therefore a promising area that demands further research.
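The abstract names precision, recall, and F-score as the metrics most studies report. As a minimal sketch of how these are usually computed for a clone detector, the snippet below scores a set of predicted clone pairs against a reference set. The function name `clone_detection_metrics`, the pair representation, and the sample fragment identifiers are assumptions made for illustration; they do not come from the reviewed studies.

```python
def clone_detection_metrics(predicted_pairs, true_pairs):
    """Return (precision, recall, f_score) for predicted clone pairs.

    Each pair is a 2-tuple of hypothetical fragment identifiers. Pairs are
    treated as unordered, so (a, b) and (b, a) count as the same clone pair.
    """
    predicted = {frozenset(pair) for pair in predicted_pairs}
    actual = {frozenset(pair) for pair in true_pairs}

    true_positives = len(predicted & actual)

    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0

    # F-score here is the balanced F1: harmonic mean of precision and recall.
    if precision + recall == 0:
        f_score = 0.0
    else:
        f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Illustrative data only: three reported pairs, four reference pairs.
predicted = [("A.java:10", "B.java:40"), ("C.java:5", "D.java:7"), ("E.java:1", "F.java:2")]
reference = [("B.java:40", "A.java:10"), ("C.java:5", "D.java:7"),
             ("G.java:3", "H.java:9"), ("I.java:8", "J.java:6")]

precision, recall, f_score = clone_detection_metrics(predicted, reference)
print(f"precision={precision:.3f} recall={recall:.3f} f_score={f_score:.3f}")
```

Two of the three reported pairs appear in the reference set, and two of the four reference pairs are found, so precision and recall differ; the F-score balances the two.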