Department of Computer Science, University of Peshawar, Peshawar, Pakistan.
Department of Computer Science, Aden Community College, Aden, Yemen.
PLoS One. 2024 May 10;19(5):e0302333. doi: 10.1371/journal.pone.0302333. eCollection 2024.
In software development, it is common to reuse existing source code by copying and pasting. This practice produces many code clones (similar or identical code fragments) that degrade software quality and maintainability. Although several code clone detection techniques exist, many struggle to identify semantic clones because they cannot extract both syntactic and semantic information. Few techniques use low-level source code representations, such as bytecode or assembly, for clone detection. This work introduces a novel code representation for identifying syntactic and semantic clones in Java source code. It integrates high-level features extracted from the Abstract Syntax Tree with low-level features derived from intermediate representations generated by static analysis tools such as the Soot framework. Using this combined representation, fifteen machine-learning models are trained to detect code clones. Evaluation on a large dataset shows that the models identify semantic clones accurately. Among these classifiers, ensemble classifiers such as LightGBM are especially accurate. Combining features linearly makes the models more effective than combining them by multiplication or by distance. The experimental findings indicate that the proposed method can outperform current clone detection techniques in detecting semantic clones.
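The three feature-combination strategies named in the abstract can be illustrated with a minimal sketch. The abstract does not define the operations, so the definitions here are assumptions: "linear" is taken to be an element-wise sum, "multiplication" an element-wise product, and "distance" an element-wise absolute difference. The function names and the toy feature values are hypothetical and do not come from the paper.

```python
from typing import List

Vector = List[float]


def _check_same_length(a: Vector, b: Vector) -> None:
    # Pairwise combination only makes sense for equally sized feature vectors.
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")


def combine_linear(a: Vector, b: Vector) -> Vector:
    # Assumed meaning of "linear combination": element-wise sum.
    _check_same_length(a, b)
    return [x + y for x, y in zip(a, b)]


def combine_multiplication(a: Vector, b: Vector) -> Vector:
    # Assumed meaning of "multiplication": element-wise product.
    _check_same_length(a, b)
    return [x * y for x, y in zip(a, b)]


def combine_distance(a: Vector, b: Vector) -> Vector:
    # Assumed meaning of "distance": element-wise absolute difference.
    _check_same_length(a, b)
    return [abs(x - y) for x, y in zip(a, b)]


# Hypothetical per-fragment features: high-level (AST-derived) counts
# followed by low-level (intermediate-representation) counts.
fragment_a = [3.0, 1.0] + [5.0, 2.0]
fragment_b = [2.0, 1.0] + [4.0, 0.0]

pair_linear = combine_linear(fragment_a, fragment_b)
pair_product = combine_multiplication(fragment_a, fragment_b)
pair_distance = combine_distance(fragment_a, fragment_b)
```

Each combined vector represents one pair of code fragments and could be passed, together with a clone or non-clone label, to any classifier that accepts fixed-length numeric input.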