Hou Wenpin, Ji Zhicheng
Department of Biostatistics, Mailman School of Public Health, Columbia University, New York City, NY, 10032, USA.
Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, NC, 07024, USA.
Adv Sci (Weinh). 2025 Feb;12(8):e2412279. doi: 10.1002/advs.202412279. Epub 2024 Dec 30.
The performance of seven large language models (LLMs) in generating programming code using various prompt strategies, programming languages, and task difficulties is systematically evaluated. GPT-4 substantially outperforms other LLMs, including Gemini Ultra and Claude 2. The coding performance of GPT-4 varies considerably with different prompt strategies. In most LeetCode and GeeksforGeeks coding contests evaluated in this study, GPT-4, employing the optimal prompt strategy, outperforms 85 percent of human participants in a competitive environment, many of whom are students and professionals with moderate programming experience. GPT-4 demonstrates strong capabilities in translating code between different programming languages and in learning from past errors. The computational efficiency of the code generated by GPT-4 is comparable to that of human programmers. GPT-4 is also capable of handling broader programming tasks, including front-end design and database operations. These results suggest that GPT-4 has the potential to serve as a reliable assistant in programming code generation and software development. A programming assistant is designed based on an optimal prompt strategy to facilitate the practical use of LLMs for programming.