Comparing Large Language Models and Human Programmers for Generating Programming Code.

Authors

Hou Wenpin, Ji Zhicheng

Affiliations

Department of Biostatistics, Mailman School of Public Health, Columbia University, New York City, NY, 10032, USA.

Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, NC, 07024, USA.

Publication information

Adv Sci (Weinh). 2025 Feb;12(8):e2412279. doi: 10.1002/advs.202412279. Epub 2024 Dec 30.

DOI:10.1002/advs.202412279

PMID:39736107

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11848527/

Abstract

The performance of seven large language models (LLMs) in generating programming code using various prompt strategies, programming languages, and task difficulties is systematically evaluated. GPT-4 substantially outperforms other LLMs, including Gemini Ultra and Claude 2. The coding performance of GPT-4 varies considerably with different prompt strategies. In most LeetCode and GeeksforGeeks coding contests evaluated in this study, GPT-4, employing the optimal prompt strategy, outperforms 85 percent of human participants in a competitive environment, many of whom are students and professionals with moderate programming experience. GPT-4 demonstrates strong capabilities in translating code between different programming languages and in learning from past errors. The computational efficiency of the code generated by GPT-4 is comparable to that of human programmers. GPT-4 is also capable of handling broader programming tasks, including front-end design and database operations. These results suggest that GPT-4 has the potential to serve as a reliable assistant in programming code generation and software development. A programming assistant is designed based on an optimal prompt strategy to facilitate the practical use of LLMs for programming.
