ChatGPT as Research Scientist: Probing GPT's capabilities as a Research Librarian, Research Ethicist, Data Generator, and Data Predictor.

Affiliations

Cangrade, Inc., Watertown, MA 02472.

Information School, University of Washington, Seattle, WA 98195.

Publication information

Proc Natl Acad Sci U S A. 2024 Aug 27;121(35):e2404328121. doi: 10.1073/pnas.2404328121. Epub 2024 Aug 20.

DOI:10.1073/pnas.2404328121

PMID:39163339

Full text link: https://pmc.ncbi.nlm.nih.gov/articles/PMC11363351/

Abstract

How good a research scientist is ChatGPT? We systematically probed the capabilities of GPT-3.5 and GPT-4 across four central components of the scientific process: as a Research Librarian, Research Ethicist, Data Generator, and Novel Data Predictor, using psychological science as a testing field. In Study 1 (Research Librarian), unlike human researchers, GPT-3.5 and GPT-4 hallucinated, authoritatively generating fictional references 36.0% and 5.4% of the time, respectively, although GPT-4 exhibited an evolving capacity to acknowledge its fictions. In Study 2 (Research Ethicist), GPT-4 (though not GPT-3.5) proved capable of detecting violations like p-hacking in fictional research protocols, correcting 88.6% of blatantly presented issues, and 72.6% of subtly presented issues. In Study 3 (Data Generator), both models consistently replicated patterns of cultural bias previously discovered in large language corpora, indicating that ChatGPT can simulate known results, an antecedent to usefulness for both data generation and skills like hypothesis generation. Contrastingly, in Study 4 (Novel Data Predictor), neither model was successful at predicting new results absent in their training data, and neither appeared to leverage substantially new information when predicting more vs. less novel outcomes. Together, these results suggest that GPT is a flawed but rapidly improving librarian, a decent research ethicist already, capable of data generation in simple domains with known characteristics but poor at predicting novel patterns of empirical data to aid future experimentation.
