Cangrade, Inc., Watertown, MA 02472.
Information School, University of Washington, Seattle, WA 98195.
Proc Natl Acad Sci U S A. 2024 Aug 27;121(35):e2404328121. doi: 10.1073/pnas.2404328121. Epub 2024 Aug 20.
How good a research scientist is ChatGPT? We systematically probed the capabilities of GPT-3.5 and GPT-4 across four central components of the scientific process: as a Research Librarian, Research Ethicist, Data Generator, and Novel Data Predictor, using psychological science as a testing field. In Study 1 (Research Librarian), unlike human researchers, GPT-3.5 and GPT-4 hallucinated, authoritatively generating fictional references 36.0% and 5.4% of the time, respectively, although GPT-4 exhibited an evolving capacity to acknowledge its fictions. In Study 2 (Research Ethicist), GPT-4 (though not GPT-3.5) proved capable of detecting violations such as p-hacking in fictional research protocols, correcting 88.6% of blatantly presented issues and 72.6% of subtly presented issues. In Study 3 (Data Generator), both models consistently replicated patterns of cultural bias previously discovered in large language corpora, indicating that ChatGPT can simulate known results, a prerequisite for usefulness in both data generation and skills such as hypothesis generation. In contrast, in Study 4 (Novel Data Predictor), neither model successfully predicted new results absent from its training data, and neither appeared to leverage substantially new information when predicting more vs. less novel outcomes. Together, these results suggest that GPT is a flawed but rapidly improving librarian, already a decent research ethicist, capable of data generation in simple domains with known characteristics, but poor at predicting novel patterns of empirical data to aid future experimentation.