Yip Rui, Sun Young Joo, Bassuk Alexander G, Mahajan Vinit B
Molecular Surgery Laboratory, Stanford University, Palo Alto, California, United States of America.
Department of Ophthalmology, Byers Eye Institute, Stanford University, Palo Alto, California, United States of America.
PLOS Digit Health. 2025 May 12;4(5):e0000849. doi: 10.1371/journal.pdig.0000849. eCollection 2025 May.
There is a growing number of articles about conversational AI (i.e., ChatGPT) for generating scientific literature reviews and summaries. Yet, comparative evidence lags behind its wide adoption by many clinicians and researchers. We explored ChatGPT's utility for literature search from an end-user perspective through the lens of clinicians and biomedical researchers. We quantitatively compared the utility of basic versions of ChatGPT against conventional search methods such as Google and PubMed. We further tested whether ChatGPT user-support tools (i.e., plugins, the web-browsing function, prompt engineering, and custom GPTs) could improve its responses across four common and practical literature search scenarios: (1) high-interest topics with an abundance of information, (2) niche topics with limited information, (3) scientific hypothesis generation, and (4) newly emerging clinical practice questions. Our results demonstrated that basic ChatGPT functions had limitations in consistency, accuracy, and relevancy. User-support tools showed improvements, but the limitations persisted. Interestingly, each literature search scenario posed different challenges: an abundance of secondary information sources for high-interest topics, and unconvincing literature for new/niche topics. This study tested practical examples highlighting both the potential and the pitfalls of integrating conversational AI into literature search processes, and underscores the necessity for rigorous comparative assessments of AI tools in scientific research.