
Evaluating the Accuracy of Responses by Large Language Models for Information on Disease Epidemiology.

Authors

Zhu Kexin, Zhang Jiajie, Klishin Anton, Esser Mario, Blumentals William A, Juhaeri Juhaeri, Jouquelet-Royer Corinne, Sinnott Sarah-Jo

Affiliations

Epidemiology and Benefit Risk, Sanofi, Bridgewater, New Jersey, USA.

Babraham Research Campus, Sanofi, Cambridge, UK.

Publication information

Pharmacoepidemiol Drug Saf. 2025 Feb;34(2):e70111. doi: 10.1002/pds.70111.

DOI: 10.1002/pds.70111
PMID: 39901360
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11791122/
Abstract

PURPOSE

Accurate background epidemiology of diseases is required in pharmacoepidemiologic research. We evaluated the performance of large language models (LLMs), including ChatGPT-3.5, ChatGPT-4, and Google Bard, when prompted with questions on disease frequency.

METHODS

A total of 21 questions on the prevalence and incidence of common and rare diseases were developed and submitted to each LLM twice on different dates. Benchmark data were obtained from literature searches targeting "gold-standard" references (e.g., government statistics, peer-reviewed articles). Accuracy was evaluated by comparing LLMs' responses to the benchmark data. Consistency was determined by comparing the responses to the same query submitted on different dates. The relevance and authenticity of references were evaluated.

RESULTS

The three LLMs generated 126 responses. For ChatGPT-4, 76.2% of responses were accurate, compared with 50.0% for Bard and 45.2% for ChatGPT-3.5. ChatGPT-4 also exhibited higher consistency (71.4%) than Bard (57.9%) or ChatGPT-3.5 (46.7%). ChatGPT-4 provided 52 references, of which 27 (51.9%) contained relevant information, and all were authentic. Only 9.2% (10/109) of the references from Bard were relevant. Of the 65 unique references among Bard's 109, 67.7% were authentic, 7.7% provided insufficient information for access, 10.8% had inaccurate citations, and 13.8% were non-existent or fabricated. ChatGPT-3.5 did not provide any references.
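The percentages reported in the Results can be cross-checked against whole-number counts. The sketch below is only an arithmetic illustration, not the authors' analysis: it assumes each model answered all 21 questions twice (42 responses per model, which matches the stated total of 126), and the helper name `count_from_pct` is hypothetical. The consistency figures are left out because the abstract does not state their denominators.

```python
# Counts taken from the abstract: 21 questions, 2 submissions each, 3 models.
QUESTIONS, RUNS, MODELS = 21, 2, 3
PER_MODEL = QUESTIONS * RUNS      # assumed equal split: 42 responses per model
TOTAL = PER_MODEL * MODELS        # 126, as reported


def count_from_pct(pct: float, n: int) -> int:
    """Return the integer count k whose share k/n matches pct to one decimal."""
    k = round(pct / 100 * n)
    if abs(100 * k / n - pct) > 0.05 + 1e-9:
        raise ValueError(f"{pct}% of {n} is not a whole-number count")
    return k


# Accurate responses per model, derived from the reported percentages.
accurate = {
    "ChatGPT-4": count_from_pct(76.2, PER_MODEL),
    "Bard": count_from_pct(50.0, PER_MODEL),
    "ChatGPT-3.5": count_from_pct(45.2, PER_MODEL),
}

# Breakdown of Bard's 65 unique references by authenticity category.
bard_unique = {
    "authentic": count_from_pct(67.7, 65),
    "insufficient information": count_from_pct(7.7, 65),
    "inaccurate citation": count_from_pct(10.8, 65),
    "non-existent/fabricated": count_from_pct(13.8, 65),
}

if __name__ == "__main__":
    print(TOTAL, accurate)
    print(bard_unique, sum(bard_unique.values()))
```

Under these assumptions the four Bard categories add up to exactly 65, which supports reading "65/109" as 65 unique references out of the 109 that Bard returned.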

CONCLUSIONS

ChatGPT-4 outperformed Bard and ChatGPT-3.5 in retrieving information on disease epidemiology. However, all three LLMs gave inaccurate responses, and the references they supplied included irrelevant, incomplete, or fabricated ones. Such limitations preclude the use of current forms of LLMs for obtaining accurate disease epidemiology by researchers in the pharmaceutical industry, in academia, or in the regulatory setting.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/64eb/11791122/f289a846282b/PDS-34-e70111-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/64eb/11791122/16dc39220638/PDS-34-e70111-g001.jpg

