Generating credible referenced medical research: A comparative study of openAI's GPT-4 and Google's gemini.

Authors

Omar Mahmud, Nassar Saleh, Hijazi Kareem, Glicksberg Benjamin S, Nadkarni Girish N, Klang Eyal

Affiliations

The Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, USA; Maccabi Health Services, Israel.

Edith Wolfson Medical Center, Holon, Israel.

Publication details

Comput Biol Med. 2025 Feb;185:109545. doi: 10.1016/j.compbiomed.2024.109545. Epub 2024 Dec 12.

DOI: 10.1016/j.compbiomed.2024.109545
PMID: 39667055

Abstract

BACKGROUND

Amid the increasing use of AI in medical research, this study assesses and compares the accuracy and credibility of OpenAI's GPT-4 and Google's Gemini in generating medical research introductions, focusing on the precision and reliability of their citations across five medical fields.

METHODS

We compared the two models, OpenAI's GPT-4 and Google's Gemini Ultra, across five medical fields, focusing on the credibility and accuracy of citations, alongside the analysis of introduction length and unreferenced data.

RESULTS

Gemini outperformed GPT-4 in reference precision. Gemini's references showed 77.2% correctness and 68.0% accuracy, compared to GPT-4's 54.0% correctness and 49.2% accuracy (p < 0.001 for both). Gemini's lead of 23.2 percentage points in correctness and 18.8 percentage points in accuracy indicates more reliable citations. GPT-4 generated longer introductions (332.4 ± 52.1 words vs. Gemini's 256.4 ± 39.1 words, p < 0.001) but included more unreferenced facts and assumptions (1.6 ± 1.2 vs. 1.2 ± 1.06 instances, p = 0.001).
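The percentage-point gaps quoted in the results follow directly from the four reported rates. The short Python sketch below recomputes them as a sanity check. The function names and the dictionary layout are illustrative assumptions, and the relative gain is a derived figure that the abstract itself does not report.

```python
# Reference metrics as reported in the abstract, in percent.
reported = {
    "correctness": {"gemini": 77.2, "gpt4": 54.0},
    "accuracy": {"gemini": 68.0, "gpt4": 49.2},
}


def point_difference(gemini: float, gpt4: float) -> float:
    """Absolute gap in percentage points, rounded to one decimal place."""
    return round(gemini - gpt4, 1)


def relative_gain(gemini: float, gpt4: float) -> float:
    """Gap relative to the GPT-4 rate, in percent (derived, not reported)."""
    return round((gemini - gpt4) / gpt4 * 100, 1)


differences = {name: point_difference(**rates) for name, rates in reported.items()}
gains = {name: relative_gain(**rates) for name, rates in reported.items()}

for name in reported:
    print(f"{name}: +{differences[name]} points, +{gains[name]}% relative to GPT-4")
```

The absolute differences match the 23.2 and 18.8 points stated in the abstract. Percentage points and relative percentages answer different questions, so the two should not be used interchangeably when citing these results.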

CONCLUSION

While Gemini demonstrates significantly superior performance in generating credible and accurate references for medical research introductions, both models produced fabricated evidence, limiting their reliability for reference searching. This snapshot comparison of two prominent AI models highlights the potential and limitations of AI in academic content creation. The findings underscore the critical need for verification of AI-generated academic content and call for ongoing research into evolving AI models and their applications in scientific writing.

Similar articles

1
Evaluating the image recognition capabilities of GPT-4V and Gemini Pro in the Japanese national dental examination.
J Dent Sci. 2025 Jan;20(1):368-372. doi: 10.1016/j.jds.2024.06.015. Epub 2024 Jul 2.
2
Gemini AI vs. ChatGPT: A comprehensive examination alongside ophthalmology residents in medical knowledge.
Graefes Arch Clin Exp Ophthalmol. 2025 Feb;263(2):527-536. doi: 10.1007/s00417-024-06625-4. Epub 2024 Sep 15.
3
Evaluating Bard Gemini Pro and GPT-4 Vision Against Student Performance in Medical Visual Question Answering: Comparative Case Study.
JMIR Form Res. 2024 Dec 17;8:e57592. doi: 10.2196/57592.
4
Comparing the Performance of Popular Large Language Models on the National Board of Medical Examiners Sample Questions.
Cureus. 2024 Mar 11;16(3):e55991. doi: 10.7759/cureus.55991. eCollection 2024 Mar.
5
Can off-the-shelf visual large language models detect and diagnose ocular diseases from retinal photographs?
BMJ Open Ophthalmol. 2025 Apr 7;10(1):e002076. doi: 10.1136/bmjophth-2024-002076.
6
Large Language Models for Therapy Recommendations Across 3 Clinical Specialties: Comparative Study.
J Med Internet Res. 2023 Oct 30;25:e49324. doi: 10.2196/49324.
7
Harnessing advanced large language models in otolaryngology board examinations: an investigation using python and application programming interfaces.
Eur Arch Otorhinolaryngol. 2025 Apr 25. doi: 10.1007/s00405-025-09404-x.
8
Assessing GPT-4's Performance in Delivering Medical Advice: Comparative Analysis With Human Experts.
JMIR Med Educ. 2024 Jul 8;10:e51282. doi: 10.2196/51282.
9
The Performance of ChatGPT-4 and Gemini Ultra 1.0 for Quality Assurance Review in Emergency Medical Services Chest Pain Calls.
Prehosp Emerg Care. 2025;29(3):210-217. doi: 10.1080/10903127.2024.2376757. Epub 2024 Jul 22.

Cited by

1
Multi-model assurance analysis showing large language models are highly vulnerable to adversarial hallucination attacks during clinical decision support.
Commun Med (Lond). 2025 Aug 2;5(1):330. doi: 10.1038/s43856-025-01021-3.
2
Battle of the Bots: Solving Clinical Cases in Osteoarticular Infections With Large Language Models.
Mayo Clin Proc Digit Health. 2025 May 23;3(3):100230. doi: 10.1016/j.mcpdig.2025.100230. eCollection 2025 Sep.
3
HIV Prevention and Treatment Information from Four Artificial Intelligence Platforms: A Thematic Analysis.
AIDS Behav. 2025 Jun 7. doi: 10.1007/s10461-025-04786-9.
4
Benchmarking the Confidence of Large Language Models in Answering Clinical Questions: Cross-Sectional Evaluation Study.
JMIR Med Inform. 2025 May 16;13:e66917. doi: 10.2196/66917.
5
[Applications of artificial intelligence in science: a practical guide for editors and authors].
Arch Cardiol Mex. 2025 Mar 11;95(2):135-7. doi: 10.24875/ACM.24000240.