Omar Mahmud, Nassar Saleh, Hijazi Kareem, Glicksberg Benjamin S, Nadkarni Girish N, Klang Eyal
The Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, USA; Maccabi Health Services, Israel.
Edith Wolfson Medical Center, Holon, Israel.
Comput Biol Med. 2025 Feb;185:109545. doi: 10.1016/j.compbiomed.2024.109545. Epub 2024 Dec 12.
BACKGROUND: Amid the increasing use of AI in medical research, this study aims to assess and compare the accuracy and credibility of OpenAI's GPT-4 and Google's Gemini in generating medical research introductions, focusing on the precision and reliability of their citations across five medical fields. METHODS: We compared the two models, OpenAI's GPT-4 and Google's Gemini Ultra, across five medical fields, focusing on the credibility and accuracy of citations, alongside analysis of introduction length and unreferenced data. RESULTS: Gemini outperformed GPT-4 in reference precision. Gemini's references showed 77.2% correctness and 68.0% accuracy, compared to GPT-4's 54.0% correctness and 49.2% accuracy (p < 0.001 for both). These differences of 23.2 percentage points in correctness and 18.8 percentage points in accuracy represent an improvement in citation reliability. GPT-4 generated longer introductions (332.4 ± 52.1 words vs. Gemini's 256.4 ± 39.1 words, p < 0.001) but included more unreferenced facts and assumptions (1.6 ± 1.2 vs. 1.2 ± 1.06 instances, p = 0.001). CONCLUSION: While Gemini demonstrates significantly superior performance in generating credible and accurate references for medical research introductions, both models produced fabricated evidence, limiting their reliability for reference searching. This snapshot comparison of two prominent AI models highlights the potential and limitations of AI in academic content creation. The findings underscore the critical need for verification of AI-generated academic content and call for ongoing research into evolving AI models and their applications in scientific writing.
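A minimal Python sketch of the arithmetic behind the reported gaps: it recomputes the percentage-point differences from the correctness and accuracy rates quoted above, and illustrates a generic two-proportion z-test as one way such rates could be compared. The per-model reference count N_REFS is a hypothetical placeholder, not the study's sample size, and the test shown is illustrative rather than the study's actual statistical method.

```python
# Illustrative sketch only: recomputes the percentage-point gaps reported in the
# abstract and shows a generic two-proportion z-test. N_REFS is a hypothetical
# placeholder, NOT the study's actual number of evaluated references.
import math

gemini_correct, gpt4_correct = 0.772, 0.540    # reported correctness rates
gemini_accurate, gpt4_accurate = 0.680, 0.492  # reported accuracy rates

print(f"Correctness gap: {100 * (gemini_correct - gpt4_correct):.1f} percentage points")
print(f"Accuracy gap:    {100 * (gemini_accurate - gpt4_accurate):.1f} percentage points")

def two_proportion_z_test(p1: float, n1: int, p2: float, n2: int) -> tuple[float, float]:
    """Standard two-proportion z-test; returns (z statistic, two-sided p-value)."""
    x1, x2 = p1 * n1, p2 * n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value

N_REFS = 250  # hypothetical references per model, for illustration only
z, p = two_proportion_z_test(gemini_correct, N_REFS, gpt4_correct, N_REFS)
print(f"Correctness comparison (hypothetical n={N_REFS} per model): z={z:.2f}, p={p:.4f}")
```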