Mohammad Javed Ali
Govindram Seksaria Institute of Dacryology, L.V. Prasad Eye Institute, Hyderabad, India.
Orbit. 2025 May 8:1-7. doi: 10.1080/01676830.2025.2501656.
This study aimed to report the performance of the large language model DeepSeek (DeepSeek TM, Hangzhou, China) and perform a head-to-head comparison with ChatGPT (OpenAI, San Francisco, USA) in the context of lacrimal drainage disorders.
Questions and statements were used to construct prompts covering common and uncommon aspects of lacrimal drainage disorders. Prompts avoided new knowledge beyond February 2024 and were presented at least twice to the latest versions of DeepSeek and ChatGPT [accessed February 15-18, 2025]. A set of prompts previously assessed with ChatGPT in 2023 (ChatGPT-2023) was also utilized in this study. The responses of DeepSeek and ChatGPT were analyzed for evidence-based content, updated knowledge, specificity, speed, and factual inaccuracies. The responses of the current ChatGPT were also compared with those from 2023 to assess the improvement of the artificial intelligence chatbot. Three lacrimal surgeons graded the responses into three categories: correct, partially correct, and factually incorrect. They also compared the overall quality of the responses between DeepSeek and ChatGPT based on content, organization, and clarity.
Twenty-five prompts were presented to the latest versions [February 2025] of DeepSeek and ChatGPT. There was no significant difference in the speed of response. Agreement among the three observers in grading the responses was high (96%). The two AI models were similar in accuracy. DeepSeek's responses were graded as correct in 60% (15/25), partially correct in 36% (9/25), and factually incorrect in 4% (1/25). ChatGPT-2025 responses were graded as correct in 56% (14/25), partially correct in 40% (10/25), and factually incorrect in 4% (1/25). Compared with 2023, ChatGPT-2025 gave responses that were more specific, more accurate, and less generic, with less recycling of phrases. When confronted with inaccuracies, both models admitted and corrected the mistakes in subsequent responses. Both AI models demonstrated the capability of challenging incorrect prompts and premises.
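As an illustration of the arithmetic behind the reported figures, a minimal sketch follows. The per-prompt data are not public, so the tallies and the three-rater ratings below are hypothetical, constructed only to mirror the reported DeepSeek distribution (15/9/1 out of 25) and a 96% overall percent agreement (all three raters concurring on 24 of 25 prompts):

```python
from collections import Counter

# Hypothetical grades mirroring the reported DeepSeek distribution:
# 15 correct, 9 partially correct, 1 factually incorrect (n = 25).
GRADES = ["correct"] * 15 + ["partial"] * 9 + ["incorrect"] * 1


def grade_percentages(grades):
    """Return each grade's share of the total, as percentages."""
    counts = Counter(grades)
    total = len(grades)
    return {grade: 100 * n / total for grade, n in counts.items()}


def percent_agreement(ratings_per_prompt):
    """Simple overall percent agreement: the share of prompts on which
    all raters assigned the same grade."""
    unanimous = sum(1 for ratings in ratings_per_prompt
                    if len(set(ratings)) == 1)
    return 100 * unanimous / len(ratings_per_prompt)


print(grade_percentages(GRADES))
# -> {'correct': 60.0, 'partial': 36.0, 'incorrect': 4.0}

# Hypothetical three-rater data: unanimous on 24 of 25 prompts.
example_ratings = [("correct",) * 3] * 24 + [("correct", "partial", "correct")]
print(percent_agreement(example_ratings))  # -> 96.0
```

Note that simple percent agreement does not correct for chance; studies of this kind sometimes report a chance-corrected statistic such as Fleiss' kappa instead.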
DeepSeek was comparable, not superior, to ChatGPT in the context of lacrimal drainage disorders. Each had unique advantages, and the two could complement each other. Both need to be specifically trained and re-trained for individual medical subspecialties.