Das Deepsekhar, Narayan Atindra, Mishra Varsha, Takia Lalit, Grover Sumit, Bharati Avinav, Mb Shrijith
Ophthalmology, All India Institute of Medical Sciences, New Delhi, New Delhi, IND.
Medicine, All India Institute of Medical Sciences, New Delhi, New Delhi, IND.
Cureus. 2025 Aug 22;17(8):e90773. doi: 10.7759/cureus.90773. eCollection 2025 Aug.
Background: Artificial intelligence (AI) chatbots are increasingly used in healthcare for information dissemination and clinical decision support. However, their reliability and applicability in subspecialties such as ocular oncology remain largely unassessed. This study aimed to evaluate the accuracy, completeness, readability, and real-world utility of three prominent AI chatbots, ChatGPT-4o (OpenAI, San Francisco, California, USA), DeepSeek v3 (DeepSeek, Hangzhou, Zhejiang, China), and Gemini 2.0 (Google DeepMind, London, UK), in responding to clinically relevant questions related to ocular malignancies.
Methods: A cross-sectional observational study was conducted at a tertiary eye care institute in Northern India. Five clinical questions covering key ocular oncologic conditions were created and standardized by ocular oncology experts. These prompts were input into ChatGPT-4o, DeepSeek v3, and Gemini 2.0. Responses were independently evaluated using a structured proforma assessing correctness, completeness, readability (Flesch-Kincaid score, word count, sentence count), presence of irrelevant data, applicability in the Indian healthcare setting, and reliability. Data were analyzed using the Kruskal-Wallis test and ANOVA.
Results: All three chatbots demonstrated comparable correctness scores (mean 3.4, SD 0.49). However, four out of five responses from each chatbot were deemed incomplete. DeepSeek v3 provided the most verbose and readable answers (mean 533.8 words; Flesch score 38.0), while ChatGPT-4o generated the shortest but most clinically reliable responses (mean reliability 3.2). Gemini 2.0 exhibited the greatest variability in length and structure. No irrelevant content was observed in any chatbot responses. Only 2/5 responses from ChatGPT-4o and 1/5 from each of the other two were directly applicable to Indian clinical practice.
Conclusion: While AI chatbots can offer factually accurate responses to ocular oncology-related queries, they often fall short in completeness and clinical applicability. ChatGPT-4o showed the most balanced performance, though regional customization and expert oversight remain essential. Current models are not yet suitable for unsupervised use in high-stakes clinical scenarios.