Dastani Meisam, Mardaneh Jalal, Rostamian Morteza
Infectious Diseases Research Center, Gonabad University of Medical Sciences, Gonabad, Iran.
Department of Microbiology, Infectious Diseases Research Center, School of Medicine, Gonabad University of Medical Sciences, Gonabad, Iran.
Sci Rep. 2025 May 23;15(1):18004. doi: 10.1038/s41598-025-03074-9.
This study evaluates the capability of large language models (LLMs) to answer questions about tuberculosis. Three LLMs (ChatGPT, Gemini, and Copilot) were selected for their public accessibility and their ability to respond to medical questions. Questions were designed across four main domains: diagnosis, treatment, prevention and control, and disease management. The responses were then evaluated using the DISCERN-AI and NLAT-AI assessment tools. ChatGPT achieved high scores (4 out of 5) across all domains, while Gemini performed best in specific areas such as prevention and control, with a score of 4.4. Copilot showed the weakest performance in disease management, with a score of 3.6. In the diagnosis domain, all three models performed equally (4 out of 5). On the DISCERN-AI criteria, ChatGPT excelled in information relevance but fell short in citing sources and in stating when the information was produced. All three models performed similarly on the balance and objectivity indicators. Although all three models show acceptable capability in answering medical questions about tuberculosis, they share common limitations, such as insufficient source citation and failure to acknowledge uncertainty in their responses. Improving these models could strengthen their role in providing medical information.