Zhang Yong, Lu Xiao, Luo Yan, Zhu Ying, Ling Wenwu
Department of Medical Ultrasound, West China Hospital of Sichuan University, 37 Guoxue Alley, Chengdu, 610041, China. Phone: +86 18980605569.
Department of Thoracic Surgery, West China Hospital of Sichuan University, Chengdu, China.
JMIR Med Inform. 2025 Jan 9;13:e63924. doi: 10.2196/63924.
Artificial intelligence chatbots are increasingly being used for medical inquiries, particularly in ultrasound medicine. However, their performance varies and is influenced by factors such as language, question type, and topic.
This study aimed to evaluate the performance of ChatGPT and ERNIE Bot in answering ultrasound-related medical examination questions, providing insights for users and developers.
We curated 554 questions from ultrasound medicine examinations, covering a range of question types and topics. Each question was posed in both English and Chinese. Objective questions were scored by accuracy rate, and subjective questions were rated by 5 experienced doctors on a Likert scale. The data were analyzed in Excel.
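As a reading aid, here is a minimal Python sketch of the two scoring schemes described above (the study's own analysis was done in Excel): an accuracy rate for objective questions and an acceptability rate for subjective questions. The 5-point scale and the mean-rating cutoff of 4 are illustrative assumptions; the abstract does not specify either.

```python
from statistics import mean

def accuracy_rate(results: list[bool]) -> float:
    """Objective questions: proportion answered correctly.
    results[i] is True if the chatbot answered question i correctly."""
    return sum(results) / len(results)

def acceptability_rate(ratings: list[list[int]], cutoff: float = 4.0) -> float:
    """Subjective questions: each answer is rated by 5 doctors on a Likert scale.
    ASSUMPTION: a 5-point scale where a mean rating >= cutoff counts as
    acceptable; the abstract does not state the scale or cutoff used."""
    acceptable = [mean(r) >= cutoff for r in ratings]
    return sum(acceptable) / len(acceptable)

# Toy example: 3 objective answers (2 correct) and 2 subjective answers.
print(accuracy_rate([True, True, False]))                      # 0.666...
print(acceptability_rate([[5, 4, 4, 5, 4], [2, 3, 2, 3, 2]]))  # 0.5
```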
Of the 554 questions included in this study, single-choice questions accounted for the largest share (354/554, 64%), followed by short-answer questions (69/554, 12%) and noun explanations (term definitions; 63/554, 11%). Accuracy rates for objective questions ranged from 8.33% to 80%, with true-or-false questions scoring highest. Subjective questions received acceptability rates ranging from 47.62% to 75.36%. ERNIE Bot outperformed ChatGPT on many measures (P<.05). Both models performed worse in English than in Chinese, but ERNIE Bot's decline was smaller. Both models performed better on basic knowledge, ultrasound methods, and diseases than on ultrasound signs and diagnosis.
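The abstract reports significance at P<.05 without naming the test. For binary outcomes such as correct/incorrect counts per model, a chi-square test on a 2×2 contingency table is one standard choice (a McNemar test would be the paired alternative, since both models answer the same questions); the sketch below uses invented counts purely for illustration.

```python
from scipy.stats import chi2_contingency

# Hypothetical counts, NOT the study's data: correct vs. incorrect
# answers by two chatbots on the same set of objective questions.
#          correct  incorrect
table = [[290,      64],   # "Model A"
         [255,      99]]   # "Model B"

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p:.4f}")  # p < .05 -> statistically significant difference
```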
Chatbots can provide valuable ultrasound-related answers, but performance differs by model and is influenced by language, question type, and topic. In general, ERNIE Bot outperforms ChatGPT. Users and developers should understand model performance characteristics and select appropriate models for different questions and languages to optimize chatbot use.