Kumari Amita, Kumari Anita, Singh Amita, Singh Sanjeet K, Juhi Ayesha, Dhanvijay Anup Kumar D, Pinjar Mohammed Jaffer, Mondal Himel
Physiology, All India Institute of Medical Sciences, Deoghar, Deoghar, IND.
Pathology, All India Institute of Medical Sciences, Deoghar, Deoghar, IND.
Cureus. 2023 Aug 21;15(8):e43861. doi: 10.7759/cureus.43861. eCollection 2023 Aug.
Background: Large language models (LLMs), such as ChatGPT-3.5, Google Bard, and Microsoft Bing, have shown promising capabilities in various natural language processing (NLP) tasks. However, their performance and accuracy in solving domain-specific questions, particularly in the field of hematology, have not been extensively investigated.
Objective: This study aimed to explore the capability of three LLMs, ChatGPT-3.5, Google Bard, and Microsoft Bing (Precise), in solving hematology case vignettes, and to compare their performance.
Methods: This was a cross-sectional study conducted in the Departments of Physiology and Pathology, All India Institute of Medical Sciences, Deoghar, Jharkhand, India. We curated a set of 50 hematology cases covering a range of topics and complexity levels. The dataset included queries related to blood disorders, hematologic malignancies, laboratory test parameters, calculations, and treatment options. Each case and its related question were prepared together with a set of correct answers for comparison. We used ChatGPT-3.5, Google Bard Experiment, and Microsoft Bing (Precise) for the question-answering task. The answers were checked by two physiologists and one pathologist, who rated each answer on a scale of one to five. The average scores of the three models were compared with Friedman's test followed by Dunn's post-hoc test. Each LLM's performance was compared against a hypothesized median of 2.5 with a one-sample median test, as the curriculum from which the questions were curated uses a 50% pass grade.
Results: The scores of the three LLMs differed significantly (p-value < 0.0001), with the highest score achieved by ChatGPT (3.15±1.19), followed by Bard (2.23±1.17) and Bing (1.98±1.01). ChatGPT's score was significantly higher than 50% (p-value = 0.0004), Bard's score was not significantly different from 50% (p-value = 0.38), and Bing's score was significantly lower than the pass score (p-value = 0.0015).
Conclusion: The three LLMs showed significant differences in solving case vignettes in hematology. ChatGPT achieved the highest score, followed by Google Bard and Microsoft Bing. The observed performance trends suggest that ChatGPT holds promising potential in the medical domain. However, none of the models was capable of answering all questions accurately. Further research and optimization of language models could offer valuable contributions to healthcare and medical education applications.
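The statistical workflow described in the Methods (Friedman's test with Dunn's post-hoc comparisons, plus a one-sample test against the 2.5 pass median) can be illustrated with a short Python sketch. This is not the authors' code, and the per-case scores below are hypothetical placeholders, since the abstract does not publish raw data; scipy is assumed for the Friedman and Wilcoxon signed-rank tests, and the scikit-posthocs package for Dunn's test.

import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon
import scikit_posthocs as sp

rng = np.random.default_rng(0)
# Hypothetical 1-5 ratings for 50 cases per model; stand-ins for the
# study's actual averaged rater scores, which are not in the abstract.
chatgpt = rng.integers(1, 6, 50).astype(float)
bard = rng.integers(1, 6, 50).astype(float)
bing = rng.integers(1, 6, 50).astype(float)

# Friedman's test: nonparametric repeated-measures comparison of the
# three models rated on the same 50 cases.
stat, p = friedmanchisquare(chatgpt, bard, bing)
print(f"Friedman chi-square = {stat:.2f}, p = {p:.4f}")

# Dunn's post-hoc test with a multiple-comparison correction, to locate
# which pairs of models differ.
print(sp.posthoc_dunn([chatgpt, bard, bing], p_adjust="bonferroni"))

# One-sample test against the hypothesized median of 2.5 (the 50% pass
# grade); a Wilcoxon signed-rank test on the differences is one common
# way to implement a one-sample median test.
for name, scores in [("ChatGPT", chatgpt), ("Bard", bard), ("Bing", bing)]:
    w, p = wilcoxon(scores - 2.5)
    print(f"{name}: Wilcoxon vs. median 2.5 -> p = {p:.4f}")

Replacing the placeholder arrays with the actual per-case ratings would reproduce the comparisons reported in the Results.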