Department of Paediatrics, Schulich School of Medicine and Dentistry, Western University, London, Ontario, Canada.
Division of Nephrology, Children's Hospital, London Health Sciences Centre, London, Ontario, Canada.
PLoS One. 2024 Jul 31;19(7):e0307383. doi: 10.1371/journal.pone.0307383. eCollection 2024.
ChatGPT is a large language model (LLM) trained on over 400 billion words from books, articles, and websites. This extensive training gives it a broad base of information to draw on, making it potentially valuable as a diagnostic aid. Moreover, its capacity to comprehend and generate human language allows medical trainees to interact with it, enhancing its appeal as an educational resource. This study aims to investigate ChatGPT's diagnostic accuracy and utility in medical education.
A total of 150 Medscape case challenges (September 2021 to January 2023) were input into ChatGPT. The primary outcome was the number (%) of cases answered correctly. Secondary outcomes included diagnostic accuracy, cognitive load, and the quality of the medical information provided. A qualitative content analysis of the responses was also conducted.
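(The study entered cases through the public ChatGPT interface; purely as an illustration of how a similar batch evaluation could be scripted, the sketch below uses the OpenAI Python SDK. The file cases.json, its field names, and the string-match grading step are hypothetical stand-ins for the study's manual workflow.)

import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical input: one record per case with the vignette text,
# the multiple-choice options, and the reference (correct) answer.
with open("cases.json") as f:
    cases = json.load(f)

correct = 0
for case in cases:
    prompt = (
        case["vignette"]
        + "\n\nChoose the most likely diagnosis from the options below "
        + "and explain your reasoning:\n"
        + "\n".join(case["options"])
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumption; the study used the ChatGPT web interface
        messages=[{"role": "user", "content": prompt}],
    )
    reply = response.choices[0].message.content or ""
    # Naive grading: does the reply mention the reference answer verbatim?
    # The study used human raters; this string match is only a stand-in.
    if case["answer"].lower() in reply.lower():
        correct += 1

print(f"ChatGPT answered {correct}/{len(cases)} cases correctly")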
ChatGPT answered 49% (74/150) of cases correctly. It had an overall accuracy of 74%, a precision of 48.67%, a sensitivity of 48.67%, a specificity of 82.89%, and an AUC of 0.66. Most answers were rated as low cognitive load (51%; 77/150), and most were complete and relevant (52%; 78/150).
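For reference, the reported metrics follow the standard confusion-matrix definitions:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)

The abstract does not state the scoring scheme, but as a hedged reconstruction, treating every multiple-choice option of every case as a separate binary classification (four options per case, 600 judgments in total) with TP = 73, FP = 77, FN = 77, and TN = 373 reproduces the reported figures: precision = sensitivity = 73/150 = 48.67%, specificity = 373/450 = 82.89%, and accuracy = 446/600 ≈ 74%.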
ChatGPT in its current form is not accurate as a diagnostic tool. Despite the vast amount of information it was trained on, ChatGPT does not necessarily give factually correct answers. Based on our qualitative analysis, ChatGPT struggles with interpreting laboratory values and imaging results and may overlook key information relevant to the diagnosis. However, it still offers utility as an educational tool. ChatGPT was generally correct in ruling out specific differential diagnoses and in suggesting reasonable next diagnostic steps. Additionally, its answers were easy to understand, showcasing a potential benefit in simplifying complex concepts for medical learners. Our results should guide future research into harnessing ChatGPT's potential educational benefits, such as simplifying medical concepts and offering guidance on differential diagnoses and next steps.