Tse Chiang Chen, Emily Kaminski, Laila Koduri, Alyssa Singer, Jorie Singer, Mitch Couldwell, Johnny Delashaw, Aaron Dumont, Arthur Wang
Department of Neurology, Tulane University School of Medicine, New Orleans, Louisiana, USA.
Tulane University School of Medicine, New Orleans, Louisiana, USA.
World Neurosurg. 2023 Nov;179:e342-e347. doi: 10.1016/j.wneu.2023.08.088. Epub 2023 Aug 26.
ChatGPT is a large language model artificial intelligence chatbot that has been applied to many aspects of medicine. Our study aims to assess ChatGPT's ability to evaluate patients from written neurological examinations and assign scores on established scales, including the Glasgow Coma Scale (GCS), the intracerebral hemorrhage (ICH) score, and the Hunt & Hess (H&H) classification.
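For readers unfamiliar with the scales, the GCS is the sum of three examination components (eye opening 1-4, verbal response 1-5, motor response 1-6; total 3-15). The following minimal Python sketch, with a hypothetical helper name not taken from the study, illustrates the scoring that ChatGPT was asked to reproduce from free-text examinations:

```python
# Minimal sketch of the Glasgow Coma Scale computation (hypothetical helper,
# not code from the study). Components: eye opening 1-4, verbal response 1-5,
# motor response 1-6; the total ranges from 3 to 15.

def gcs_total(eye: int, verbal: int, motor: int) -> int:
    """Return the GCS total from its three examination components."""
    if not (1 <= eye <= 4 and 1 <= verbal <= 5 and 1 <= motor <= 6):
        raise ValueError("GCS component out of range")
    return eye + verbal + motor

# Example: eye opening to pain (E2), incomprehensible sounds (V2),
# withdrawal from pain (M4) -> GCS 8.
assert gcs_total(2, 2, 4) == 8
```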
We created batches of patient test cases with detailed neurological examinations, 20 cases in total, along with variants that phrased each case with increasing complexity. Using ChatGPT, we assessed repeatability and quantified errors with two metrics: the average error rate (AER) and the average magnitude of error (AME). We repeated this process for the H&H and ICH scores using the base cases. Specific prompts were created for each calculator.
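The abstract does not define AER and AME precisely; a plausible reading, consistent with the numbers reported below, is that AER is the fraction of cases scored incorrectly and AME is the mean absolute deviation from the reference score across all cases. A hypothetical sketch of that bookkeeping:

```python
# Hypothetical reconstruction of the error metrics as we read them from the
# abstract (assumed definitions, not the authors' published code):
# AER = fraction of cases where the model's score differs from the reference;
# AME = mean absolute difference between model and reference scores.

def aer_ame(predicted: list[int], reference: list[int]) -> tuple[float, float]:
    assert len(predicted) == len(reference)
    errors = [abs(p - r) for p, r in zip(predicted, reference)]
    aer = sum(e > 0 for e in errors) / len(errors)
    ame = sum(errors) / len(errors)
    return aer, ame

# Example: 10 GCS cases, one off by 1 and one off by 2 -> AER 20%, AME 0.300.
preds = [15, 14, 8, 7, 3, 12, 10, 9, 6, 13]
refs  = [15, 14, 8, 7, 3, 12, 10, 10, 4, 13]
aer, ame = aer_ame(preds, refs)
print(f"AER = {aer:.0%}, AME = {ame:.3f}")  # AER = 20%, AME = 0.300
```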
On the 10 base test cases, the GCS calculator had an AER/AME of 10%/0.150. ChatGPT's accuracy decreased with increasing complexity; for example, in a variant in which crucial information was missing, the AER rose to 45% across 20 cases. For H&H, the AER/AME was 13%/0.13, and for ICH, 27.5%/0.325. Using a simple prompt resulted in a significantly higher error rate of 70%.
In this proof-of-concept experiment, ChatGPT demonstrated the ability to evaluate neurological examinations using established assessment scales, including the GCS, ICH score, and H&H classification. However, its accuracy is limited, and it may "hallucinate" when descriptions are complex or vague. Nonetheless, ChatGPT has promising potential in medicine.