Department of Addiction Science, Kaohsiung Municipal Kai-Syuan Psychiatric Hospital, Kaohsiung, Taiwan.
Department of Nursing, Meiho University, Pingtung, Taiwan.
Psychiatry Clin Neurosci. 2024 Jun;78(6):347-352. doi: 10.1111/pcn.13656. Epub 2024 Feb 26.
Large language models (LLMs) have been proposed as tools for medical education and medical practice. However, their potential for application in the psychiatric domain has not been well studied.
In the first step, we compared the performance of ChatGPT GPT-4, Bard, and Llama-2 on the 2022 Taiwan Psychiatric Licensing Examination, conducted in traditional Mandarin. In the second step, we compared the scores of these three LLMs with those of 24 experienced psychiatrists on 10 advanced clinical scenario questions designed for psychiatric differential diagnosis.
Only GPT-4 passed the 2022 Taiwan Psychiatric Licensing Examination, scoring 69 (a score of ≥ 60 is considered passing), while Bard scored 36 and Llama-2 scored 25. GPT-4 outperformed Bard and Llama-2, especially in the areas of 'Pathophysiology & Epidemiology' (χ² = 22.4, P < 0.001) and 'Psychopharmacology & Other therapies' (χ² = 15.8, P < 0.001). In the differential diagnosis task, the mean score of the 24 experienced psychiatrists (mean 6.1, standard deviation 1.9) was higher than the scores of GPT-4 (5), Bard (3), and Llama-2 (1).
Compared to Bard and Llama-2, GPT-4 demonstrated superior abilities in identifying psychiatric symptoms and making clinical judgments. In addition, GPT-4's differential diagnosis ability closely approached that of the experienced psychiatrists. Among the three LLMs, GPT-4 showed the most promise as a tool for psychiatric practice.
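To make the differential diagnosis comparison easier to read, the reported LLM scores can be expressed as distances from the psychiatrists' mean in units of the psychiatrists' standard deviation. The sketch below uses only the figures reported in the abstract (mean 6.1, standard deviation 1.9, and LLM scores of 5, 3, and 1). The function name `sd_distance` and the variable names are illustrative assumptions. This is a descriptive rescaling, not a statistical test reported by the authors, and it makes no claim about how the psychiatrists' scores were distributed.

```python
# Figures taken from the abstract; everything else here is illustrative.
PSYCHIATRIST_MEAN = 6.1  # mean score of the 24 psychiatrists (out of 10 questions)
PSYCHIATRIST_SD = 1.9    # standard deviation of the psychiatrists' scores

LLM_SCORES = {"GPT-4": 5, "Bard": 3, "Llama-2": 1}


def sd_distance(score: float, mean: float, sd: float) -> float:
    """Return how many standard deviations `score` lies from `mean`."""
    return (score - mean) / sd


# Distance of each LLM score from the psychiatrists' mean, in SD units.
distances = {
    name: sd_distance(score, PSYCHIATRIST_MEAN, PSYCHIATRIST_SD)
    for name, score in LLM_SCORES.items()
}

for name, d in distances.items():
    print(f"{name}: {d:+.2f} SD relative to the psychiatrists' mean")
```

On these figures, GPT-4 falls about 0.6 standard deviations below the psychiatrists' mean, Bard about 1.6, and Llama-2 about 2.7, which is one way to read the statement that GPT-4 "closely approached" the psychiatrists while the other two models did not.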