Accuracy of ChatGPT on Medical Questions in the National Medical Licensing Examination in Japan: Evaluation Study
Author Information
Yanagita Yasutaka, Yokokawa Daiki, Uchida Shun, Tawara Junsuke, Ikusaka Masatomi
Affiliations
Department of General Medicine, Chiba University Hospital, Chiba, Japan.
Department of Internal Medicine, Sanmu Medical Center, Chiba, Japan.
Publication Information
JMIR Form Res. 2023 Oct 13;7:e48023. doi: 10.2196/48023.
BACKGROUND
ChatGPT (OpenAI) has attracted considerable attention because of its natural and intuitive responses. As OpenAI acknowledges, ChatGPT sometimes writes plausible-sounding but incorrect or nonsensical answers. However, because ChatGPT is an interactive AI trained to reduce the output of unethical sentences, the reliability of its training data is high and the usefulness of its output is promising. In March 2023, a new version, GPT-4, was released; according to OpenAI's internal evaluations, it is 40% more likely to produce factual responses than its predecessor, GPT-3.5. The usefulness of ChatGPT in English is widely appreciated, and it is increasingly being evaluated as a system for obtaining medical information in languages other than English. Although it has not reached a passing score on the Chinese national medical licensing examination, its accuracy is expected to improve gradually. Evaluations of ChatGPT with Japanese input remain limited, although there have been reports on the accuracy of its answers to clinical questions based on the Japanese Society of Hypertension guidelines and on its performance on the Japanese National Nursing Examination.
OBJECTIVE
The objective of this study is to evaluate whether ChatGPT can provide accurate diagnoses and medical knowledge for Japanese input.
METHODS
Questions from the 2022 National Medical Licensing Examination (NMLE) in Japan, administered by the Japanese Ministry of Health, Labour and Welfare, were used. All 400 questions were considered. Questions containing figures or tables, which ChatGPT cannot recognize, were excluded, and only text-only questions were retained. We entered the Japanese questions into GPT-3.5 and GPT-4 verbatim and instructed each model to output the correct answer to each question. The outputs of ChatGPT were verified by 2 general practice physicians; discrepancies were checked by a third physician, who made the final decision. Overall performance was evaluated as the percentage of correct answers output by GPT-3.5 and GPT-4.
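The scoring step described above reduces to comparing each model's answer against the examination's answer key. A minimal sketch of that computation follows; the question IDs, answer key, and model outputs here are hypothetical placeholders, not the study's data:

```python
def score_answers(model_answers, answer_key):
    """Return (number correct, percentage correct) for a set of
    multiple-choice answers keyed by question ID."""
    correct = sum(1 for qid, ans in model_answers.items()
                  if answer_key.get(qid) == ans)
    total = len(answer_key)
    return correct, 100.0 * correct / total

# Hypothetical example: 3 questions, model answers 2 correctly
answer_key = {"Q1": "a", "Q2": "c", "Q3": "e"}
model_answers = {"Q1": "a", "Q2": "c", "Q3": "b"}
n_correct, pct = score_answers(model_answers, answer_key)
```

In the study itself, each verified ChatGPT output would take the place of `model_answers`, with the official NMLE key as `answer_key`.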
RESULTS
Of the 400 questions, 292 were analyzed; questions containing figures or tables, which ChatGPT does not support, were excluded. The correct response rate for GPT-4 was 81.5% (237/292), significantly higher than that for GPT-3.5, 42.8% (125/292). Moreover, GPT-4 surpassed the NMLE passing standard (>72%), indicating its potential as a diagnostic and therapeutic decision aid for physicians.
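The abstract reports a significant difference between the two models but does not name the statistical test used. Because both models answered the same 292 questions, McNemar's test on the paired per-question outcomes is one standard choice; it depends on the discordant pairs (questions one model got right and the other wrong), which cannot be recovered from the marginal percentages alone. The sketch below therefore uses hypothetical outcome vectors, not the study's data:

```python
import math

def mcnemar(outcomes_a, outcomes_b):
    """McNemar's chi-square test (1 df, continuity-corrected) for
    paired binary outcomes from two models on the same questions."""
    # Discordant pairs: one model correct, the other incorrect.
    b = sum(1 for x, y in zip(outcomes_a, outcomes_b) if x and not y)
    c = sum(1 for x, y in zip(outcomes_a, outcomes_b) if not x and y)
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square with 1 df via the error function
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Hypothetical per-question outcomes (True = correct answer)
model_a = [True] * 120 + [False] * 10 + [True] * 5 + [False] * 15
model_b = [True] * 120 + [True] * 10 + [False] * 5 + [False] * 15
chi2, p = mcnemar(model_a, model_b)
```

With the study's actual per-question results in place of these vectors, the same function would test whether GPT-4's advantage over GPT-3.5 exceeds chance.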
CONCLUSIONS
GPT-4 reached the passing standard for the Japanese NMLE with questions entered in Japanese, although the evaluation was limited to text-only questions. As the accelerated progress of the past few months has shown, performance will continue to improve as large language models learn more, and ChatGPT may well become a decision support system for medical professionals by providing increasingly accurate information.