
Performance and exploration of ChatGPT in medical examination, records and education in Chinese: Pave the way for medical AI.

Affiliations

Department of Pain Management, Xuanwu Hospital, Capital Medical University.

Department of Anesthesia, China-Japan Union Hospital of Jilin University.

Publication Information

Int J Med Inform. 2023 Sep;177:105173. doi: 10.1016/j.ijmedinf.2023.105173. Epub 2023 Aug 4.

Abstract

BACKGROUND

Although chat generative pre-trained transformer (ChatGPT) has been applied successfully in the medical field, most notably for answering medical questions in English, no study has evaluated ChatGPT's performance on medical tasks in a Chinese context.

OBJECTIVE

The aim of this study was to evaluate ChatGPT's ability to understand medical knowledge in Chinese, as well as its potential to serve as an electronic health infrastructure for medical development, by evaluating its performance in medical examinations, records, and education.

METHOD

The Chinese (CNMLE) and English (ENMLE) datasets of the China National Medical Licensing Examination and the Chinese dataset (NEEPM) of the China National Entrance Examination for Postgraduate Clinical Medicine Comprehensive Ability were used to evaluate the performance of ChatGPT (GPT-3.5 and GPT-4). Across multiple runs, we assessed answer accuracy and verbal fluency, and classified incorrect responses by hallucination type. In addition, we tested ChatGPT's performance on discharge summaries and group learning in a Chinese context on a small scale.

RESULTS

The accuracy of GPT-3.5 on the CNMLE, ENMLE, and NEEPM was 56% (56/100), 76% (76/100), and 62% (62/100), respectively, whereas GPT-4 achieved 84% (84/100), 86% (86/100), and 82% (82/100). The verbal fluency of all ChatGPT responses exceeded 95%. Among the incorrect GPT-3.5 responses, open-domain hallucinations accounted for 66% (29/44), 54% (14/24), and 63% (24/38), while close-domain hallucinations accounted for 34% (15/44), 46% (14/24), and 37% (14/38), respectively. By contrast, for GPT-4, open-domain hallucinations accounted for 56% (9/16), 43% (6/14), and 83% (15/18), and close-domain hallucinations for 44% (7/16), 57% (8/14), and 17% (3/18), respectively. In discharge summaries, ChatGPT demonstrated logical coherence; however, GPT-3.5 could not meet the quality requirements, whereas GPT-4 met them in 60% (6/10) of cases. In group learning, verbal fluency and interaction satisfaction with ChatGPT were both 100% (10/10).

CONCLUSION

ChatGPT based on GPT-4 is on par with Chinese medical practitioners who passed the CNMLE and meets the standard required for admission to clinical medical graduate programs in China. GPT-4 shows promising potential for discharge summarization and group learning. It also exhibits high verbal fluency, yielding a positive human-computer interaction experience. Compared with the previous GPT-3.5 model, GPT-4 significantly improves multiple capabilities and reduces hallucinations, with a particular leap forward in Chinese comprehension of medical tasks. Artificial intelligence (AI) systems still face challenges from hallucinations, legal risks, and ethical issues. Nevertheless, we found that ChatGPT has the potential to promote medical development as an electronic health infrastructure, paving the way for medical AI.
