

Performance of ChatGPT on the Chinese Postgraduate Examination for Clinical Medicine: Survey Study.

Affiliations

Department of Endocrinology, The Second Affiliated Hospital of Nanchang University, Jiangxi, China.

Department of Cardiology, The Eighth Affiliated Hospital of Sun Yat-sen University, Shenzhen, China.

Publication Information

JMIR Med Educ. 2024 Feb 9;10:e48514. doi: 10.2196/48514.

Abstract

BACKGROUND

ChatGPT, an artificial intelligence (AI) based on large-scale language models, has sparked interest in the field of health care. Nonetheless, the capabilities of AI in text comprehension and generation are constrained by the quality and volume of available training data for a specific language, and the performance of AI across different languages requires further investigation. While AI harbors substantial potential in medicine, it is imperative to tackle challenges such as the formulation of clinical care standards; facilitating cultural transitions in medical education and practice; and managing ethical issues including data privacy, consent, and bias.

OBJECTIVE

The study aimed to evaluate ChatGPT's performance in processing Chinese Postgraduate Examination for Clinical Medicine questions, assess its clinical reasoning ability, investigate potential limitations with the Chinese language, and explore its potential as a valuable tool for medical professionals in the Chinese context.

METHODS

A data set of Chinese Postgraduate Examination for Clinical Medicine questions was used to assess the effectiveness of ChatGPT's (version 3.5) medical knowledge in the Chinese language. The data set comprised 165 medical questions divided into three categories: (1) common questions (n=90) assessing basic medical knowledge, (2) case analysis questions (n=45) focusing on clinical decision-making through patient case evaluations, and (3) multichoice questions (n=30) requiring the selection of multiple correct answers. First, we assessed whether ChatGPT could meet the stringent cutoff score defined by the government agency, which requires a performance within the top 20% of candidates. Additionally, in our evaluation of ChatGPT's performance on both original and encoded medical questions, 3 primary indicators were used: accuracy, concordance (whether the explanation validates the chosen answer), and the frequency of insights.
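The three indicators are simple per-question tallies over a set of graded responses. As an illustrative sketch only (the class, field names, and toy data below are hypothetical, not taken from the study's grading protocol):

```python
from dataclasses import dataclass

@dataclass
class GradedResponse:
    category: str     # "common", "case_analysis", or "multichoice"
    correct: bool     # answer matched the key (accuracy)
    concordant: bool  # explanation supported the chosen answer (concordance)
    insights: int     # count of novel, relevant points in the response

def summarize(responses):
    """Aggregate the three indicators over a graded response set."""
    n = len(responses)
    accuracy = sum(r.correct for r in responses) / n
    concordance = sum(r.concordant for r in responses) / n
    correct = [r for r in responses if r.correct]
    insights_per_correct = (
        sum(r.insights for r in correct) / len(correct) if correct else 0.0
    )
    return accuracy, concordance, insights_per_correct

# Toy data: 2 of 4 correct -> accuracy 0.5; 3 concordant -> 0.75;
# insights among correct responses: (3 + 1) / 2 = 2.0
sample = [
    GradedResponse("common", True, True, 3),
    GradedResponse("common", False, True, 0),
    GradedResponse("case_analysis", True, True, 1),
    GradedResponse("multichoice", False, False, 2),
]
acc, conc, ins = summarize(sample)
```

Note that, as in the study's reporting, insights are averaged only over accurate responses, while accuracy and concordance are computed over all questions.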

RESULTS

Our evaluation revealed that ChatGPT scored 153.5 out of 300 on the original questions in Chinese, surpassing the minimum passing score, which is set to ensure that the number of candidates who pass exceeds the enrollment quota by at least 20%. However, ChatGPT had low accuracy in answering open-ended medical questions, with an overall accuracy of only 31.5%. The accuracy for common questions, multichoice questions, and case analysis questions was 42%, 37%, and 17%, respectively. ChatGPT achieved 90% concordance across all questions. Among correct responses, concordance was 100%, significantly exceeding that of incorrect responses (n=57, 50%; P<.001). ChatGPT provided innovative insights for 80% (n=132) of all questions, with an average of 2.95 insights per accurate response.

CONCLUSIONS

Although ChatGPT surpassed the passing threshold for the Chinese Postgraduate Examination for Clinical Medicine, its performance in answering open-ended medical questions was suboptimal. Nonetheless, ChatGPT exhibited high internal concordance and the ability to generate multiple insights in the Chinese language. Future research should investigate the language-based discrepancies in ChatGPT's performance within the health care context.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f6c0/10891494/f6ac10d4ff65/mededu_v10i1e48514_fig1.jpg
