School of Public Health, Zhejiang University School of Medicine, Hangzhou, 310058, China.
Jarvis Research Center, Tencent YouTu Lab, Beijing, 100101, China.
J Am Med Inform Assoc. 2024 Sep 1;31(9):2054-2064. doi: 10.1093/jamia/ocae079.
Large Language Models (LLMs) such as ChatGPT and Med-PaLM have excelled in various medical question-answering tasks. However, these English-centric models encounter challenges in non-English clinical settings, primarily due to limited clinical knowledge in the respective languages, a consequence of imbalanced training corpora. We systematically evaluate LLMs in the Chinese medical context and develop a novel in-context learning framework to enhance their performance.
The latest China National Medical Licensing Examination (CNMLE-2022) served as the benchmark. We collected 53 medical books and 381,149 medical questions to construct the medical knowledge base and question bank. The proposed Knowledge and Few-shot Enhancement In-context Learning (KFE) framework leverages the in-context learning ability of LLMs to integrate diverse external clinical knowledge sources. We evaluated KFE with ChatGPT (GPT-3.5), GPT-4, Baichuan2-7B, Baichuan2-13B, and QWEN-72B on the CNMLE-2022, and further investigated the effectiveness of different pathways for incorporating medical knowledge into LLMs from 7 distinct perspectives.
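The abstract does not detail how KFE assembles its prompts. The sketch below is a minimal, hypothetical illustration of the general idea of combining retrieved knowledge with few-shot examples in context: the names (build_kfe_prompt, SolvedQuestion) and the simple bigram-overlap retriever are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of a KFE-style prompt builder. The token-overlap
# retriever and all names below are illustrative assumptions: they show how
# knowledge snippets and solved example questions could be pulled into a
# single in-context prompt for an LLM.

from dataclasses import dataclass


@dataclass
class SolvedQuestion:
    question: str
    options: str
    answer: str


def _bigrams(text: str) -> set[str]:
    # Character bigrams are a crude but workable signal for Chinese text;
    # a real system would likely use a dense or sparse retriever instead.
    return {text[i:i + 2] for i in range(len(text) - 1)}


def _overlap(a: str, b: str) -> int:
    return len(_bigrams(a) & _bigrams(b))


def build_kfe_prompt(question: str,
                     options: str,
                     knowledge_base: list[str],
                     question_bank: list[SolvedQuestion],
                     k_knowledge: int = 3,
                     k_shots: int = 2) -> str:
    # Knowledge enhancement: retrieve the most relevant snippets from the
    # medical knowledge base built from textbooks.
    snippets = sorted(knowledge_base,
                      key=lambda s: _overlap(question, s),
                      reverse=True)[:k_knowledge]
    # Few-shot enhancement: retrieve similar solved questions from the bank.
    shots = sorted(question_bank,
                   key=lambda q: _overlap(question, q.question),
                   reverse=True)[:k_shots]

    parts = ["You are a physician taking the Chinese medical licensing exam."]
    parts.append("Reference knowledge:\n" +
                 "\n".join(f"- {s}" for s in snippets))
    for ex in shots:
        parts.append(f"Example question: {ex.question}\n"
                     f"Options: {ex.options}\n"
                     f"Answer: {ex.answer}")
    parts.append(f"Question: {question}\nOptions: {options}\nAnswer:")
    return "\n\n".join(parts)
```

The assembled prompt would then be passed to whichever LLM is under evaluation (e.g., GPT-3.5 or Baichuan2-13B); only the retrieval step would change if a stronger retriever or more shots were used.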
Directly applying ChatGPT failed to qualify for the CNMLE-2022, scoring 51. When combined with the KFE framework, LLMs of varying sizes yielded consistent and significant improvements: ChatGPT's performance surged to 70.04, and GPT-4 achieved the highest score of 82.59. This surpasses the qualification threshold (60) and exceeds the average human score of 68.70, affirming the effectiveness and robustness of the framework. KFE also enabled the smaller Baichuan2-13B to pass the examination, showcasing its great potential in low-resource settings.
This study sheds light on optimal practices for enhancing the capabilities of LLMs in non-English medical scenarios. By synergizing medical knowledge through in-context learning, LLMs can extend clinical insight beyond language barriers in healthcare, significantly reducing language-related disparities in LLM applications and ensuring global benefit from this field.