Maitin Ana M, Nogales Alberto, Fernández-Rincón Sergio, Aranguren Enrique, Cervera-Barba Emilio, Denizon-Arranz Sophia, Mateos-Rodríguez Alonso, García-Tejedor Álvaro J
CEIEC, Universidad Francisco de Vitoria, Pozuelo de Alarcón, 28223 Madrid, Spain.
Facultad de Medicina, Universidad Francisco de Vitoria, Pozuelo de Alarcón, 28223 Madrid, Spain.
J Am Med Inform Assoc. 2025 Feb 1;32(2):341-348. doi: 10.1093/jamia/ocae302.
We evaluate the effectiveness of large language models (LLMs), specifically GPT-based (GPT-3.5 and GPT-4) and Llama-2 models (13B and 7B architectures), in autonomously assessing clinical records (CRs) to enhance medical education and diagnostic skills.
Various techniques, including prompt engineering, fine-tuning (FT), and low-rank adaptation (LoRA), were implemented and compared on Llama-2 7B. These methods were assessed using prompts in both English and Spanish to determine their adaptability to different languages. Performance was benchmarked against GPT-3.5, GPT-4, and Llama-2 13B.
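Of the techniques compared, low-rank adaptation is the most amenable to a compact illustration. The sketch below shows the core idea in NumPy under assumptions not taken from the paper (the dimensions, rank, and scaling values are illustrative only): instead of updating a full weight matrix, LoRA learns a low-rank correction that is added to the frozen pretrained weight.

```python
import numpy as np

# Minimal sketch of low-rank adaptation (LoRA). A frozen weight matrix W
# is augmented with a trainable low-rank update B @ A (rank r much smaller
# than the matrix dimensions), scaled by alpha / r. Only A and B are
# trained, which drastically reduces the number of trainable parameters.
rng = np.random.default_rng(0)

d_out, d_in, r, alpha = 16, 16, 4, 8   # illustrative sizes and hyperparameters

W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable, small random init
B = np.zeros((d_out, r))                   # trainable, zero init so the
                                           # adapted model starts identical
                                           # to the base model

def lora_forward(x, W, A, B, alpha, r):
    """Forward pass with the LoRA correction: (W + (alpha / r) * B @ A) @ x."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B initialized to zero, the LoRA-adapted layer reproduces the frozen
# base layer exactly; training then moves B and A away from this point.
assert np.allclose(lora_forward(x, W, A, B, alpha, r), W @ x)
```

In practice the study applied this to Llama-2 7B, where the same principle holds per attention or projection matrix; libraries such as Hugging Face PEFT automate the bookkeeping across layers.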
GPT-based models, particularly GPT-4, demonstrated promising performance closely aligned with specialist evaluations. Application of FT on Llama-2 7B improved text comprehension in Spanish, equating its performance with Spanish prompts to that of Llama-2 13B with English prompts. Low-rank adaptation significantly enhanced performance, surpassing GPT-3.5 results when combined with FT. This indicates LoRA's effectiveness in adapting open-source models to specific tasks.
While GPT-4 showed superior performance, FT and LoRA on Llama-2 7B proved crucial in improving language comprehension and task-specific accuracy. Identified limitations highlight the need for further research.
This study underscores the potential of LLMs in medical education, providing an innovative, effective approach to CR correction. Low-rank adaptation emerged as the most effective technique, enabling open-source models to perform on par with proprietary models. Future research should focus on overcoming current limitations to further improve model performance.