Fast Dennis, Adams Lisa C, Busch Felix, Fallon Conor, Huppertz Marc, Siepmann Robert, Prucker Philipp, Bayerl Nadine, Truhn Daniel, Makowski Marcus, Löser Alexander, Bressem Keno K
DATEXIS, Berliner Hochschule für Technik (BHT), Berlin, Germany.
Department of Diagnostic and Interventional Radiology, Technical University of Munich, School of Medicine and Health, Klinikum rechts der Isar, TUM University Hospital, Munich, Germany.
NPJ Digit Med. 2024 Dec 12;7(1):358. doi: 10.1038/s41746-024-01356-6.
Autonomous Medical Evaluation for Guideline Adherence (AMEGA) is a comprehensive benchmark designed to evaluate large language models' adherence to medical guidelines across 20 diagnostic scenarios spanning 13 specialties. It provides an evaluation framework and methodology for assessing models' capabilities in medical reasoning, differential diagnosis, treatment planning, and guideline adherence, using open-ended questions that mirror real-world clinical interactions. The benchmark comprises 135 questions and 1337 weighted scoring elements designed to assess comprehensive medical knowledge. In tests of 17 LLMs, GPT-4 scored highest with 41.9/50, followed closely by Llama-3 70B and WizardLM-2-8x22B. For comparison, a recent medical graduate scored 25.8/50. The benchmark uses novel case content to avoid the issue of LLMs having memorized existing medical data. AMEGA's publicly available code supports further research in AI-assisted clinical decision-making, aiming to enhance patient care by aiding clinicians in diagnosis and treatment under time constraints.