From the Athinoula A. Martinos Center for Biomedical Imaging, Massachusetts General Hospital and Harvard Medical School, 149 Thirteenth St, Charlestown, MA 02129 (F.J.D., T.R.B., M.C.C., A.E.K., C.P.B.); Department of Radiology, Charité-Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt Universität zu Berlin, Berlin, Germany (F.J.D., L.D., F.A.M., F.B., L.J.); Department of Pediatric Oncology, Dana-Farber Cancer Institute, Boston, Mass (L.J.); Department of Diagnostic and Interventional Radiology, Technical University of Munich, Munich, Germany (L.C.A.); Mass General Brigham Data Science Office, Boston, Mass (J.S., T.S., C.P.B.); Microsoft Health and Life Sciences (HLS), Redmond, Wash (J.M.); Klinikum rechts der Isar, Technical University of Munich, Munich, Germany (K.K.B.); Department of Radiology and Nuclear Medicine, German Heart Center Munich, Munich, Germany (K.K.B.); and Department of Cardiovascular Radiology and Nuclear Medicine, Technical University of Munich, School of Medicine and Health, German Heart Center, TUM University Hospital, Munich, Germany (K.K.B.).
Radiology. 2024 Oct;313(1):e241139. doi: 10.1148/radiol.241139.
Background: Rapid advances in large language models (LLMs) have led to the development of numerous commercial and open-source models. While recent publications have explored OpenAI's GPT-4 for extracting information of interest from radiology reports, GPT-4 has not been compared with leading open-source models in a real-world setting.

Purpose: To compare leading open-source LLMs with GPT-4 on the task of extracting relevant findings from chest radiograph reports.

Materials and Methods: Two independent datasets of free-text radiology reports from chest radiograph examinations were used in this retrospective study, performed between February 2, 2024, and February 14, 2024. The first dataset consisted of reports from the ImaGenome dataset, which provides reference standard annotations for the MIMIC-CXR database acquired between 2011 and 2016. The second dataset consisted of randomly selected reports created at Massachusetts General Hospital between July 2019 and July 2021. In both datasets, the commercial models GPT-3.5 Turbo and GPT-4 were compared with open-source models, including Mistral-7B and Mixtral-8×7B (Mistral AI), Llama 2-13B and Llama 2-70B (Meta), and Qwen1.5-72B (Alibaba Group), as well as CheXbert and CheXpert-labeler (Stanford ML Group), on their ability to accurately label the presence of multiple findings in radiograph text reports using zero-shot and few-shot prompting. The McNemar test was used to compare F1 scores between models.

Results: On the ImaGenome dataset (n = 450), the highest-scoring open-source model, Llama 2-70B, achieved micro F1 scores of 0.97 and 0.97 for zero-shot and few-shot prompting, respectively, compared with GPT-4 F1 scores of 0.98 and 0.98 (P > .99 and P < .001 for superiority of GPT-4). On the institutional dataset (n = 500), the highest-scoring open-source model, an ensemble model, achieved micro F1 scores of 0.96 and 0.97 for zero-shot and few-shot prompting, respectively, compared with GPT-4 F1 scores of 0.98 and 0.97 (P < .001 and P > .99 for superiority of GPT-4).

Conclusion: Although GPT-4 was superior to open-source models in zero-shot report labeling, few-shot prompting with a small number of example reports closely matched GPT-4 performance. The benefit of few-shot prompting varied across datasets and models. © RSNA, 2024
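A minimal sketch of the evaluation the abstract describes, assuming Python with NumPy, scikit-learn, and statsmodels (none of which the abstract names): it computes pooled micro F1 scores for two labelers over a multi-finding label matrix and compares their paired per-decision correctness with a McNemar test. The report count, finding count, and error rates below are synthetic stand-ins for illustration, not the study's data.

```python
# Hypothetical illustration of micro F1 + McNemar comparison of two labelers;
# not the authors' code. Labels and error rates are synthetic.
import numpy as np
from sklearn.metrics import f1_score
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
n_reports, n_findings = 450, 5  # 450 mirrors the ImaGenome subset; 5 findings is arbitrary

# Reference standard labels and two simulated labelers with different error rates
y_true = rng.integers(0, 2, size=(n_reports, n_findings))
pred_a = np.where(rng.random(y_true.shape) < 0.98, y_true, 1 - y_true)  # stronger labeler
pred_b = np.where(rng.random(y_true.shape) < 0.97, y_true, 1 - y_true)  # weaker labeler

# Micro F1 pools every (report, finding) decision before computing precision/recall
f1_a = f1_score(y_true.ravel(), pred_a.ravel(), average="micro")
f1_b = f1_score(y_true.ravel(), pred_b.ravel(), average="micro")

# The McNemar test acts on paired correctness; only discordant pairs drive the statistic
correct_a = (pred_a == y_true).ravel()
correct_b = (pred_b == y_true).ravel()
table = np.array([[np.sum(correct_a & correct_b), np.sum(correct_a & ~correct_b)],
                  [np.sum(~correct_a & correct_b), np.sum(~correct_a & ~correct_b)]])
result = mcnemar(table, exact=False, correction=True)
print(f"micro F1: A = {f1_a:.2f}, B = {f1_b:.2f}; McNemar P = {result.pvalue:.3g}")
```

In this framing, the zero-shot versus few-shot distinction lives entirely in the prompt that produces pred_a and pred_b (instructions only versus instructions plus a handful of annotated example reports); the downstream statistical comparison is identical in both cases.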