From the University of Maryland Medical Intelligent Imaging (UM2ii) Center, Department of Radiology and Nuclear Medicine, University of Maryland School of Medicine, 22 S Greene St, Baltimore, MD 21201 (F.X.D., D.S., A.K., P.H.Y., V.S.P.); Department of Radiology, University of Michigan, Ann Arbor, Mich (R.C.C.); and Department of Computer Science and Electrical Engineering, University of Maryland Baltimore County, Baltimore, Md (A.J.).
Radiology. 2024 Aug;312(2):e240320. doi: 10.1148/radiol.240320.
Background: Large language models (LLMs) for medical applications use unknown amounts of energy, which contribute to the overall carbon footprint of the health care system.

Purpose: To investigate the tradeoffs between accuracy and energy use when using different LLM types and sizes for medical applications.

Materials and Methods: This retrospective study evaluated five different billion (B)-parameter sizes of two open-source LLMs (Meta's Llama 2, a general-purpose model, and LMSYS Org's Vicuna 1.5, a specialized fine-tuned model) using chest radiograph reports from the National Library of Medicine's Indiana University Chest X-ray Collection. Reports with missing demographic information and missing or blank files were excluded. Models were run on local compute clusters with visual computing graphics processing units. A single-task prompt explained clinical terminology and instructed each model to confirm the presence or absence of each of the 13 CheXpert disease labels. Energy use (in kilowatt-hours) was measured using an open-source tool. Accuracy was assessed against the 13 CheXpert reference standard labels for diagnostic findings on chest radiographs, with overall accuracy defined as the mean of the individual accuracies of all 13 labels. Efficiency ratios (accuracy per kilowatt-hour) were calculated for each model type and size.

Results: A total of 3665 chest radiograph reports were evaluated. The Vicuna 1.5 7B and 13B models had higher efficiency ratios (737.28 and 331.40, respectively) and higher overall labeling accuracy (93.83% [3438.69 of 3665 reports] and 93.65% [3432.38 of 3665 reports], respectively) than the Llama 2 models (7B: efficiency ratio of 13.39, accuracy of 7.91% [289.76 of 3665 reports]; 13B: efficiency ratio of 40.90, accuracy of 74.08% [2715.15 of 3665 reports]; 70B: efficiency ratio of 22.30, accuracy of 92.70% [3397.38 of 3665 reports]). Vicuna 1.5 7B had the highest efficiency ratio (737.28 vs 13.39 for Llama 2 7B). The larger Llama 2 70B model used more than seven times the energy of its 7B counterpart (4.16 kWh vs 0.59 kWh) with low overall accuracy, resulting in an efficiency ratio of only 22.30.

Conclusion: Smaller fine-tuned LLMs were more sustainable than larger general-purpose LLMs, using less energy without compromising accuracy, highlighting the importance of LLM selection for medical applications.

© RSNA, 2024
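The two summary metrics described above (overall accuracy as the mean of the per-label accuracies, and the efficiency ratio as accuracy per kilowatt-hour) can be illustrated with a minimal Python sketch. It assumes that accuracy enters the ratio in percentage points, which appears consistent with the reported figures but is not stated explicitly in the abstract. The function names and the `REPORTED` table layout are hypothetical. The abstract does not give the per-label accuracies or the energy values for three of the five models, so the sketch only back-calculates an implied energy from the reported accuracy and ratio rather than supplying any measured value.

```python
from statistics import mean


def overall_accuracy(per_label_accuracy_pct):
    """Overall accuracy as the mean of per-label accuracies (percent).

    `per_label_accuracy_pct` maps a label name to its accuracy in percent.
    """
    return mean(per_label_accuracy_pct.values())


def efficiency_ratio(accuracy_pct, energy_kwh):
    """Accuracy (percentage points, an assumption) per kilowatt-hour."""
    if energy_kwh <= 0:
        raise ValueError("energy_kwh must be positive")
    return accuracy_pct / energy_kwh


def implied_energy_kwh(accuracy_pct, ratio):
    """Energy implied by a reported accuracy and efficiency ratio."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return accuracy_pct / ratio


# Figures reported in the abstract: (overall accuracy in %, efficiency ratio).
REPORTED = {
    "Vicuna 1.5 7B": (93.83, 737.28),
    "Vicuna 1.5 13B": (93.65, 331.40),
    "Llama 2 7B": (7.91, 13.39),
    "Llama 2 13B": (74.08, 40.90),
    "Llama 2 70B": (92.70, 22.30),
}

if __name__ == "__main__":
    for model, (accuracy_pct, ratio) in REPORTED.items():
        # Implied energy is derived from rounded inputs, so it is approximate.
        energy = implied_energy_kwh(accuracy_pct, ratio)
        print(f"{model}: implied energy of about {energy:.2f} kWh")
```

For the two models whose energy use the abstract does report (Llama 2 7B at 0.59 kWh and Llama 2 70B at 4.16 kWh), the ratio recomputed from the rounded values agrees with the published ratio to within rounding error, which supports the percentage-point reading of the formula.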