GPT-Driven Radiology Report Generation with Fine-Tuned Llama 3.

Authors

Voinea Ștefan-Vlad, Mămuleanu Mădălin, Teică Rossy Vlăduț, Florescu Lucian Mihai, Selișteanu Dan, Gheonea Ioana Andreea

Affiliations

Department of Automatic Control and Electronics, University of Craiova, 200585 Craiova, Romania.

Doctoral School, University of Medicine and Pharmacy of Craiova, 200349 Craiova, Romania.

Publication information

Bioengineering (Basel). 2024 Oct 18;11(10):1043. doi: 10.3390/bioengineering11101043.

DOI:10.3390/bioengineering11101043

PMID:39451418

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11504957/

Abstract

The integration of deep learning into radiology has the potential to enhance diagnostic processes, yet its acceptance in clinical practice remains limited due to various challenges. This study aimed to develop and evaluate a fine-tuned large language model (LLM), based on Llama 3-8B, to automate the generation of accurate and concise conclusions in magnetic resonance imaging (MRI) and computed tomography (CT) radiology reports, thereby assisting radiologists and improving reporting efficiency. A dataset comprising 15,000 radiology reports was collected from the University of Medicine and Pharmacy of Craiova's Imaging Center, covering a diverse range of MRI and CT examinations made by four experienced radiologists. The Llama 3-8B model was fine-tuned using transfer-learning techniques, incorporating parameter quantization to 4-bit precision and low-rank adaptation (LoRA) with a rank of 16 to optimize computational efficiency on consumer-grade GPUs. The model was trained over five epochs using an NVIDIA RTX 3090 GPU, with intermediary checkpoints saved for monitoring. Performance was evaluated quantitatively using Bidirectional Encoder Representations from Transformers Score (BERTScore), Recall-Oriented Understudy for Gisting Evaluation (ROUGE), Bilingual Evaluation Understudy (BLEU), and Metric for Evaluation of Translation with Explicit Ordering (METEOR) metrics on a held-out test set. Additionally, a qualitative assessment was conducted, involving 13 independent radiologists who participated in a Turing-like test and provided ratings for the AI-generated conclusions. The fine-tuned model demonstrated strong quantitative performance, achieving a BERTScore F1 of 0.8054, a ROUGE-1 F1 of 0.4998, a ROUGE-L F1 of 0.4628, and a METEOR score of 0.4282. In the human evaluation, the artificial intelligence (AI)-generated conclusions were preferred over human-written ones in approximately 21.8% of cases, indicating that the model's outputs were competitive with those of experienced radiologists. The average rating of the AI-generated conclusions was 3.65 out of 5, reflecting a generally favorable assessment. Notably, the model maintained its consistency across various types of reports and demonstrated the ability to generalize to unseen data. The fine-tuned Llama 3-8B model effectively generates accurate and coherent conclusions for MRI and CT radiology reports. By automating the conclusion-writing process, this approach can assist radiologists in reducing their workload and enhancing report consistency, potentially addressing some barriers to the adoption of deep learning in clinical practice. The positive evaluations from independent radiologists underscore the model's potential utility. While the model demonstrated strong performance, limitations such as dataset bias, limited sample diversity, a lack of clinical judgment, and the need for large computational resources require further refinement and real-world validation. Future work should explore the integration of such models into clinical workflows, address ethical and legal considerations, and extend this approach to generate complete radiology reports.
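
As a rough illustration of what the reported ROUGE-1 and ROUGE-L F1 scores measure, the following is a minimal sketch in plain Python. It assumes lowercase whitespace tokenization with no stemming, which is simpler than standard ROUGE tooling, and it is not the authors' evaluation code. The function names and the two example sentences are hypothetical, made up for this illustration and not taken from the study's dataset.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, no stemming.
    return text.lower().split()


def f1_from_overlap(overlap, candidate_len, reference_len):
    # Harmonic mean of precision and recall; 0.0 when nothing overlaps.
    if overlap == 0:
        return 0.0
    precision = overlap / candidate_len
    recall = overlap / reference_len
    return 2 * precision * recall / (precision + recall)


def rouge1_f1(candidate, reference):
    # ROUGE-1: clipped unigram overlap between candidate and reference.
    cand, ref = tokenize(candidate), tokenize(reference)
    overlap = sum((Counter(cand) & Counter(ref)).values())
    return f1_from_overlap(overlap, len(cand), len(ref))


def lcs_length(a, b):
    # Length of the longest common subsequence, by dynamic programming.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    # ROUGE-L: overlap measured by the longest common subsequence,
    # so word order matters.
    cand, ref = tokenize(candidate), tokenize(reference)
    return f1_from_overlap(lcs_length(cand, ref), len(cand), len(ref))


# Hypothetical conclusions, invented for this example.
reference = "no evidence of acute intracranial hemorrhage"
generated = "no acute intracranial hemorrhage identified"

print(round(rouge1_f1(generated, reference), 4))   # 0.7273
print(round(rouge_l_f1(generated, reference), 4))  # 0.7273
```

In this example four of the five generated words appear in the six-word reference in the same order, so both scores equal 8/11. Reordering the words would leave ROUGE-1 unchanged and lower ROUGE-L, which is why the two metrics are usually reported together.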
