BACKGROUND

The impression section integrates the key findings of a radiology report but can be subjective and variable. We sought to fine-tune and evaluate an open-source large language model (LLM) for automatically generating impressions from the remainder of a radiology report across different imaging modalities and hospitals.

METHODS

In this institutional review board-approved retrospective study, we collated a dataset of CT, US, and MRI radiology reports from the University of California San Francisco Medical Center (UCSFMC) (n = 372,716) and the Zuckerberg San Francisco General (ZSFG) Hospital and Trauma Center (n = 60,049), both under a single institution. Model output was scored automatically with the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) score, a natural language evaluation metric that measures word overlap. A reader study with five cardiothoracic radiologists was performed to evaluate the model more strictly on a specific modality (CT chest exams) against a subspecialist radiologist baseline. We stratified the reader study results by diagnosis category and original impression length to gauge case complexity.
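To make the word-overlap idea behind ROUGE-L concrete, the following is a minimal sketch of a sentence-level ROUGE-L F-measure based on the longest common subsequence (LCS) of tokens. It is illustrative only: the abstract does not state which ROUGE implementation, tokenization, or F-measure weighting the authors used, so the lowercased whitespace tokenizer, the `beta` parameter, and the function names `lcs_length` and `rouge_l` are assumptions.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate, beta=1.0):
    """Sentence-level ROUGE-L F-measure in [0, 1].

    Assumption: lowercased whitespace tokenization; real ROUGE
    toolkits may tokenize and stem differently.
    """
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    recall = lcs / len(ref)       # share of reference tokens covered
    precision = lcs / len(cand)   # share of candidate tokens matched
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)


if __name__ == "__main__":
    # Hypothetical impressions, not taken from the study data.
    original = "No acute cardiopulmonary abnormality"
    generated = "No acute abnormality"
    print(round(100 * rouge_l(original, generated), 2))
```

Scores in the abstract appear to be on a 0 to 100 scale, so the sketch multiplies by 100 when printing.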

RESULTS

On UCSFMC, the LLM achieved ROUGE-L scores of 46.51, 44.2, and 50.96 for CT, US, and MRI respectively. Upon external validation on ZSFG, it achieved ROUGE-L scores of 40.74, 37.89, and 24.61 for the same modalities. These scores imply a substantial degree of overlap between the model-generated impressions and those written by the subspecialist attending radiologists, with some degradation upon external validation. In our reader study, model-generated impressions versus the original impressions written by a subspecialist radiologist achieved overall mean scores of 3.56/4 versus 3.75/4 for clinical accuracy, 3.92/4 versus 3.87/4 for grammatical accuracy, 3.37/4 versus 3.54/4 for stylistic quality, 18.29 s versus 12.2 s for edit time, 12.32 versus 5.74 words for edit distance, and 84 versus 89 for ROUGE-L score. The LLM achieved its highest clinical accuracy ratings for acute/emergent findings and on shorter impressions.
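The edit distance above is reported in words. One plausible reading, sketched here, is a word-level Levenshtein distance between an impression and the version a reader edited it into. This is an assumption: the abstract does not define how edit distance was computed, and the function name `word_edit_distance` and the whitespace tokenization are hypothetical.

```python
def word_edit_distance(original, edited):
    """Minimum number of word insertions, deletions, and substitutions
    needed to turn `original` into `edited` (word-level Levenshtein).

    Assumption: words are separated by whitespace and compared exactly.
    """
    a = original.split()
    b = edited.split()
    prev = list(range(len(b) + 1))  # distance from empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(
                prev[j] + 1,         # delete x
                curr[j - 1] + 1,     # insert y
                prev[j - 1] + cost,  # keep or substitute
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # Hypothetical draft and reader-edited impression.
    draft = "No acute findings"
    final = "No acute cardiopulmonary findings"
    print(word_edit_distance(draft, final))
```

Under this reading, a lower mean distance means readers changed fewer words before accepting an impression.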

CONCLUSIONS

An open-source fine-tuned LLM can generate impressions to a satisfactory level of clinical accuracy, grammatical accuracy, and stylistic quality. Our reader performance study demonstrates the potential of large language models in drafting radiology report impressions that can aid in streamlining radiologists' workflows.

An open-source fine-tuned large language model for radiological impression generation: a multi-reader performance study.
