Al Mohamad Fares, Donle Leonhard, Dorfner Felix, Romanescu Laura, Drechsler Kristin, Wattjes Mike P, Nawabi Jawed, Makowski Marcus R, Häntze Hartmut, Adams Lisa, Xu Lina, Busch Felix, Meddeb Aymen, Bressem Keno Kyrill
Department of Radiology, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Charitéplatz 1, 10117 Berlin, Germany (F.A.M., L.D., F.D., L.R., K.D., H.H., L.X., F.B.); Department of Obstetrics & Gynecology, University of Chicago, 5758 S Maryland Ave, Chicago, IL 60637 (L.D.).
Acad Radiol. 2025 May;32(5):2402-2410. doi: 10.1016/j.acra.2024.12.028. Epub 2025 Jan 6.
Training convolutional neural networks (CNNs) requires large labeled datasets, which can be very labor-intensive to prepare. Radiology reports contain much information that is potentially useful for such tasks; however, they are often unstructured and cannot be used directly for training. Recent progress in large language models (LLMs) may provide a new tool for interpreting radiology reports. This study explores the use of an LLM to classify radiology reports and generate labels, which are then used to train a CNN to detect ankle fractures, allowing us to evaluate the effectiveness of automatically generated labels.
We used the open-weight LLM Mixtral-8×7B-Instruct-v0.1 to classify radiology reports of ankle X-ray images as describing an ankle fracture or not. The generated labels were then used to train a CNN to recognize ankle fractures on the corresponding images. Accuracy, sensitivity, specificity, and the area under the receiver operating characteristic curve (AUC) were used for evaluation.
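For illustration, the following is a minimal sketch of such an LLM-based report classification step, assuming the Hugging Face transformers library and the public Mixtral-8x7B-Instruct-v0.1 checkpoint. The prompt wording, the yes/no answer format, and the classify_report helper are hypothetical placeholders, not the prompt used in this study.

```python
# Illustrative sketch only: the study's actual prompt and inference settings are not
# given in the abstract; the prompt text and helper below are hypothetical.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

def classify_report(report_text: str) -> int:
    """Return 1 if the LLM judges the report to describe an ankle fracture, else 0."""
    messages = [{
        "role": "user",
        "content": (
            "Does the following radiology report describe an ankle fracture? "
            "Answer with a single word: yes or no.\n\n" + report_text
        ),
    }]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=3, do_sample=False)
    # Decode only the newly generated tokens and map the answer to a binary label.
    answer = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
    return 1 if "yes" in answer.lower() else 0
```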
Using common prompt engineering techniques, a prompt was identified that reached a classification accuracy of 92% on a test dataset of 250 reports. By parsing all radiology reports with the LLM, a training dataset of 15,896 images with labels was created. A CNN trained on this dataset achieved an accuracy of 89.5% and an area under the receiver operating characteristic curve of 0.926 on a test dataset.
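The abstract does not specify the CNN architecture or training configuration; the sketch below illustrates one plausible setup, assuming a torchvision ResNet-50 backbone with a single-logit head trained on the LLM-generated labels. All hyperparameters and the train_one_epoch helper are assumptions for illustration only.

```python
# Minimal sketch of training a binary fracture classifier on LLM-generated labels.
# Architecture and hyperparameters are assumptions, not the study's actual setup.
import torch
import torch.nn as nn
from torchvision import models

device = "cuda" if torch.cuda.is_available() else "cpu"

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 1)   # single logit: fracture vs. no fracture
model = model.to(device)

criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def train_one_epoch(loader):
    """`loader` yields (image_batch, label_batch) pairs built from the LLM-generated labels."""
    model.train()
    for images, labels in loader:
        images, labels = images.to(device), labels.float().to(device)
        optimizer.zero_grad()
        logits = model(images).squeeze(1)
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()
```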
Our image classification model, trained on labels generated with a large language model, achieved high accuracy. This performance is comparable to that of models trained on manually labeled data, demonstrating the potential of language models for automating the labeling process.
Large language models can be used to reliably detect pathologies in radiology reports.
In this study, 7561 radiology reports of ankle X-ray images were automatically classified by a large language model according to whether they describe an ankle fracture. On a dataset of 250 reports, the language model reached a classification accuracy of 92%. The generated labels were used to train an image classifier to detect ankle fractures on X-ray images; a total of 15,896 images were used for training. The resulting model achieved an accuracy of 89.5% on a test dataset.
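As a minimal sketch, the evaluation metrics named above (accuracy, sensitivity, specificity, AUC) can be computed from test-set predictions with scikit-learn as shown below; the arrays are small placeholders, not data from the study.

```python
# Sketch of computing the reported evaluation metrics with scikit-learn.
# y_true and y_score are placeholder arrays, not data from the study.
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

y_true = np.array([0, 1, 1, 0, 1, 0])                # ground-truth labels (1 = fracture)
y_score = np.array([0.1, 0.8, 0.6, 0.3, 0.9, 0.4])   # predicted fracture probabilities
y_pred = (y_score >= 0.5).astype(int)                # binarize at a 0.5 threshold

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = accuracy_score(y_true, y_pred)
sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate
auc = roc_auc_score(y_true, y_score)
```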