Güzel Hamza Eren, Aşcı Göktuğ, Demirbilek Oytun, Özdemir Tuğçe Doğa, Erekli Pelin Berfin
Department of Radiology, İzmir City Hospital, İzmir, Türkiye.
School of Science and Technology, IE University, Segovia, Spain.
Front Radiol. 2025 May 9;5:1509377. doi: 10.3389/fradi.2025.1509377. eCollection 2025.
This study aimed to determine the diagnostic accuracy of a deep learning algorithm for the classification of non-contrast brain CT reports.
A total of 1,861 non-contrast brain CT reports were randomly selected, anonymized, and annotated for urgency level by two radiologists, with review by a senior radiologist. The data, encrypted and stored in Excel format, were securely maintained on a university cloud system. Using Python 3.8.16, the reports were classified into four urgency categories: emergency, not emergency but requiring timely attention, clinically non-significant, and normal. The dataset was split into 800 reports for training and 200 for validation. The DistilBERT model, with six transformer layers and 66 million trainable parameters, was used for text classification. Training used the Adam optimizer with a learning rate of 2e-5, a batch size of 32, and a dropout rate of 0.1 to mitigate overfitting. The model achieved a mean F1 score of 0.85 across 5-fold cross-validation, demonstrating strong performance in categorizing radiology reports.
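The reported 800/200 train/validation split per fold is consistent with 5-fold cross-validation over 1,000 labeled reports. As an illustration of that protocol only (the dataset size of 1,000, the seed, and the index logic are assumptions, not the authors' code), a stdlib-only sketch of the fold generation:

```python
import random

def kfold_indices(n_samples, k, seed=42):
    """Yield (train, validation) index lists for k-fold cross-validation.

    Each of the k folds serves once as the validation set; the remaining
    folds form the training set. With n_samples=1000 and k=5, every fold
    yields the 800/200 train/validation split described above.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once for reproducibility
    fold_size = n_samples // k
    for i in range(k):
        val = indices[i * fold_size:(i + 1) * fold_size]
        train = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
        yield train, val

# Each fold's F1 would be computed separately, then averaged to report the
# mean cross-validated F1 (0.85 in the study).
for train_idx, val_idx in kfold_indices(1000, 5):
    assert len(train_idx) == 800 and len(val_idx) == 200
    assert not set(train_idx) & set(val_idx)  # folds are disjoint
```

In practice this splitting is typically delegated to `sklearn.model_selection.KFold`; the sketch just makes the per-fold bookkeeping explicit.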
Of the 1,861 scans, 861 cases were deemed eligible for the study after review by the senior radiologist and annotation in a self-hosted Label Studio instance. On the test data, the algorithm achieved a sensitivity of 91% and a specificity of 90%, with an F1 score of 0.89 for the best fold. The algorithm distinguished emergency findings most successfully, although its positive predictive values were unexpectedly lower than in previously reported studies. Beam hardening artifacts and excessive noise, which degrade CT image quality, were significantly associated with decreased model performance.
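The reported sensitivity, specificity, and F1 follow directly from confusion-matrix counts. A minimal sketch of those definitions (the counts below are illustrative, chosen to roughly match the reported 91%/90% figures, and are not the study's data):

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute sensitivity, specificity, and F1 from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)     # positive predictive value
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return sensitivity, specificity, f1

# Illustrative counts only: 100 positives and 100 negatives, with 91 and 90
# correctly identified, giving sensitivity 0.91 and specificity 0.90.
sens, spec, f1 = binary_metrics(tp=91, fp=10, fn=9, tn=90)
```

Note that F1 depends on precision (PPV) as well as sensitivity, which is why a lower-than-expected PPV on emergency findings pulls the F1 down even when sensitivity is high.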
This study revealed decreased diagnostic accuracy of an AI decision support system (DSS) at our institution. Despite extensive evaluation, we were unable to identify the source of this discrepancy, raising concerns about the generalizability of tools whose failure modes remain indeterminate. These results further highlight the need for standardized study design to allow rigorous and reproducible site-to-site comparison of emerging deep learning technologies.