Lyo Shawn, Mohan Suyash, Hassankhani Alvand, Noor Abass, Dako Farouk, Cook Tessa
Department of Radiology, Hospital of the University of Pennsylvania, Philadelphia, PA, USA.
J Imaging Inform Med. 2025 Apr;38(2):1265-1279. doi: 10.1007/s10278-024-01233-4. Epub 2024 Aug 19.
Expert feedback on trainees' preliminary reports is crucial for radiology training, but real-time feedback can be challenging due to non-contemporaneous, remote reading and increasing imaging volumes. Trainee report revisions contain valuable educational feedback, but synthesizing data from raw revisions is challenging. Generative AI models can potentially analyze these revisions and provide structured, actionable feedback. This study used the OpenAI GPT-4 Turbo API to analyze paired preliminary and finalized reports (synthesized cases and open-source analogs), identify discrepancies, categorize their severity and type, and suggest review topics. Expert radiologists reviewed the output, grading the discrepancies, evaluating the accuracy of the assigned severity and category, and rating the relevance of the suggested review topics. The reproducibility of discrepancy detection and of the maximal discrepancy severity was also examined. The model exhibited high sensitivity, detecting significantly more discrepancies than radiologists (W = 19.0, p < 0.001) with a strong positive correlation (r = 0.778, p < 0.001). Interrater reliability for severity and type was fair (Fleiss' kappa = 0.346 and 0.340, respectively; weighted kappa = 0.622 for severity). The LLM achieved weighted F1 scores of 0.66 for severity and 0.64 for type. Generated teaching points were considered relevant in ~85% of cases, and relevance correlated with the maximal discrepancy severity (Spearman ρ = 0.76, p < 0.001). Reproducibility was moderate to good for the number of discrepancies (ICC(2,1) = 0.690) and substantial for maximal discrepancy severity (Fleiss' kappa = 0.718; weighted kappa = 0.94). Generative AI models can effectively identify discrepancies in report revisions and generate relevant educational feedback, offering promise for enhancing radiology training.
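As a rough illustration of the workflow the abstract describes, the sketch below sends a paired preliminary/finalized report to the GPT-4 Turbo API and requests structured discrepancy feedback. The prompt wording, JSON field names, and the analyze_revision helper are assumptions made for illustration only; they are not the authors' actual prompt or code.

    # Minimal sketch, assuming the OpenAI Python SDK (>=1.0) and a GPT-4 Turbo model.
    # Prompt text and output schema are illustrative, not the study's actual design.
    import json
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def analyze_revision(preliminary: str, finalized: str) -> dict:
        """Ask the model to list discrepancies, grade severity/type, and suggest review topics."""
        prompt = (
            "Compare the preliminary and finalized radiology reports below. "
            "Return JSON with a 'discrepancies' list; for each item give 'description', "
            "'severity' (minor/moderate/major), 'type' (e.g. missed finding, "
            "misinterpretation, style), and a suggested 'review_topic'.\n\n"
            f"PRELIMINARY REPORT:\n{preliminary}\n\nFINALIZED REPORT:\n{finalized}"
        )
        response = client.chat.completions.create(
            model="gpt-4-turbo",
            response_format={"type": "json_object"},  # request machine-readable output
            messages=[
                {"role": "system", "content": "You are an attending radiologist giving trainee feedback."},
                {"role": "user", "content": prompt},
            ],
            temperature=0,  # favor reproducible discrepancy detection
        )
        return json.loads(response.choices[0].message.content)

Requesting JSON output and a low temperature makes the discrepancy list easier to score and re-run, which matters for the reproducibility analysis the abstract reports.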
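The agreement statistics reported in the abstract (Fleiss' kappa, weighted kappa, weighted F1, ICC(2,1), Spearman ρ) can be computed with standard Python libraries. The snippet below is a minimal sketch on made-up ratings; the choice of statsmodels, scikit-learn, pingouin, and SciPy is an assumption, as the paper does not state which software was used.

    # Illustrative only: the rating arrays are fabricated placeholders.
    import numpy as np
    import pandas as pd
    import pingouin as pg
    from scipy.stats import spearmanr
    from sklearn.metrics import cohen_kappa_score, f1_score
    from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

    # Hypothetical severity ratings: rows = discrepancies, columns = raters (LLM + two experts)
    ratings = np.array([[2, 2, 1], [0, 1, 0], [2, 2, 2], [1, 1, 2]])

    # Fleiss' kappa across all raters (categorical agreement)
    table, _ = aggregate_raters(ratings)
    print("Fleiss kappa:", fleiss_kappa(table))

    # Quadratic-weighted kappa between the LLM and one expert (ordinal severity)
    print("Weighted kappa:", cohen_kappa_score(ratings[:, 0], ratings[:, 1], weights="quadratic"))

    # Weighted F1 treating the expert rating as ground truth
    print("Weighted F1:", f1_score(ratings[:, 1], ratings[:, 0], average="weighted"))

    # ICC(2,1) for reproducibility of discrepancy counts across repeated model runs
    runs = pd.DataFrame({
        "case": [1, 1, 2, 2, 3, 3],
        "run": ["A", "B", "A", "B", "A", "B"],
        "n_discrepancies": [3, 3, 1, 2, 4, 4],
    })
    print(pg.intraclass_corr(data=runs, targets="case", raters="run", ratings="n_discrepancies"))

    # Spearman correlation between teaching-point relevance and maximal severity
    print("Spearman rho:", spearmanr([1, 0, 1, 1], [2, 0, 2, 2]).correlation)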