Brin Dana, Sorin Vera, Barash Yiftach, Konen Eli, Glicksberg Benjamin S, Nadkarni Girish N, Klang Eyal
Department of Diagnostic Imaging, Chaim Sheba Medical Center, Tel Hashomer, Israel.
Faculty of Medicine, Tel-Aviv University, Tel Aviv-Yafo, Israel.
Eur Radiol. 2025 Apr;35(4):1959-1965. doi: 10.1007/s00330-024-11035-5. Epub 2024 Aug 30.
This study assesses the performance of GPT-4V, a multimodal artificial intelligence (AI) model capable of analyzing both images and text, in interpreting radiological images. It covers a range of modalities, anatomical regions, and pathologies to explore the potential of zero-shot generative AI to enhance diagnostic processes in radiology.
We used GPT-4V to analyze 230 anonymized emergency room diagnostic images, collected consecutively over 1 week. Modalities included ultrasound (US), computed tomography (CT), and X-ray. GPT-4V's interpretations were compared with those of senior radiologists to evaluate its accuracy in recognizing the imaging modality, the anatomical region, and the pathology present in each image.
GPT-4V identified the imaging modality correctly in 100% of cases (221/221), the anatomical region in 87.1% (189/217), and the pathology in 35.2% (76/216). However, performance varied significantly across modalities: anatomical region identification accuracy ranged from 60.9% (39/64) in US images to 97.0% (98/101) in CT and 100% (52/52) in X-ray images (p < 0.001). Similarly, pathology identification ranged from 9.1% (6/66) in US images to 36.4% (36/99) in CT and 66.7% (34/51) in X-ray images (p < 0.001). These variations indicate that GPT-4V's ability to interpret radiological images accurately is inconsistent.
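The percentages reported above can be re-derived from the stated counts, and the per-modality differences can be checked for consistency with the reported p < 0.001. The following is a minimal sketch, not the authors' analysis: the abstract does not name the statistical test used, so a Pearson chi-square test of independence is assumed here, and the function and variable names are illustrative only.

```python
import math

# (correct, total) per modality, copied from the counts reported above.
ANATOMY = {"US": (39, 64), "CT": (98, 101), "X-ray": (52, 52)}
PATHOLOGY = {"US": (6, 66), "CT": (36, 99), "X-ray": (34, 51)}


def accuracy_pct(correct, total):
    """Return accuracy as a percentage rounded to one decimal place."""
    return round(100 * correct / total, 1)


def pooled(counts):
    """Sum the (correct, total) pairs across all modalities."""
    correct = sum(c for c, _ in counts.values())
    total = sum(n for _, n in counts.values())
    return correct, total


def chi_square_test(counts):
    """Pearson chi-square test on a 2 x k table (correct/incorrect by modality).

    ASSUMPTION: the abstract does not state which test produced its
    p-values; chi-square is used here purely as an illustration.
    Returns (statistic, degrees_of_freedom, p_value).
    """
    correct, total = pooled(counts)
    incorrect = total - correct
    stat = 0.0
    for c, n in counts.values():
        for observed, column_sum in ((c, correct), (n - c, incorrect)):
            expected = n * column_sum / total
            stat += (observed - expected) ** 2 / expected
    df = len(counts) - 1
    if df != 2:
        raise ValueError("closed-form p-value below assumes exactly 3 groups")
    # For 2 degrees of freedom the chi-square survival function is exp(-x/2).
    return stat, df, math.exp(-stat / 2)


if __name__ == "__main__":
    for label, counts in (("anatomy", ANATOMY), ("pathology", PATHOLOGY)):
        for modality, (c, n) in counts.items():
            print(f"{label:9s} {modality:5s} {accuracy_pct(c, n)}% ({c}/{n})")
        c, n = pooled(counts)
        stat, df, p = chi_square_test(counts)
        print(f"{label:9s} all   {accuracy_pct(c, n)}% ({c}/{n}), "
              f"chi2={stat:.1f}, df={df}, p={p:.2g}")
```

Under this assumed test, both comparisons yield p-values well below 0.001, in line with the reported results. The pooled counts also reproduce the overall figures of 189/217 and 76/216.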
While the integration of AI in radiology, exemplified by multimodal GPT-4, offers promising avenues for diagnostic enhancement, GPT-4V is not yet reliable for interpreting radiological images. This study underscores the need for ongoing development to achieve dependable performance in radiology diagnostics.
Although GPT-4V shows promise in radiological image interpretation, its high diagnostic hallucination rate (> 40%) indicates it cannot be trusted for clinical use as a standalone tool. Improvements are necessary to enhance its reliability and ensure patient safety.
GPT-4V's capability in analyzing images offers new clinical possibilities in radiology. GPT-4V excels in identifying imaging modalities but demonstrates inconsistent anatomy and pathology detection. Ongoing AI advancements are necessary to enhance diagnostic reliability in radiological applications.