Sanli Deniz Esin Tekcan, Sanli Ahmet Necati, Buyukdereli Atadag Yildiz, Kurt Atakan, Esmerer Emel
Department of Radiology, Faculty of Medicine, Gaziantep University, Gaziantep, Turkey.
Department of General Surgery, Abdulkadir Yuksel State Hospital, Gaziantep, Turkey.
J Ultrasound Med. 2025 Jun 23. doi: 10.1002/jum.16749.
This study aimed to evaluate the ability of ChatGPT and Breast Ultrasound Helper, a specialized ChatGPT-based subprogram trained for ultrasound image analysis, to analyze and differentiate benign and malignant breast lesions on ultrasound images.
Ultrasound images from patients with histopathologically confirmed breast cancer or fibroadenoma were read by GPT-4o (the latest ChatGPT version) and by Breast Ultrasound Helper (BUH), a tool from the "Explore" section of ChatGPT. Both were prompted in English to report the ACR BI-RADS Breast Ultrasound Lexicon criteria: lesion shape, orientation, margin, internal echo pattern, echogenicity, posterior acoustic features, microcalcifications or hyperechoic foci, perilesional hyperechoic rim, edema or architectural distortion, lesion size, and BI-RADS category. Two experienced radiologists evaluated the images and the programs' responses in consensus. The outputs, BI-RADS category agreement, and benign/malignant discrimination were compared statistically.
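The abstract does not publish the authors' exact prompt or pipeline. A minimal sketch of how such a structured BI-RADS query might be issued to GPT-4o through the OpenAI Python SDK is shown below; the prompt wording, file name, and output handling are illustrative assumptions, not the study's method.

```python
# Minimal sketch (not the authors' pipeline): ask GPT-4o to describe one
# ultrasound image using the ACR BI-RADS Breast Ultrasound Lexicon features
# listed in the abstract. Prompt text and file path are assumptions.
import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

FEATURES = (
    "lesion shape, orientation, margin, internal echo pattern, echogenicity, "
    "posterior acoustic features, microcalcifications or hyperechoic foci, "
    "perilesional hyperechoic rim, edema or architectural distortion, "
    "lesion size, and BI-RADS category"
)

def describe_lesion(image_path: str) -> str:
    """Send one ultrasound image and request a structured BI-RADS description."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe the breast lesion in this ultrasound image "
                         f"using the ACR BI-RADS lexicon: {FEATURES}."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

print(describe_lesion("lesion_001.png"))  # hypothetical file name
```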
A total of 232 ultrasound images were analyzed, of which 133 (57.3%) were malignant and 99 (42.7%) benign. In the comparative analysis, BUH showed superior performance overall, with higher kappa values and statistically significant results across multiple features (P < .001). However, the overall level of agreement with the radiologists' consensus across all features was similar for BUH (κ: 0.387-0.755) and GPT-4o (κ: 0.317-0.803). BI-RADS category agreement was slightly higher for GPT-4o than for BUH (69.4% versus 65.9%), whereas BUH was slightly more successful at distinguishing benign from malignant lesions (65.9% for GPT-4o versus 67.7% for BUH).
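Agreement here is reported as Cohen's kappa plus percent agreement. A minimal sketch of how such per-feature agreement between a model's outputs and the radiologists' consensus could be computed is shown below; the label vectors are fabricated placeholders, not study data.

```python
# Minimal sketch of the agreement statistics named in the abstract
# (Cohen's kappa and percent agreement) for one lexicon feature.
# The labels below are fabricated placeholders, not data from the study.
from sklearn.metrics import cohen_kappa_score

# Radiologists' consensus vs. model output, one entry per image
# (e.g., margin: "circumscribed" vs. "not circumscribed").
consensus = ["circumscribed", "not", "not", "circumscribed", "not"]
model_out = ["circumscribed", "not", "circumscribed", "circumscribed", "not"]

kappa = cohen_kappa_score(consensus, model_out)
percent_agreement = sum(a == b for a, b in zip(consensus, model_out)) / len(consensus)

print(f"kappa = {kappa:.3f}, agreement = {percent_agreement:.1%}")
```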
Although both AI tools show moderate-to-good performance in ultrasound image analysis, their limited agreement with radiologists' evaluations and BI-RADS categorization suggests that their clinical application in breast ultrasound interpretation remains premature and unreliable.