
A Multimodal Large Language Model as an End-to-End Classifier of Thyroid Nodule Malignancy Risk: Usability Study.

Author Information

Sng Gerald Gui Ren, Xiang Yi, Lim Daniel Yan Zheng, Tung Joshua Yi Min, Tan Jen Hong, Chng Chiaw Ling

Affiliations

Department of Endocrinology, Singapore General Hospital, 20 College Road, Academia Level 3, Singapore 169856, Singapore (phone: 65 63214377).

Data Science and Artificial Intelligence Laboratory, Singapore General Hospital, Singapore, Singapore.

Publication Information

JMIR Form Res. 2025 Aug 19;9:e70863. doi: 10.2196/70863.

Abstract

BACKGROUND

Thyroid nodules are common, with ultrasound imaging as the primary modality for their assessment. Risk stratification systems like the American College of Radiology Thyroid Imaging Reporting and Data System (ACR TI-RADS) have been developed but suffer from interobserver variability and low specificity. Artificial intelligence, particularly large language models (LLMs) with multimodal capabilities, presents opportunities for efficient end-to-end diagnostic processes. However, their clinical utility remains uncertain.

OBJECTIVE

This study evaluates the accuracy and consistency of multimodal LLMs for thyroid nodule risk stratification using the ACR TI-RADS system, examining the effects of model fine-tuning, image annotation, prompt engineering, and comparing open-source versus commercial models.

METHODS

In total, 3 multimodal vision-language models were evaluated: Microsoft's open-source Large Language and Vision Assistant (LLaVA) model, its medically fine-tuned variant (Large Language and Vision Assistant for bioMedicine [LLaVA-Med]), and OpenAI's commercial o3 model. A total of 192 thyroid nodules from publicly available ultrasound image datasets were assessed. Each model was evaluated using 2 prompts (basic and modified) and 2 image scenarios (unlabeled vs radiologist-annotated), yielding 6912 responses. Model outputs were compared with expert ratings for accuracy and consistency. Statistical comparisons included chi-square tests, Mann-Whitney U tests, and Fleiss' kappa for interrater reliability.
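The interrater-reliability statistic named above, Fleiss' kappa, can be sketched in a few lines of plain Python. The function below implements the standard formula; the toy rating matrix is purely illustrative (hypothetical nodules and categories), not the study's actual data.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a subjects-by-categories matrix of rating counts.

    counts[i][j] = number of raters assigning subject i to category j.
    Every row must sum to the same number of raters.
    """
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    total = n_subjects * n_raters

    # Per-subject agreement P_i, then the mean observed agreement P_bar.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_subjects

    # Chance agreement P_e from the overall category proportions p_j.
    p_j = [sum(row[j] for row in counts) / total
           for j in range(len(counts[0]))]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)

# Toy example: 3 repeated runs (raters) assigning 4 hypothetical nodules
# to two TI-RADS bins. Perfect agreement on every nodule gives kappa = 1;
# values below 0.60, as reported in this study, indicate poor reliability.
perfect = [[3, 0], [0, 3], [3, 0], [0, 3]]
print(fleiss_kappa(perfect))  # 1.0
```

In the study's design each model-prompt-image combination was rated repeatedly, so the repeated runs play the role of raters in this formula.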

RESULTS

Overall, 88.4% (6110/6912) of responses were valid, with the o3 model producing the highest validity rate (2273/2304, 98.6%), followed by LLaVA (2108/2304, 91.5%) and LLaVA-Med (1729/2304, 75.0%; P<.001). The o3 model demonstrated the highest accuracy overall, achieving up to 57.3% accuracy in Thyroid Imaging Reporting and Data System (TI-RADS) classification, although this remained suboptimal. Labeled images marginally improved accuracy in nodule margin assessment, and only for the LLaVA models (407/768, 53.0% vs 447/768, 58.2%; P=.04). Prompt engineering improved accuracy for composition (649/1152, 56.3% vs 483/1152, 41.9%; P<.001) but significantly reduced accuracy for shape, margins, and overall classification. Consistency was highest with the o3 model (up to 85.4%), was comparable for LLaVA, and improved significantly with image labeling and modified prompts across multiple TI-RADS categories (P<.001). Subgroup analysis of o3 alone showed that prompt engineering did not significantly affect accuracy but markedly improved consistency across all TI-RADS categories (up to 97.1% for shape; P<.001). Interrater reliability was consistently poor across all combinations (Fleiss' kappa<0.60).
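The validity comparison reported above (chi-square, P<.001) can be checked from the counts given in the abstract. The sketch below computes the Pearson chi-square statistic for the 3x2 model-by-validity table in plain Python (no SciPy), so the P value is only bounded against the df=2 critical value rather than computed exactly.

```python
# Valid vs invalid response counts per model, as reported in the abstract.
table = {
    "o3":        (2273, 2304 - 2273),
    "LLaVA":     (2108, 2304 - 2108),
    "LLaVA-Med": (1729, 2304 - 1729),
}

grand = sum(v + i for v, i in table.values())  # 6912 responses in total
col_totals = [sum(row[k] for row in table.values()) for k in range(2)]

# Pearson chi-square: sum of (observed - expected)^2 / expected over cells.
chi2 = 0.0
for valid, invalid in table.values():
    row_total = valid + invalid
    for k, observed in enumerate((valid, invalid)):
        expected = row_total * col_totals[k] / grand
        chi2 += (observed - expected) ** 2 / expected

print(round(chi2, 1))
# The statistic far exceeds 13.82, the critical value for df = 2 at
# alpha = .001, consistent with the reported P<.001.
```

With 3 models and 2 outcomes the test has (3-1)(2-1) = 2 degrees of freedom, which is where the 13.82 threshold comes from.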

CONCLUSIONS

The study demonstrates the comparative advantages and limitations of multimodal LLMs for thyroid nodule risk stratification. While the commercial model (o3) consistently outperformed open-source models in accuracy and consistency, even the best-performing model outputs remained suboptimal for direct clinical deployment. Prompt engineering significantly enhanced output consistency, particularly in the commercial model. These findings underline the importance of strategic model optimization techniques and highlight areas requiring further development before multimodal LLMs can be reliably used in clinical thyroid imaging workflows.


Graphical abstract: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8689/12364431/c85cb262626e/formative-v9-e70863-g001.jpg


