Matute-González Mario, Darnell Anna, Comas-Cufí Marc, Pazó Javier, Soler Alexandre, Saborido Belén, Mauro Ezequiel, Turnes Juan, Forner Alejandro, Reig María, Rimola Jordi
BCLC Group, Radiology Department, Hospital Clínic of Barcelona, IDIBAPS, Barcelona, Spain.
Computer Science, Applied Mathematics and Statistics Department, University of Girona, Girona, Spain.
Insights Imaging. 2024 Nov 22;15(1):280. doi: 10.1186/s13244-024-01850-1.
To develop a domain-specific large language model (LLM) for LI-RADS v2018 categorization of hepatic observations based on free-text descriptions extracted from MRI reports.
This retrospective study included 291 small liver observations, divided into training (n = 141), validation (n = 30), and test (n = 120) datasets. Of these, 120 were fictitious, and 171 were extracted from 175 MRI reports from a single institution. The algorithm's performance was compared with that of two independent radiologists and one hepatologist in a human replacement scenario and under two combined strategies (double reading with arbitration, and triage). Agreement on LI-RADS category and on dichotomized malignancy (LR-4, LR-5, and LR-M) was estimated using linear-weighted κ and Cohen's κ, respectively. Sensitivity and specificity for LR-5 were calculated. The consensus of three other radiologists served as the ground truth.
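As a rough illustration of the agreement metrics named above, the following minimal Python sketch computes Cohen's κ, linear-weighted κ, and sensitivity and specificity for one target category. It is not the authors' analysis code: the function names, the toy labels, and in particular the ordinal ordering of the categories (including where LR-M is placed) are assumptions made only for this example.

```python
def kappa(rater_a, rater_b, categories, linear_weights=False):
    """Cohen's kappa, or linear-weighted kappa when linear_weights=True.

    `categories` must be listed in ordinal order; that ordering is an
    assumption of this sketch, not something taken from the study.
    """
    k = len(categories)
    n = len(rater_a)
    index = {c: i for i, c in enumerate(categories)}

    # Observed joint proportions of (rater_a, rater_b) ratings.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1.0 / n

    # Marginal proportions for each rater.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def disagreement(i, j):
        if linear_weights:
            return abs(i - j) / (k - 1)   # penalty grows with category distance
        return 0.0 if i == j else 1.0     # unweighted: any mismatch counts fully

    d_obs = sum(disagreement(i, j) * observed[i][j]
                for i in range(k) for j in range(k))
    d_exp = sum(disagreement(i, j) * row[i] * col[j]
                for i in range(k) for j in range(k))
    if d_exp == 0:
        return float("nan")  # undefined when chance disagreement is zero
    return 1.0 - d_obs / d_exp


def sensitivity_specificity(truth, predicted, target):
    """Sensitivity and specificity for `target` versus all other categories."""
    pairs = list(zip(truth, predicted))
    tp = sum(t == target and p == target for t, p in pairs)
    fn = sum(t == target and p != target for t, p in pairs)
    tn = sum(t != target and p != target for t, p in pairs)
    fp = sum(t != target and p == target for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)


# Toy data only; these labels are invented for illustration.
ORDER = ["LR-3", "LR-4", "LR-5"]  # assumed ordinal order
truth = ["LR-5", "LR-5", "LR-3", "LR-4"]
model = ["LR-5", "LR-4", "LR-3", "LR-4"]
weighted = kappa(truth, model, ORDER, linear_weights=True)
sens, spec = sensitivity_specificity(truth, model, "LR-5")
```

For the dichotomized malignancy analysis, the same `kappa` function could be called without weights after mapping each category to a malignant or non-malignant label.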
The model showed moderate agreement with the ground truth for both LI-RADS categorization (κ = 0.54 [95% CI: 0.42-0.65]) and the dichotomized approach (κ = 0.58 [95% CI: 0.42-0.73]). Sensitivity and specificity for LR-5 were 0.76 (95% CI: 0.69-0.86) and 0.96 (95% CI: 0.91-1.00), respectively. When the chatbot was used as a triage tool, performance improved for LI-RADS categorization (κ = 0.86/0.87 for the two independent radiologists and κ = 0.76 for the hepatologist), dichotomized malignancy (κ = 0.94/0.91 and κ = 0.87), and LR-5 identification (sensitivity 1.00/0.98 and 0.85; specificity 0.96/0.92 and 0.92), with no statistically significant difference from the human readers' individual performance. With this strategy, the workload decreased by 45%.
LI-RADS v2018 categorization from unlabelled MRI reports is feasible using our LLM, and it enhances the efficiency of data curation.
Our proof-of-concept study provides novel insights into the potential applications of LLMs, offering a real-world example of how these tools could be integrated into a local workflow to optimize data curation for research purposes.
Automatic LI-RADS categorization from free-text reports would benefit workflow and data mining. LiverAI, a GPT-4-based model, supported several strategies that improved data curation efficiency by up to 60%. LLMs can be integrated into workflows, significantly reducing radiologists' workload.