
A comparative analysis of privacy-preserving large language models for automated echocardiography report analysis.

Author information

Mahmoudi Elham, Vahdati Sanaz, Chao Chieh-Ju, Khosravi Bardia, Misra Ajay, Lopez-Jimenez Francisco, Erickson Bradley J

Affiliations

Department of Radiology, Radiology Informatics Lab, Mayo Clinic, Rochester, MN 55905, United States.

Department of Cardiovascular Medicine, Mayo Clinic Rochester, Rochester, MN 55905, United States.

Publication information

J Am Med Inform Assoc. 2025 Jul 1;32(7):1120-1129. doi: 10.1093/jamia/ocaf056.

Abstract

BACKGROUND

Automated data extraction from echocardiography reports could facilitate large-scale registry creation and clinical surveillance of valvular heart diseases (VHD). We evaluated the performance of open-source large language models (LLMs) guided by prompt instructions and chain of thought (CoT) for this task.

METHODS

From consecutive transthoracic echocardiograms performed in our center, we used 200 random reports from 2019 for prompt optimization and 1000 from 2023 for evaluation. Five instruction-tuned LLMs (Qwen2.0-72B, Llama3.0-70B, Mixtral8-46.7B, Llama3.0-8B, and Phi3.0-3.8B) were guided by prompt instructions, with and without CoT, to classify prosthetic valve presence and VHD severity. Performance was evaluated using classification metrics against expert-labeled ground truth. Mean squared error (MSE) was also calculated to quantify the deviation of predicted severity from actual severity.
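The two evaluation metrics above can be sketched as follows. This is a minimal illustration, not the study's evaluation code; the ordinal mapping of severity labels (none/mild/moderate/severe → 0–3) is an assumption for computing MSE on categorical predictions.

```python
# Hypothetical ordinal encoding of VHD severity labels (assumed, not
# taken from the paper) so that MSE can penalize larger misgradings more.
SEVERITY = {"none": 0, "mild": 1, "moderate": 2, "severe": 3}

def evaluate(predicted, actual):
    """Return (accuracy, MSE) for a list of severity predictions."""
    assert len(predicted) == len(actual) and actual
    correct = sum(p == a for p, a in zip(predicted, actual))
    mse = sum((SEVERITY[p] - SEVERITY[a]) ** 2
              for p, a in zip(predicted, actual)) / len(actual)
    return correct / len(actual), mse

acc, mse = evaluate(["mild", "severe", "mild"],
                    ["mild", "moderate", "mild"])
# One off-by-one error out of three reports: acc = 2/3, mse = 1/3
```

Under this encoding, a "severe" prediction for a "mild" case contributes 4 to the squared error, so MSE complements accuracy by distinguishing near-misses from gross misgradings.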

RESULTS

With CoT prompting, Llama3.0-70B and Qwen2.0 achieved the highest performance (accuracy: 99.1% and 98.9% for VHD severity; 100% and 99.9% for prosthetic valve; MSE: 0.02 and 0.05, respectively). Smaller models showed lower accuracy for VHD severity (54.1%-85.9%) but maintained high accuracy for prosthetic valve detection (>96%). Chain of thought reasoning yielded higher accuracy for larger models while increasing processing time from 2-25 to 67-154 seconds per report. Based on the CoT reasoning traces, incorrect predictions were mainly due to model outputs being influenced by irrelevant information in the text or by failure to follow the prompt instructions.
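The with/without-CoT comparison above amounts to two prompt templates differing only in whether the model is asked to reason before answering. A minimal sketch, assuming hypothetical instruction wording (the study's actual prompts are not reproduced in the abstract):

```python
def build_prompt(report: str, use_cot: bool) -> str:
    """Build a classification prompt for an echocardiography report,
    optionally adding a chain-of-thought instruction (wording assumed)."""
    task = ("Classify the severity of each valvular lesion "
            "(none/mild/moderate/severe) and state whether a "
            "prosthetic valve is present.")
    if use_cot:
        # CoT variant: ask for quoted evidence and step-by-step reasoning
        # before the final answer, at the cost of longer generation time.
        task += (" First quote the relevant sentences from the report and "
                 "explain your reasoning step by step, then give the "
                 "final answer.")
    return f"{task}\n\nReport:\n{report}\n\nAnswer:"
```

The reported latency gap (2-25 s without CoT vs 67-154 s with it) follows directly from this design: the CoT variant makes the model generate the intermediate reasoning tokens as well as the answer.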

CONCLUSIONS

Our study demonstrates the near-perfect performance of open-source LLMs for automated echocardiography report interpretation with the purpose of registry formation and disease surveillance. While larger models achieved exceptional accuracy through prompt optimization, practical implementation requires balancing performance with computational efficiency.

