The use of large language models in detecting Chinese ultrasound report errors.

Authors

Yan Yuqi, Wang Kai, Feng Bojian, Yao Jincao, Jiang Tian, Jin Zhiyan, Zheng Yin, Zhou Yahan, Chen Chen, Sui Lin, Chen Xiayi, Du Yanhong, Yang Jie, Pan Qianmeng, Zhou Lingyan, Wang Vicky Yang, Liang Ping, Xu Dong

Affiliations

Department of Diagnostic Ultrasound Imaging & Interventional Therapy, Zhejiang Cancer Hospital, Hangzhou, Zhejiang, China.

Center of Intelligent Diagnosis and Therapy (Taizhou), Hangzhou Institute of Medicine (HIM), Chinese Academy of Sciences, Taizhou, Zhejiang, China.

Publication information

NPJ Digit Med. 2025 Jan 28;8(1):66. doi: 10.1038/s41746-025-01468-7.

Abstract

This retrospective study evaluated the efficacy of large language models (LLMs) in improving the accuracy of Chinese ultrasound reports. Data from three hospitals (January to April 2024), comprising 400 reports with 243 errors across six categories, were analyzed. Three GPT versions and Claude 3.5 Sonnet were tested in zero-shot settings, with the top two models further assessed in few-shot scenarios. Six radiologists of varying experience levels performed error detection on a randomly selected test set. In the zero-shot setting, Claude 3.5 Sonnet and GPT-4o achieved the highest error detection rates (52.3% and 41.2%, respectively). In the few-shot setting, Claude 3.5 Sonnet outperformed senior and resident radiologists, while GPT-4o excelled in spelling error detection. LLMs processed reports faster than the quickest radiologist (Claude 3.5 Sonnet: 13.2 s, GPT-4o: 15.0 s, radiologist: 42.0 s per report). This study demonstrates the potential of LLMs to enhance ultrasound report accuracy, outperforming human experts in certain aspects.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6329/11775253/4d7eb710e267/41746_2025_1468_Fig1_HTML.jpg
