

Comparative analysis of accuracy and completeness in standardized database generation for complex multilingual lung cancer pathological reports: large language model-based assisted diagnosis system vs. DeepSeek, GPT-3.5, and healthcare professionals with varied professional titles, with task load variation assessment among medical staff.

Author information

Hang Hao, Yang Liankai, Wang Zhongjie, Lin Zhebing, Li Pengchong, Zhu Jiayue, Liu Rang, Pu Shuai, Cheng Xinghua

Affiliations

Graduate School, Bengbu Medical University, Bengbu, Anhui, China.

Department of Oncology, Shanghai Lung Cancer Center, Shanghai Chest Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China.

Publication information

Front Med (Lausanne). 2025 Aug 22;12:1618858. doi: 10.3389/fmed.2025.1618858. eCollection 2025.

Abstract

BACKGROUND

This study evaluates how AI enhances EHR efficiency by comparing a lung cancer-specific LLM with general-purpose models (DeepSeek, GPT-3.5) and clinicians across expertise levels, assessing accuracy and completeness in complex lung cancer pathology documentation and task-load changes before and after AI implementation.

METHODS

This study analyzed 300 lung cancer cases (Shanghai Chest Hospital) and 60 TCGA cases, split into training/validation/test sets. Ten clinicians (varying expertise) and three AI models (GPT-3.5, DeepSeek, lung cancer-specific LLM) generated pathology reports. Accuracy/completeness were evaluated against LeapFrog/Joint Commission/ACS standards (non-parametric tests); task-load changes pre/post-AI implementation were assessed via NASA-TLX (paired t-tests, p < 0.05).
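The paired t-test used for the NASA-TLX pre/post comparison can be sketched in pure Python. The scores below are hypothetical placeholders, not the study's data; only the test statistic's formula is taken as given.

```python
import math

# Hypothetical NASA-TLX total scores for 10 staff members, measured
# before and after AI implementation (illustrative values only).
pre  = [420, 405, 398, 450, 412, 430, 395, 440, 408, 415]
post = [260, 250, 245, 275, 255, 268, 240, 270, 252, 258]

# Paired t-test: work with the per-subject differences.
diffs = [a - b for a, b in zip(pre, post)]
n = len(diffs)
mean_d = sum(diffs) / n
# Sample standard deviation of the differences (n - 1 denominator).
sd_d = math.sqrt(sum((d - mean_d) ** 2 for d in diffs) / (n - 1))
# t statistic with df = n - 1; compare against the critical value
# for the chosen alpha (0.05 in the study).
t_stat = mean_d / (sd_d / math.sqrt(n))
print(f"mean difference = {mean_d:.1f}, t = {t_stat:.2f}, df = {n - 1}")
```

In practice `scipy.stats.ttest_rel(pre, post)` performs the same computation and also returns the p-value.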

RESULTS

This study analyzed 1,390 structured pathology databases: 1,300 from 100 Chinese cases (generated by 10 clinicians and three LLMs) and 90 from 30 TCGA English reports. The lung cancer-specific LLM outperformed nurses, residents, interns, and general AI models (DeepSeek, GPT-3.5) in lesion/lymph node analysis and pathology extraction for Chinese records (p < 0.05), with total scores slightly below chief physicians. In English reports, it matched mainstream AI in lesion analysis (p > 0.05) but excelled in lymph node/pathology metrics (p < 0.05). Task load scores decreased by 38.3% post-implementation (413.90 ± 78.09 vs. 255.30 ± 65.50, t = 26.481, p < 0.001).
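The 38.3% reduction follows directly from the two reported group means; a quick arithmetic check:

```python
# Reported NASA-TLX mean totals before and after AI implementation.
pre_mean, post_mean = 413.90, 255.30
pct_drop = (pre_mean - post_mean) / pre_mean * 100
print(round(pct_drop, 1))  # → 38.3
```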

CONCLUSION

The fine-tuned lung cancer LLM outperformed non-chief physicians and general LLMs in accuracy/completeness and significantly reduced medical staff workload (p < 0.001); despite current limitations, it retains potential for further optimization.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/61e7/12411430/ec25ca9d73f8/fmed-12-1618858-g001.jpg
