Train-Time and Test-Time Computation in Large Language Models for Error Detection and Correction in Electronic Medical Records: A Retrospective Study.

Authors

Cai Qiong, Yang Lanting, Xiao Jiangping, Ma Jiale, Liu Molei, Pan Xilong

Affiliations

Department of Social Medicine and Health Education, School of Public Health, Peking University, Beijing 100191, China.

Department of Biostatistics, Peking University Health Science Center, Beijing 100191, China.

Publication

Diagnostics (Basel). 2025 Jul 21;15(14):1829. doi: 10.3390/diagnostics15141829.

DOI:10.3390/diagnostics15141829
PMID:40722578
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12293163/
Abstract

This study examines the effectiveness of train-time computation, test-time computation, and their combination on the performance of large language modeling applied to an electronic medical record quality management system. It identifies the most effective combination of models to enhance clinical documentation performance and efficiency. A total of 597 clinical medical records were selected from the MEDEC-MS dataset, 10 of which were used for prompt engineering to guide model training. Eight large language models were employed for training, focusing on train-time computation and test-time computation. Model performance on specific error types was assessed using precision, recall, F1 score, and error correction accuracy. The dataset was divided into training and testing sets in a 7:3 ratio. The assembly model was created using binary logistic regression for assembly analysis of the top-performing models. Its performance was evaluated using area under the curve values and model weights. GPT-4 and Deepseek R1 demonstrated higher overall accuracy in detecting errors. Models that focus on train-time computation exhibited shorter reasoning times and stricter error detection, while models emphasizing test-time computation achieved higher error correction accuracy. The GPT-4 model was particularly effective in addressing issues related to causal organisms, management, and pharmacotherapy, whereas models focusing on test-time computation performed better in tasks involving diagnosis and treatment. The assembly model, focusing on both train-time computation and test-time computation, outperformed any single large language model (Assembly model accuracy: 0.690 vs. GPT-4 accuracy: 0.477). Models focusing on train-time computation demonstrated greater efficiency in processing speed, while models focusing on test-time computation showed higher accuracy and interpretability in identifying and detecting quality issues in electronic medical records. 
Assembling the train-time and test-time computation strategies may strike a balance between high accuracy and model efficiency, thereby enhancing the development of electronic medical records and improving medical care.
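
The evaluation pipeline named in the abstract (precision, recall, and F1; a 7:3 train/test split; a binary logistic regression that combines the top models; area under the curve) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the data are synthetic, the three "detectors" and their accuracies are hypothetical, and the plain gradient-descent fit stands in for whatever statistical software the authors actually used.

```python
import math
import random

def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for binary labels (1 = record contains an error)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def auc(y_true, scores):
    """Area under the ROC curve by pairwise comparison (ties count one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs both classes")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=500):
    """Binary logistic regression by batch gradient descent; returns (weights, bias)."""
    w, b = [0.0] * len(X[0]), 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(X, w, b):
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]

# Synthetic stand-in data (NOT from the study): each row holds the 0/1 error
# flags raised by three hypothetical detectors for one record.
rng = random.Random(0)
y = [rng.randint(0, 1) for _ in range(200)]
accuracies = [0.7, 0.65, 0.6]  # assumed per-detector accuracies
X = [[t if rng.random() < a else 1 - t for a in accuracies] for t in y]

# 7:3 train/test split, the ratio stated in the abstract.
cut = len(y) * 7 // 10
X_train, X_test, y_train, y_test = X[:cut], X[cut:], y[:cut], y[cut:]

# Fit the combiner on the training part, then score the held-out part.
w, b = fit_logistic(X_train, y_train)
probs = predict_proba(X_test, w, b)
preds = [1 if p >= 0.5 else 0 for p in probs]

print("model weights:", [round(v, 3) for v in w])
print("test AUC:", round(auc(y_test, probs), 3))
print("precision, recall, F1:", precision_recall_f1(y_test, preds))
```

In this sketch the fitted coefficients play the role of the "model weights" mentioned in the abstract: a larger coefficient means the combiner relies more on that detector's flag.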

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f90f/12293163/48fa7457c35f/diagnostics-15-01829-g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f90f/12293163/afb8bc4bd873/diagnostics-15-01829-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f90f/12293163/b9d01469047d/diagnostics-15-01829-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f90f/12293163/cdfb7c12baae/diagnostics-15-01829-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f90f/12293163/eda8b36bc43c/diagnostics-15-01829-g004.jpg
