Cai Qiong, Yang Lanting, Xiao Jiangping, Ma Jiale, Liu Molei, Pan Xilong
Department of Social Medicine and Health Education, School of Public Health, Peking University, Beijing 100191, China.
Department of Biostatistics, Peking University Health Science Center, Beijing 100191, China.
Diagnostics (Basel). 2025 Jul 21;15(14):1829. doi: 10.3390/diagnostics15141829.
This study examines the effectiveness of train-time computation, test-time computation, and their combination on the performance of large language models applied to an electronic medical record quality management system. It identifies the most effective combination of models to enhance clinical documentation performance and efficiency. A total of 597 clinical medical records were selected from the MEDEC-MS dataset, 10 of which were used for prompt engineering to guide model training. Eight large language models were employed for training, focusing on train-time computation and test-time computation. Model performance on specific error types was assessed using precision, recall, F1 score, and error correction accuracy. The dataset was divided into training and testing sets in a 7:3 ratio. The assembly model was built by applying binary logistic regression to the outputs of the top-performing models. Its performance was evaluated using area under the curve values and model weights. GPT-4 and DeepSeek R1 demonstrated higher overall accuracy in detecting errors. Models that focus on train-time computation exhibited shorter reasoning times and stricter error detection, while models emphasizing test-time computation achieved higher error correction accuracy. The GPT-4 model was particularly effective in addressing issues related to causal organisms, management, and pharmacotherapy, whereas models focusing on test-time computation performed better in tasks involving diagnosis and treatment. The assembly model, combining both train-time computation and test-time computation, outperformed any single large language model (assembly model accuracy: 0.690 vs. GPT-4 accuracy: 0.477). Models focusing on train-time computation demonstrated greater efficiency in processing speed, while models focusing on test-time computation showed higher accuracy and interpretability in identifying and detecting quality issues in electronic medical records.
Assembling the train-time and test-time computation strategies may strike a balance between high accuracy and model efficiency, thereby enhancing the development of electronic medical records and improving medical care.
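The "assembly model" described above is a form of stacking: the binary error-detection outputs of the base language models become features for a logistic regression, whose learned weights indicate each model's contribution. The following is a minimal pure-Python sketch of that idea; the two-model setup, toy labels, and hyperparameters are invented for illustration and do not reproduce the paper's eight-model pipeline or the MEDEC-MS data.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Plain stochastic-gradient-descent logistic regression.

    X: rows of binary error flags emitted by the base models.
    y: 1 if the record truly contains an error, else 0.
    Returns (weights, bias); larger weights mean the stacker
    trusts that base model more.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi
            for j in range(len(w)):
                w[j] -= lr * err * xi[j]
            b -= lr * err
    return w, b

def predict(w, b, xi):
    """Assembly-model decision: 1 = error detected."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5 else 0

# Toy data: each row = [model A flag, model B flag]; here model A
# happens to match the ground truth, so the stacker should learn
# to weight it more heavily than model B.
X = [[1, 1], [1, 0], [0, 1], [0, 0], [1, 1], [0, 0], [1, 0], [0, 1]]
y = [1, 1, 0, 0, 1, 0, 1, 0]

w, b = fit_logistic(X, y)
accuracy = sum(predict(w, b, xi) == yi for xi, yi in zip(X, y)) / len(y)
```

In practice the study fit the stacker on the 70% training split and reported AUC and model weights on the held-out 30%; a production version would use an established implementation (e.g. scikit-learn's `LogisticRegression`) rather than hand-rolled gradient descent.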