Evaluation of the Performance of a Large Language Model to Extract Signs and Symptoms from Clinical Notes.

Author information

Reategui-Rivera C Mahony, Finkelstein Joseph

Affiliations

Department of Biomedical Informatics, School of Medicine, University of Utah, Salt Lake City, Utah.

Publication information

Stud Health Technol Inform. 2025 Apr 8;323:71-75. doi: 10.3233/SHTI250051.

DOI:10.3233/SHTI250051

PMID:40200448

Abstract

Large language models (LLMs) have increasingly been used to extract critical information from unstructured clinical notes, which often include important details not captured in the structured sections of electronic health records (EHRs). This study assesses the performance of the GPT-4o LLM in extracting signs and symptoms (S&S) from clinical notes, focusing on both general and organ-specific (urological and cardiorespiratory) contexts. Clinical notes from the MTSamples corpora were manually annotated to serve as the reference against which the LLM's S&S extraction results were compared. GPT-4o was applied to extract S&S using named entity recognition techniques. Key performance metrics (precision, recall, and F1-score) were used to evaluate and compare general and organ-specific results. The model showed high precision in general S&S extraction (78%) and achieved the highest precision for organ-specific tasks in the cardiorespiratory dataset (87%). For the urinary dataset, precision was also strong (81%), with balanced recall and F1-scores across analyses. These findings underscore GPT-4o's effectiveness in both general and domain-specific S&S extraction but highlight the need for domain-specific tuning and optimization to further improve recall and generalizability in specialized medical contexts.
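The evaluation described in the abstract compares model-extracted S&S against manual annotations using precision, recall, and F1-score. The sketch below illustrates how these three metrics could be computed for a single note. It is not the authors' code: the abstract does not state the matching rule, so exact matching after lowercasing and whitespace normalization is an assumption, the function names `normalize` and `score_extraction` are hypothetical, and the example terms are invented for illustration rather than taken from the study.

```python
from typing import Dict, Iterable


def normalize(term: str) -> str:
    """Lowercase a term and collapse whitespace (assumed matching rule)."""
    return " ".join(term.lower().split())


def score_extraction(gold: Iterable[str], predicted: Iterable[str]) -> Dict[str, float]:
    """Compute precision, recall, and F1 for one set of extracted terms.

    gold: manually annotated S&S terms (the reference).
    predicted: S&S terms returned by the model.
    Matching is exact after normalization, which is an assumption here.
    """
    gold_set = {normalize(t) for t in gold}
    pred_set = {normalize(t) for t in predicted}

    tp = len(gold_set & pred_set)   # terms found by both annotator and model
    fp = len(pred_set - gold_set)   # terms the model added that are not annotated
    fn = len(gold_set - pred_set)   # annotated terms the model missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


# Invented example terms, for illustration only.
gold_terms = ["chest pain", "dyspnea", "cough", "fever"]
model_terms = ["Chest pain", "dyspnea", "palpitations"]
scores = score_extraction(gold_terms, model_terms)
print(scores)
```

In this invented example the model returns two of the four annotated terms plus one unannotated term, so precision is 2/3 while recall is 1/2. The gap shows how a model can score well on precision and still leave room for improvement in recall, which is the pattern the abstract points to.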
