McMurry Andrew J, Phelan Dylan, Dixon Brian E, Geva Alon, Gottlieb Daniel, Jones James R, Terry Michael, Taylor David E, Callaway Hannah, Manoharan Sneha, Miller Timothy, Olson Karen L, Mandl Kenneth D
Computational Health Informatics Program, Boston Children's Hospital, 401 Park Drive, LM5506, Mail Stop BCH3187, Boston, MA, 02215, United States, 1 617-355-4145.
Department of Pediatrics, Harvard Medical School, Boston, MA, United States.
J Med Internet Res. 2025 Jul 31;27:e72984. doi: 10.2196/72984.
Recognizing patient symptoms is fundamental to medicine, research, and public health. However, symptoms are often underreported in coded formats even though they are routinely documented in physician notes. Large language models (LLMs), noted for their generalizability, could help bridge this gap by mimicking the role of human expert chart reviewers for symptom identification.
The primary objective of this multisite study was to measure how accurately LLMs, instructed to follow chart review guidelines, identify infectious respiratory disease symptoms. The secondary objective was to evaluate LLM generalizability across sites without site-specific training, fine-tuning, or customization.
Four LLMs were evaluated: GPT-4, GPT-3.5, Llama2 70B, and Mixtral 8×7B. Prompts instructed each LLM to take on the role of a chart reviewer and to follow symptom annotation guidelines when assessing physician notes. Ground truth labels for each note were annotated by subject matter experts. Optimal prompting strategies were selected using a development corpus of 103 notes from the emergency department at Boston Children's Hospital. The performance of each LLM was measured using a test corpus of 202 notes from Boston Children's Hospital, with an International Classification of Diseases, Tenth Revision (ICD-10)-based method as a baseline. Generalizability of the most performant LLM was then measured in a validation corpus of 308 notes from 21 emergency departments in the Indiana Health Information Exchange.
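A minimal sketch of the role-based prompting strategy described above, in Python. The guideline text, symptom list, and response-parsing convention here are illustrative placeholders, not the study's actual annotation guidelines or prompt wording:

```python
# Illustrative sketch: frame an LLM as a chart reviewer that follows
# annotation guidelines, then parse its symptom labels.
# The symptom list and prompt wording are assumptions for illustration.

SYMPTOMS = ["fever", "cough", "congestion", "sore throat", "dyspnea"]


def build_messages(note_text: str, guideline: str) -> list[dict]:
    """Build chat messages casting the LLM in the chart reviewer role."""
    system = (
        "You are an expert chart reviewer. Follow these symptom "
        f"annotation guidelines exactly:\n{guideline}\n"
        "For each symptom, answer 'present' or 'absent' based only on "
        "what is documented in the note."
    )
    user = (
        f"Physician note:\n{note_text}\n\n"
        "Symptoms to assess: " + ", ".join(SYMPTOMS)
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]


def parse_labels(reply: str) -> dict[str, bool]:
    """Parse 'symptom: present/absent' lines into boolean labels."""
    labels = {}
    for line in reply.splitlines():
        if ":" in line:
            name, verdict = line.split(":", 1)
            name = name.strip().lower()
            if name in SYMPTOMS:
                labels[name] = "present" in verdict.lower()
    return labels
```

The messages would be sent to each model's chat API; keeping the guideline text in the system prompt is what lets the same prompt be reused across sites without fine-tuning.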
Every LLM tested identified each infectious disease symptom more accurately than the ICD-10-based method (F1-score=45.1%). GPT-4 scored highest (F1-score=91.4%; P&lt;.001) and was significantly better than the ICD-10-based method, followed by GPT-3.5 (F1-score=90.0%; P&lt;.001), Mixtral (F1-score=83.5%; P&lt;.001), and Llama2 (F1-score=81.7%; P&lt;.001). On the validation corpus, performance of the ICD-10-based method decreased (F1-score=26.9%), while GPT-4's increased (F1-score=94.0%), demonstrating better generalizability for GPT-4 (P&lt;.001).
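The F1-scores compared above are the standard harmonic mean of precision and recall; a minimal reference implementation (the counts below are illustrative, not the study's confusion-matrix values):

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall, from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Example with made-up counts: 90 true positives, 10 false positives,
# 10 false negatives gives precision = recall = 0.9, so F1 = 0.9.
example = f1_score(tp=90, fp=10, fn=10)
```

Because F1 ignores true negatives, it is well suited to symptom identification, where most symptoms are absent from most notes.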
LLMs significantly outperformed an ICD-10-based method for respiratory symptom identification in emergency department electronic health records. GPT-4 demonstrated the highest accuracy and generalizability, suggesting that LLMs may augment or replace traditional approaches. LLMs can be instructed to mimic human chart reviewers with high accuracy. Future work should assess broader symptom types and health care settings.