

Large Language Model Symptom Identification From Clinical Text: Multicenter Study.

Authors

McMurry Andrew J, Phelan Dylan, Dixon Brian E, Geva Alon, Gottlieb Daniel, Jones James R, Terry Michael, Taylor David E, Callaway Hannah, Manoharan Sneha, Miller Timothy, Olson Karen L, Mandl Kenneth D

Affiliations

Computational Health Informatics Program, Boston Children's Hospital, 401 Park Drive, LM5506, Mail Stop BCH3187, Boston, MA, 02215, United States, 1 617-355-4145.

Department of Pediatrics, Harvard Medical School, Boston, MA, United States.

Publication

J Med Internet Res. 2025 Jul 31;27:e72984. doi: 10.2196/72984.

DOI: 10.2196/72984
PMID: 40743494
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12313083/
Abstract

BACKGROUND

Recognizing patient symptoms is fundamental to medicine, research, and public health. However, symptoms are often underreported in coded formats even though they are routinely documented in physician notes. Large language models (LLMs), noted for their generalizability, could help bridge this gap by mimicking the role of human expert chart reviewers for symptom identification.

OBJECTIVE

The primary objective of this multisite study was to measure the accurate identification of infectious respiratory disease symptoms using LLMs instructed to follow chart review guidelines. The secondary objective was to evaluate LLM generalizability in multisite settings without the need for site-specific training, fine-tuning, or customization.

METHODS

Four LLMs were evaluated: GPT-4, GPT-3.5, Llama2 70B, and Mixtral 8×7B. LLM prompts were instructed to take on the role of chart reviewers and follow symptom annotation guidelines when assessing physician notes. Ground truth labels for each note were annotated by subject matter experts. Optimal LLM prompting strategies were selected using a development corpus of 103 notes from the emergency department at Boston Children's Hospital. The performance of each LLM was measured using a test corpus with 202 notes from Boston Children's Hospital. The performance of an International Classification of Diseases, Tenth Revision (ICD-10)-based method was also measured as a baseline. Generalizability of the most performant LLM was then measured in a validation corpus of 308 notes from 21 emergency departments in the Indiana Health Information Exchange.
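The prompting strategy described above — casting the model as a chart reviewer and asking for per-symptom judgments on a note — can be sketched as follows. This is a minimal illustration, not the study's actual prompt: the symptom list, prompt wording, and function names are all hypothetical, and the study's real annotation guidelines are not reproduced here.

```python
# Hypothetical sketch of role-play prompting for symptom identification.
# The symptom list and prompt wording are illustrative, not the study's
# actual chart review guidelines.

SYMPTOMS = ["fever", "cough", "congestion", "dyspnea", "sore throat"]

def build_chart_review_prompt(note_text: str) -> str:
    """Build a prompt casting the LLM as a chart reviewer who labels
    each symptom yes/no based only on the physician note."""
    symptom_lines = "\n".join(f"- {s}" for s in SYMPTOMS)
    return (
        "You are a chart reviewer following symptom annotation guidelines.\n"
        "For each symptom below, answer yes or no based only on the note.\n"
        f"Symptoms:\n{symptom_lines}\n\n"
        f"Physician note:\n{note_text}\n\n"
        "Answer as one line per symptom, in the form: symptom: yes|no"
    )

def parse_labels(response: str) -> dict[str, bool]:
    """Parse 'symptom: yes|no' lines from the model's reply into labels
    that can be compared against expert annotations."""
    labels = {}
    for line in response.splitlines():
        name, sep, verdict = line.partition(":")
        if sep and name.strip().lower() in SYMPTOMS:
            labels[name.strip().lower()] = verdict.strip().lower() == "yes"
    return labels
```

The parsed labels for each note would then be scored against the subject-matter experts' ground-truth annotations, the same way for every site — which is what lets the approach generalize without site-specific training or fine-tuning.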

RESULTS

Symptom identification accuracy was superior for every LLM tested for each infectious disease symptom compared to an ICD-10-based method (F1-score=45.1%). GPT-4 was the highest scoring (F1-score=91.4%; P<.001) and was significantly better than the ICD-10-based method, followed by GPT-3.5 (F1-score=90.0%; P<.001), Llama2 (F1-score=81.7%; P<.001), and Mixtral (F1-score=83.5%; P<.001). For the validation corpus, performance of the ICD-10-based method decreased (F1-score=26.9%), while GPT-4 increased (F1-score=94.0%), demonstrating better generalizability using GPT-4 (P<.001).
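The F1-scores above combine precision and recall of predicted symptom labels against the expert annotations via F1 = 2PR/(P+R). A minimal sketch of that computation from raw counts (the counts here are invented for illustration, not taken from the study):

```python
# F1-score from true-positive, false-positive, and false-negative counts:
# F1 = 2 * precision * recall / (precision + recall).

def f1_score(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example with made-up counts: 90 symptoms correctly flagged,
# 10 spurious, 10 missed -> precision = recall = 0.9, F1 = 0.9.
```

Note that an F1 gap like GPT-4's 91.4% versus ICD-10's 45.1% reflects the ICD-10 method both missing documented symptoms (low recall) and the metric penalizing each miss and false alarm symmetrically.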

CONCLUSIONS

LLMs significantly outperformed an ICD-10-based method for respiratory symptom identification in emergency department electronic health records. GPT-4 demonstrated the highest accuracy and generalizability, suggesting that LLMs may augment or replace traditional approaches. LLMs can be instructed to mimic human chart reviewers with high accuracy. Future work should assess broader symptom types and health care settings.

