
Scalable information extraction from free text electronic health records using large language models.

Authors

Gu Bowen, Shao Vivian, Liao Ziqian, Carducci Valentina, Brufau Santiago Romero, Yang Jie, Desai Rishi J

Affiliations

Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, 1620 Tremont Street, Suite 3030-R, Boston, MA, 02120, USA.

Department of Otorhinolaryngology - Head & Neck Surgery, Mayo Clinic, Rochester, MN, USA.

Publication information

BMC Med Res Methodol. 2025 Jan 28;25(1):23. doi: 10.1186/s12874-025-02470-z.

DOI: 10.1186/s12874-025-02470-z
PMID: 39871166
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11773977/
Abstract

BACKGROUND

A vast amount of potentially useful information, such as descriptions of patient symptoms, family history, and social history, is recorded as free-text notes in electronic health records (EHRs) but is difficult to extract reliably at scale, which limits its utility in research. This study assesses whether an "out of the box" implementation of open-source large language models (LLMs), without any fine-tuning, can accurately extract social determinants of health (SDoH) data from free-text clinical notes.

METHODS

We conducted a cross-sectional study using EHR data from the Mass General Brigham (MGB) system, analyzing free-text notes for SDoH information. We selected a random sample of 200 patients and manually labeled nine SDoH aspects. Eight advanced open-source LLMs were evaluated against a baseline pattern-matching model. Two human reviewers provided the manual labels, achieving 93% inter-annotator agreement. LLM performance was assessed using accuracy metrics for overall, mentioned, and non-mentioned SDoH, and macro F1 scores.
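The evaluation metrics named above can be illustrated with a minimal sketch. It assumes each SDoH aspect is scored as one categorical label per patient and that an absent aspect is encoded with a "not mentioned" label; the label strings, function names, and toy data are hypothetical and are not taken from the study.

```python
NOT_MENTIONED = "not mentioned"  # hypothetical label for an SDoH aspect absent from the note


def accuracy(pairs):
    """Fraction of (gold, predicted) pairs that agree; None if there are no pairs."""
    pairs = list(pairs)
    if not pairs:
        return None
    return sum(g == p for g, p in pairs) / len(pairs)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every class seen in gold or pred."""
    classes = sorted(set(gold) | set(pred))
    f1_scores = []
    for c in classes:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


def evaluate(gold, pred):
    """Accuracy overall, on mentioned cases, on non-mentioned cases, plus macro F1."""
    pairs = list(zip(gold, pred))
    return {
        "overall": accuracy(pairs),
        "mentioned": accuracy((g, p) for g, p in pairs if g != NOT_MENTIONED),
        "non_mentioned": accuracy((g, p) for g, p in pairs if g == NOT_MENTIONED),
        "macro_f1": macro_f1(gold, pred),
    }


# Toy data for a single aspect (smoking status), four patients.
gold = ["smoker", "not mentioned", "non-smoker", "not mentioned"]
pred = ["smoker", "not mentioned", "smoker", "smoker"]
scores = evaluate(gold, pred)
print(scores)
```

Splitting accuracy by whether the reference label is "not mentioned" separates two kinds of error: missing information that is present in the note, and reporting information that is not there (a hallucination, in the LLM case).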

RESULTS

LLMs outperformed the baseline pattern-matching approach, particularly for explicitly mentioned SDoH, achieving up to 40% higher accuracy. The best-performing model, openchat_3.5, surpassed the baseline in overall accuracy across all nine SDoH aspects. The refined pipeline with prompt engineering reduced hallucinations and improved accuracy.
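For orientation, a pattern-matching baseline of the general kind referred to above might look like the following sketch. The abstract does not describe the study's actual rules, so the single aspect (smoking status), the regular expressions, and the labels here are all hypothetical.

```python
import re

# Hypothetical patterns for one SDoH aspect; these are NOT the study's rules.
# Order matters: more specific labels are checked before the generic "smoker".
PATTERNS = [
    ("non-smoker", re.compile(r"\b(never smoked|non-?smoker|denies (smoking|tobacco))\b", re.I)),
    ("former smoker", re.compile(r"\b(former smoker|quit smoking|ex-?smoker)\b", re.I)),
    ("smoker", re.compile(r"\b(current smoker|smokes|smoker)\b", re.I)),
]


def extract_smoking_status(note: str) -> str:
    """Return the label of the first matching pattern, or 'not mentioned'."""
    for label, pattern in PATTERNS:
        if pattern.search(note):
            return label
    return "not mentioned"


print(extract_smoking_status("Patient is a former smoker, quit in 2010."))  # former smoker
print(extract_smoking_status("Denies tobacco use."))                        # non-smoker
print(extract_smoking_status("Stopped using cigarettes years ago."))        # not mentioned
```

The last call shows how a fixed pattern list can miss an explicit mention that is phrased in an unanticipated way. That brittleness is one plausible reason a rule-based baseline could trail on explicitly mentioned SDoH, although the abstract itself does not state the cause.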

CONCLUSIONS

Open-source LLMs are effective and scalable tools for extracting SDoH from unstructured EHRs, surpassing traditional pattern-matching methods. Further refinement and domain-specific training could enhance their utility in clinical research and predictive analytics, improving healthcare outcomes and addressing health disparities.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/187b/11773977/e33dbc65bf34/12874_2025_2470_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/187b/11773977/52596ada0527/12874_2025_2470_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/187b/11773977/bfa1a72056a0/12874_2025_2470_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/187b/11773977/57a2e3cc7c8f/12874_2025_2470_Fig3_HTML.jpg
