

Generative artificial intelligence for automated data extraction from unstructured medical text.

Authors

Dao Nam, Quesada Luisa, Hassan Syed Moin, Campo Monica Iturrioz, Johnson Shelsey, Ghose Suchandra, San José Estépar Raúl, Waxman Aaron, Washko George, Rahaghi Farbod N

Affiliations

Division of Pulmonary and Critical Care, Brigham and Women's Hospital, Boston, MA, United States.

Division of Sleep Medicine, Brigham and Women's Hospital, Boston, MA, United States.

Publication information

JAMIA Open. 2025 Sep 4;8(5):ooaf097. doi: 10.1093/jamiaopen/ooaf097. eCollection 2025 Oct.

DOI: 10.1093/jamiaopen/ooaf097
PMID: 40918939
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12410982/

Abstract

OBJECTIVES

Unstructured data, such as procedure notes, contain valuable medical information that is frequently underutilized due to the labor-intensive nature of data extraction. This study aims to develop a generative artificial intelligence (GenAI) pipeline using an open-source Large Language Model (LLM) with built-in guardrails and a retry mechanism to extract data from unstructured right heart catheterization (RHC) notes while minimizing errors, including hallucinations.

MATERIALS AND METHODS

A total of 220 RHC notes were randomly selected for pipeline development and 200 for validation from the Pulmonary Vascular Disease Registry. The pipeline comprised three main components: the Engineered Preload Framework (EPF), which integrated schemas and instructions; the LLM module, enhanced by reasoning capabilities; and the validation and retry mechanism, which ensured data accuracy through iterative self-correction. A clinical expert manually extracted data from the validation cohort to establish the ground truth. Pipeline performance was evaluated using precision, recall, and F1 score. Additionally, the dataset was stratified into quartiles to assess the pipeline's ability to handle varying levels of data availability.
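The validation and retry component described above can be illustrated with a minimal sketch. It is not the authors' implementation: the schema fields, the plausibility ranges, the `call_llm` callable, and the retry limit are all hypothetical, chosen only to show how schema violations might be fed back to the model for self-correction.

```python
import json
from typing import Callable, Optional

# Hypothetical schema (not from the paper): field -> plausible range in mmHg.
SCHEMA = {
    "mean_pa_pressure": (5.0, 120.0),
    "pa_wedge_pressure": (0.0, 60.0),
}


def validate(raw: str) -> tuple[Optional[dict], list[str]]:
    """Parse the model output and collect schema violations."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["output must be a JSON object"]

    errors = []
    for field, (low, high) in SCHEMA.items():
        value = data.get(field)
        if value is None:
            continue  # null or absent means "not reported in the note"
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            errors.append(f"{field}: expected a number or null")
        elif not low <= value <= high:
            errors.append(f"{field}: {value} outside [{low}, {high}]")
    for field in sorted(set(data) - set(SCHEMA)):
        errors.append(f"{field}: not part of the schema")
    return (data if not errors else None), errors


def extract_with_retry(
    note: str,
    call_llm: Callable[[str, list[str]], str],
    max_retries: int = 2,
) -> Optional[dict]:
    """Call the model, validate its output, and retry with error feedback."""
    feedback: list[str] = []
    for _ in range(max_retries + 1):
        raw = call_llm(note, feedback)
        data, errors = validate(raw)
        if data is not None:
            # Normalize so every schema field is present in the result.
            return {field: data.get(field) for field in SCHEMA}
        feedback = errors  # passed to the next attempt for self-correction
    return None  # flag the note for manual review
```

In this sketch a missing value is accepted as null rather than forcing the model to guess, which trades recall for fewer fabricated values.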

RESULTS

The pipeline achieved 99.0% precision, 85.0% recall, and a 91.5% F1 score, with an overall accuracy of 90% when evaluated at the note level. The most common error was missed values (5.2%), while hallucinations were the least frequent (<0.01%).
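As a quick consistency check, the reported F1 score follows from the reported precision and recall, since F1 is their harmonic mean. The function name below is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract: 99.0% precision, 85.0% recall.
f1 = f1_score(0.990, 0.850)
print(f"{f1:.1%}")  # prints 91.5%
```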

DISCUSSION AND CONCLUSION

This study demonstrates the feasibility of a robust GenAI pipeline for automating structured data extraction from unstructured RHC procedure notes. The approach highlights the potential of LLMs in medical data mining, improving research efficiency and clinical applications.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/10fa/12410982/1fa6c08e702f/ooaf097f5.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/10fa/12410982/3876dc55b366/ooaf097f1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/10fa/12410982/bced6bc77a09/ooaf097f2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/10fa/12410982/4a5c47187769/ooaf097f3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/10fa/12410982/df80ee5aad6a/ooaf097f4.jpg

