LLM-AIx: An open source pipeline for Information Extraction from unstructured medical text based on privacy preserving Large Language Models.

Authors

Wiest Isabella Catharina, Wolf Fabian, Leßmann Marie-Elisabeth, van Treeck Marko, Ferber Dyke, Zhu Jiefu, Boehme Heiko, Bressem Keno K, Ulrich Hannes, Ebert Matthias P, Kather Jakob Nikolas

Affiliations

Department of Medicine II, Medical Faculty Mannheim, Heidelberg University, Mannheim, Germany.

Else Kroener Fresenius Center for Digital Health, Faculty of Medicine and University Hospital Carl Gustav Carus, TUD Dresden University of Technology, 01307 Dresden, Germany.

Publication information

medRxiv. 2024 Sep 3:2024.09.02.24312917. doi: 10.1101/2024.09.02.24312917.

DOI:10.1101/2024.09.02.24312917
PMID:39281753
Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11398444/
Abstract

In clinical science and practice, text data, such as clinical letters or procedure reports, is stored in an unstructured way. This type of data is not a quantifiable resource for any kind of quantitative investigations and any manual review or structured information retrieval is time-consuming and costly. The capabilities of Large Language Models (LLMs) mark a paradigm shift in natural language processing and offer new possibilities for structured Information Extraction (IE) from medical free text. This protocol describes a workflow for LLM based information extraction (LLM-AIx), enabling extraction of predefined entities from unstructured text using privacy preserving LLMs. By converting unstructured clinical text into structured data, LLM-AIx addresses a critical barrier in clinical research and practice, where the efficient extraction of information is essential for improving clinical decision-making, enhancing patient outcomes, and facilitating large-scale data analysis. The protocol consists of four main processing steps: 1) Problem definition and data preparation, 2) data preprocessing, 3) LLM-based IE and 4) output evaluation. LLM-AIx allows integration on local hospital hardware without the need of transferring any patient data to external servers. As example tasks, we applied LLM-AIx for the anonymization of fictitious clinical letters from patients with pulmonary embolism. Additionally, we extracted symptoms and laterality of the pulmonary embolism of these fictitious letters. We demonstrate troubleshooting for potential problems within the pipeline with an IE on a real-world dataset, 100 pathology reports from the Cancer Genome Atlas Program (TCGA), for TNM stage extraction. LLM-AIx can be executed without any programming knowledge via an easy-to-use interface and in no more than a few minutes or hours, depending on the LLM model selected.
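
The four steps named in the abstract can be illustrated with a minimal Python sketch. This is not the LLM-AIx implementation: the entity schema, the function names (`preprocess`, `extract`, `accuracy`) and the keyword-based `fake_llm` stub are all hypothetical, and the stub merely stands in for a locally hosted model so that the example runs without any patient data leaving the machine.

```python
import json
from typing import Callable

# Step 1 (problem definition): hypothetical entities and their allowed values.
SCHEMA = {
    "laterality": {"left", "right", "bilateral", "unknown"},
    "dyspnea": {"yes", "no", "unknown"},
}

def preprocess(text: str) -> str:
    """Step 2 stand-in: collapse whitespace (real preprocessing may do much more)."""
    return " ".join(text.split())

def extract(text: str, llm: Callable[[str], str]) -> dict:
    """Step 3: ask the model for JSON and keep only values the schema allows."""
    prompt = f"Return a JSON object with keys {sorted(SCHEMA)} for this report:\n{text}"
    try:
        raw = json.loads(llm(prompt))
    except json.JSONDecodeError:
        raw = {}
    if not isinstance(raw, dict):
        raw = {}
    result = {}
    for key, allowed in SCHEMA.items():
        value = raw.get(key)
        # Anything missing or outside the allowed set falls back to "unknown".
        result[key] = value if isinstance(value, str) and value in allowed else "unknown"
    return result

def accuracy(predictions: list, ground_truth: list) -> dict:
    """Step 4: per-entity share of reports where the prediction matches the annotation."""
    return {
        key: sum(p[key] == g[key] for p, g in zip(predictions, ground_truth)) / len(ground_truth)
        for key in SCHEMA
    }

def fake_llm(prompt: str) -> str:
    """Hypothetical stub replacing a local model; simple keyword rules, for illustration only."""
    report = prompt.split("\n", 1)[1].lower()
    if "bilateral" in report:
        side = "bilateral"
    elif "right" in report:
        side = "right"
    elif "left" in report:
        side = "left"
    else:
        side = "unknown"
    return json.dumps({"laterality": side, "dyspnea": "yes" if "dyspnea" in report else "no"})

# Fictitious reports and annotations, invented for this sketch.
reports = [
    "Patient with  right-sided pulmonary embolism,\n presenting with dyspnea.",
    "Bilateral pulmonary embolism. No shortness of breath reported.",
]
ground_truth = [
    {"laterality": "right", "dyspnea": "yes"},
    {"laterality": "bilateral", "dyspnea": "no"},
]
predictions = [extract(preprocess(r), fake_llm) for r in reports]
scores = accuracy(predictions, ground_truth)
print(scores)
```

Constraining each entity to a closed set of values is one way to make model output comparable with manual annotations; a real deployment would replace `fake_llm` with a call to whatever model runs on the hospital's own hardware.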

