Robust privacy amidst innovation with large language models through a critical assessment of the risks.

Author information

Chuang Yao-Shun, Sarkar Atiquer Rahman, Hsu Yu-Chun, Mohammed Noman, Jiang Xiaoqian

Affiliations

McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, United States.

Department of Computer Science, University of Manitoba, Winnipeg, Manitoba R3T 5V6, Canada.

Publication information

J Am Med Inform Assoc. 2025 May 1;32(5):885-892. doi: 10.1093/jamia/ocaf037.

DOI:10.1093/jamia/ocaf037
PMID:40112189
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12012348/
Abstract

OBJECTIVE

This study evaluates the integration of electronic health records (EHRs) and natural language processing (NLP) with large language models (LLMs) to enhance healthcare data management and patient care. It focuses on using advanced language models to create secure synthetic patient notes that comply with the Health Insurance Portability and Accountability Act (HIPAA), for use in global biomedical research.

MATERIALS AND METHODS

The study used de-identified and re-identified versions of the MIMIC III dataset with GPT-3.5, GPT-4, and Mistral 7B to generate synthetic clinical notes. Text generation employed templates and keyword extraction for contextually relevant notes, with One-shot generation for comparison. Privacy was assessed by analyzing protected health information (PHI) occurrence and co-occurrence, while utility was evaluated by training an ICD-9 coder using synthetic notes. Text quality was measured using ROUGE (Recall-Oriented Understudy for Gisting Evaluation) and cosine similarity metrics to compare synthetic notes with source notes for semantic similarity.
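
As a rough illustration of the text-quality metrics named above, the following sketch computes ROUGE-1 recall and a cosine similarity between a source note and a synthetic note. It rests on assumptions not stated in the abstract: the tokenizer, the function names, and the use of simple term-frequency vectors for the cosine similarity are all hypothetical, and the study may well have used a standard ROUGE package and embedding-based similarity instead.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1_recall(reference, candidate):
    """Share of reference unigrams (counted with multiplicity) found in the candidate."""
    ref = Counter(tokenize(reference))
    cand = Counter(tokenize(candidate))
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[token]) for token, count in ref.items())
    return overlap / total


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between term-frequency vectors (an assumed representation)."""
    vec_a = Counter(tokenize(text_a))
    vec_b = Counter(tokenize(text_b))
    dot = sum(vec_a[token] * vec_b[token] for token in vec_a)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented example strings, not taken from MIMIC III.
source_note = "patient admitted with chest pain"
synthetic_note = "patient presented with chest pain and fever"

recall = rouge1_recall(source_note, synthetic_note)
similarity = cosine_similarity(source_note, synthetic_note)
```

Here four of the five source tokens reappear in the synthetic note, so the recall is 4/5. Higher values on either metric indicate that the synthetic note stays closer to the wording of the source note.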

RESULTS

The analysis of PHI occurrence and text utility via the ICD-9 coding task showed that the keyword-based method had low risk and good performance. One-shot generation exhibited the highest PHI exposure and PHI co-occurrence, particularly in geographic location and date categories. The Normalized One-shot method achieved the highest classification accuracy. Re-identified data consistently outperformed de-identified data.
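
The PHI occurrence and co-occurrence analysis can be pictured with a small counting sketch. It assumes each synthetic note has already been tagged with the PHI categories detected in it; the category labels, the function name `phi_counts`, and the sample data are hypothetical and do not reproduce the study's detection pipeline or results.

```python
from collections import Counter
from itertools import combinations


def phi_counts(notes_phi):
    """Count notes containing each PHI category and each pair of categories."""
    occurrence = Counter()
    co_occurrence = Counter()
    for categories in notes_phi:
        # Deduplicate and sort so each pair has one canonical ordering.
        unique = sorted(set(categories))
        occurrence.update(unique)
        co_occurrence.update(combinations(unique, 2))
    return occurrence, co_occurrence


# Invented per-note PHI categories (not study data).
notes_phi = [
    ["DATE", "LOCATION"],
    ["DATE"],
    ["DATE", "LOCATION", "NAME"],
    [],
]

occurrence, co_occurrence = phi_counts(notes_phi)
```

Co-occurrence matters because a combination of categories in one note, such as a date together with a location, can narrow down an individual more than either category alone.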

DISCUSSION

Privacy analysis revealed a critical balance between data utility and privacy protection, influencing future data use and sharing.

CONCLUSION

This study shows that keyword-based methods can create synthetic clinical notes that protect privacy while retaining data usability, potentially improving clinical data sharing. The use of dummy PHIs to counter privacy attacks may offer better utility and privacy than traditional de-identification.
