Robust privacy amidst innovation with large language models through a critical assessment of the risks.

Authors

Chuang Yao-Shun, Sarkar Atiquer Rahman, Hsu Yu-Chun, Mohammed Noman, Jiang Xiaoqian

Affiliations

McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, United States.

Department of Computer Science, University of Manitoba, Winnipeg, Manitoba R3T 5V6, Canada.

Publication

J Am Med Inform Assoc. 2025 May 1;32(5):885-892. doi: 10.1093/jamia/ocaf037.

Abstract

OBJECTIVE

This study evaluates the integration of electronic health records (EHRs) and natural language processing (NLP) with large language models (LLMs) to enhance healthcare data management and patient care, focusing on using advanced language models to create secure, Health Insurance Portability and Accountability Act-compliant synthetic patient notes for global biomedical research.

MATERIALS AND METHODS

The study used de-identified and re-identified versions of the MIMIC III dataset with GPT-3.5, GPT-4, and Mistral 7B to generate synthetic clinical notes. Text generation employed templates and keyword extraction for contextually relevant notes, with One-shot generation for comparison. Privacy was assessed by analyzing protected health information (PHI) occurrence and co-occurrence, while utility was evaluated by training an ICD-9 coder using synthetic notes. Text quality was measured using ROUGE (Recall-Oriented Understudy for Gisting Evaluation) and cosine similarity metrics to compare synthetic notes with source notes for semantic similarity.
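As a rough illustration of the text-quality comparison described above, the following minimal sketch computes a ROUGE-1 recall score and a cosine similarity between a source note and a synthetic note. It is not the study's implementation: the function names, the simple regex tokenizer, and the use of bag-of-words count vectors for cosine similarity are all assumptions made for this example (the abstract does not say which text representation was used).

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1_recall(reference, candidate):
    """Fraction of reference unigrams that also appear in the candidate (clipped counts)."""
    ref = Counter(tokenize(reference))
    cand = Counter(tokenize(candidate))
    if not ref:
        return 0.0
    overlap = sum(min(count, cand[tok]) for tok, count in ref.items())
    return overlap / sum(ref.values())


def cosine_similarity(text_a, text_b):
    """Cosine similarity of bag-of-words count vectors (an assumed representation)."""
    va, vb = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(va[tok] * vb[tok] for tok in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(
        sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    source = "Patient admitted with chest pain and shortness of breath."
    synthetic = "The patient was admitted with chest pain."
    print(rouge1_recall(source, synthetic))
    print(cosine_similarity(source, synthetic))
```

Higher values on either metric indicate that the synthetic note stays closer to the source note, which is useful for quality but can also signal that more source detail is being carried over.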

RESULTS

The analysis of PHI occurrence and text utility via the ICD-9 coding task showed that the keyword-based method had low risk and good performance. One-shot generation exhibited the highest PHI exposure and PHI co-occurrence, particularly in geographic location and date categories. The Normalized One-shot method achieved the highest classification accuracy. Re-identified data consistently outperformed de-identified data.
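To make the notions of PHI occurrence and co-occurrence concrete, here is a minimal sketch that counts how many notes contain each PHI category and how many contain each pair of categories together. The regular expressions, the tiny place-name list, and the function names are hypothetical stand-ins chosen for illustration; the abstract does not describe how the study detected PHI.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical detectors: real PHI detection needs far broader coverage.
PHI_PATTERNS = {
    "date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "location": re.compile(r"\b(?:Houston|Winnipeg)\b"),  # assumed toy gazetteer
}


def phi_categories(note):
    """Return the set of PHI categories detected in a single note."""
    return {name for name, pattern in PHI_PATTERNS.items() if pattern.search(note)}


def phi_stats(notes):
    """Count notes per PHI category and notes per pair of co-occurring categories."""
    occurrence = Counter()
    cooccurrence = Counter()
    for note in notes:
        cats = sorted(phi_categories(note))
        occurrence.update(cats)
        # Each unordered pair of categories found in the same note counts once.
        cooccurrence.update(combinations(cats, 2))
    return occurrence, cooccurrence


if __name__ == "__main__":
    notes = [
        "Seen on 01/02/2020 in Houston.",
        "Call 555-123-4567.",
        "No identifiers here.",
    ]
    occurrence, cooccurrence = phi_stats(notes)
    print(occurrence)
    print(cooccurrence)
```

Co-occurrence matters because a combination such as a date plus a location in one note can narrow down an individual far more than either item alone.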

DISCUSSION

Privacy analysis revealed a critical balance between data utility and privacy protection, influencing future data use and sharing.

CONCLUSION

This study shows that keyword-based methods can create synthetic clinical notes that protect privacy while retaining data usability, potentially improving clinical data sharing. The use of dummy PHIs to counter privacy attacks may offer better utility and privacy than traditional de-identification.
