Robust privacy amidst innovation with large language models through a critical assessment of the risks.

Author information

Chuang Yao-Shun, Sarkar Atiquer Rahman, Hsu Yu-Chun, Mohammed Noman, Jiang Xiaoqian

Affiliations

McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, United States.

Department of Computer Science, University of Manitoba, Winnipeg, Manitoba R3T 5V6, Canada.

Publication information

J Am Med Inform Assoc. 2025 May 1;32(5):885-892. doi: 10.1093/jamia/ocaf037.

DOI:10.1093/jamia/ocaf037

PMID:40112189

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC12012348/

Abstract

OBJECTIVE

This study evaluates the integration of electronic health records (EHRs) and natural language processing (NLP) with large language models (LLMs) to enhance healthcare data management and patient care, focusing on using advanced language models to create secure, Health Insurance Portability and Accountability Act-compliant synthetic patient notes for global biomedical research.

MATERIALS AND METHODS

The study used de-identified and re-identified versions of the MIMIC III dataset with GPT-3.5, GPT-4, and Mistral 7B to generate synthetic clinical notes. Text generation employed templates and keyword extraction for contextually relevant notes, with One-shot generation for comparison. Privacy was assessed by analyzing protected health information (PHI) occurrence and co-occurrence, while utility was evaluated by training an ICD-9 coder using synthetic notes. Text quality was measured using ROUGE (Recall-Oriented Understudy for Gisting Evaluation) and cosine similarity metrics to compare synthetic notes with source notes for semantic similarity.
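As a rough illustration of the text-quality comparison described above, the sketch below computes a ROUGE-1 recall score and a cosine similarity between a source note and a synthetic note. It is a minimal sketch under stated assumptions, not the study's implementation: the helper names (`tokenize`, `rouge1_recall`, `cosine_similarity`) and the example sentences are hypothetical, and the cosine similarity here uses simple word-count vectors as a stand-in, whereas the abstract does not say which text representation the authors used.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1_recall(reference, candidate):
    """Share of reference unigrams (counted with multiplicity) that also occur in the candidate."""
    ref_counts = Counter(tokenize(reference))
    cand_counts = Counter(tokenize(candidate))
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(n, cand_counts[tok]) for tok, n in ref_counts.items())
    return overlap / total


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between word-count vectors (an assumed, simplified representation)."""
    a = Counter(tokenize(text_a))
    b = Counter(tokenize(text_b))
    dot = sum(a[tok] * b[tok] for tok in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical example texts, not taken from MIMIC III.
source_note = "patient admitted with chest pain and shortness of breath"
synthetic_note = "patient presented with chest pain"

print(rouge1_recall(source_note, synthetic_note))
print(cosine_similarity(source_note, synthetic_note))
```

Higher values on either score mean the synthetic note stays closer to its source, which is good for fidelity but is also the direction in which source PHI is more likely to carry over.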

RESULTS

The analysis of PHI occurrence and text utility via the ICD-9 coding task showed that the keyword-based method had low risk and good performance. One-shot generation exhibited the highest PHI exposure and PHI co-occurrence, particularly in geographic location and date categories. The Normalized One-shot method achieved the highest classification accuracy. Re-identified data consistently outperformed de-identified data.
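To make the two privacy measures concrete, the following sketch counts how many notes contain each PHI category (occurrence) and how many contain each pair of categories together (co-occurrence). Everything in it is an assumption made for illustration: the regular expressions, the category names, and the sample notes are hypothetical, and a real PHI detector would need far broader coverage than these patterns.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical, deliberately simple patterns; not the study's PHI detector.
PHI_PATTERNS = {
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "location": re.compile(r"\bSpringfield\b"),
}


def phi_categories(note):
    """Return the set of PHI categories with at least one match in the note."""
    return {name for name, pattern in PHI_PATTERNS.items() if pattern.search(note)}


def phi_statistics(notes):
    """Count notes per PHI category and notes per pair of co-occurring categories."""
    occurrence = Counter()
    co_occurrence = Counter()
    for note in notes:
        categories = sorted(phi_categories(note))
        occurrence.update(categories)
        co_occurrence.update(combinations(categories, 2))
    return occurrence, co_occurrence


# Made-up notes for demonstration only.
sample_notes = [
    "Seen on 2020-01-05 in Springfield.",
    "Call 555-123-4567 for follow-up.",
    "No identifiers here.",
]

occurrence, co_occurrence = phi_statistics(sample_notes)
print(dict(occurrence))
print(dict(co_occurrence))
```

Co-occurrence is counted separately because several identifiers appearing in the same note, such as a date together with a location, can narrow down an individual more than any single identifier does.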

DISCUSSION

Privacy analysis revealed a critical balance between data utility and privacy protection, influencing future data use and sharing.

CONCLUSION

This study shows that keyword-based methods can create synthetic clinical notes that protect privacy while retaining data usability, potentially improving clinical data sharing. The use of dummy PHIs to counter privacy attacks may offer better utility and privacy than traditional de-identification.
