Chuang Yao-Shun, Sarkar Atiquer Rahman, Hsu Yu-Chun, Mohammed Noman, Jiang Xiaoqian
McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, United States.
Department of Computer Science, University of Manitoba, Winnipeg, Manitoba R3T 5V6, Canada.
J Am Med Inform Assoc. 2025 May 1;32(5):885-892. doi: 10.1093/jamia/ocaf037.
This study evaluates the use of large language models (LLMs) together with electronic health records (EHRs) and natural language processing (NLP) to enhance healthcare data management and patient care. It focuses on generating secure, Health Insurance Portability and Accountability Act (HIPAA)-compliant synthetic patient notes for global biomedical research.
The study used de-identified and re-identified versions of the MIMIC-III dataset with GPT-3.5, GPT-4, and Mistral 7B to generate synthetic clinical notes. Notes were generated from templates and extracted keywords to keep them contextually relevant, and One-shot generation served as a comparison. Privacy was assessed by analyzing the occurrence and co-occurrence of protected health information (PHI), while utility was evaluated by training an ICD-9 coding model on the synthetic notes. Text quality was measured with ROUGE (Recall-Oriented Understudy for Gisting Evaluation) and cosine similarity, comparing each synthetic note with its source note for semantic similarity.
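The two text-quality metrics named above can be illustrated with a minimal sketch. It assumes unigram ROUGE (ROUGE-1 recall) and cosine similarity over plain term-count vectors; the abstract does not state which ROUGE variant or text representation the study used, and the function names here are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1_recall(source, synthetic):
    """Fraction of source unigrams that also appear in the synthetic note (clipped counts)."""
    src = Counter(tokenize(source))
    syn = Counter(tokenize(synthetic))
    if not src:
        return 0.0
    overlap = sum(min(count, syn[tok]) for tok, count in src.items())
    return overlap / sum(src.values())


def cosine_similarity(a, b):
    """Cosine similarity between the term-count vectors of two texts."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[tok] * vb[tok] for tok in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    source_note = "patient has fever and cough"
    synthetic_note = "patient has cough"
    print(rouge1_recall(source_note, synthetic_note))
    print(cosine_similarity(source_note, synthetic_note))
```

Term counts capture only lexical overlap; a semantic comparison would normally apply the same cosine formula to sentence embeddings, which this sketch leaves out to stay dependency-free.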
Analysis of PHI occurrence, together with text utility measured on the ICD-9 coding task, showed that the keyword-based method combined low privacy risk with good performance. One-shot generation exhibited the highest PHI exposure and PHI co-occurrence, particularly in the geographic location and date categories. The Normalized One-shot method achieved the highest classification accuracy. Re-identified data consistently outperformed de-identified data.
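One way to picture PHI occurrence and co-occurrence counting is sketched below. It assumes the PHI values of each source note are already known and grouped by category, and it treats a category as leaked when any of its values appears verbatim in the synthetic note. The category names, the matching rule, and the function names are illustrative assumptions, not the study's procedure.

```python
from collections import Counter
from itertools import combinations


def leaked_categories(source_phi, synthetic_note):
    """Return the PHI categories whose source values appear verbatim in the synthetic note."""
    text = synthetic_note.lower()
    return {
        category
        for category, values in source_phi.items()
        if any(value.lower() in text for value in values)
    }


def phi_statistics(pairs):
    """Count per-category leaks and pairs of categories leaked in the same note.

    `pairs` is an iterable of (source_phi, synthetic_note) tuples, where
    source_phi maps a category name to the list of PHI strings in the source note.
    """
    occurrence = Counter()
    cooccurrence = Counter()
    for source_phi, note in pairs:
        leaked = sorted(leaked_categories(source_phi, note))
        occurrence.update(leaked)
        cooccurrence.update(combinations(leaked, 2))
    return occurrence, cooccurrence


if __name__ == "__main__":
    # Made-up example values, not taken from any dataset.
    examples = [
        (
            {"date": ["2101-03-04"], "location": ["Houston"], "name": ["John Doe"]},
            "Admitted in Houston on 2101-03-04 with chest pain.",
        ),
        (
            {"date": ["2102-01-01"], "location": ["Boston"]},
            "Patient seen in Boston for follow-up.",
        ),
    ]
    occurrence, cooccurrence = phi_statistics(examples)
    print(occurrence)
    print(cooccurrence)
```

Co-occurrence is tracked separately because several PHI categories leaking in the same note (for example a date together with a location) narrow down a patient's identity more than either category alone.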
The privacy analysis revealed a critical trade-off between data utility and privacy protection, one that will shape how clinical data are used and shared in the future.
This study shows that keyword-based methods can create synthetic clinical notes that protect privacy while retaining data usability, potentially improving clinical data sharing. The use of dummy PHIs to counter privacy attacks may offer better utility and privacy than traditional de-identification.