Exploring the Privacy-Preserving Properties of Word Embeddings: Algorithmic Validation Study.

Authors

Abdalla Mohamed, Abdalla Moustafa, Hirst Graeme, Rudzicz Frank

Affiliations

Department of Computer Science, University of Toronto, Toronto, ON, Canada.

The Vector Institute for Artificial Intelligence, Toronto, ON, Canada.

Publication information

J Med Internet Res. 2020 Jul 15;22(7):e18055. doi: 10.2196/18055.

DOI:10.2196/18055
PMID:32673230
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7391163/

Abstract

BACKGROUND

Word embeddings are dense numeric vectors used to represent language in neural networks. Until recently, there had been no publicly released embeddings trained on clinical data. Our work is the first to study the privacy implications of releasing these models.

OBJECTIVE

This paper aims to demonstrate that traditional word embeddings created on clinical corpora that have been deidentified by removing personal health information (PHI) can nonetheless be exploited to reveal sensitive patient information.

METHODS

We used embeddings created from 400,000 doctor-written consultation notes and experimented with 3 common word embedding methods to explore the privacy-preserving properties of each.

RESULTS

We found that if publicly released embeddings are trained from a corpus anonymized by PHI removal, it is possible to reconstruct up to 68.5% (n=411/600) of the full names that remain in the deidentified corpus and to associate sensitive information with specific patients in the corpus from which the embeddings were created. We also found that the distance between the word vector representation of a patient's name and a diagnostic billing code is informative and differs significantly from the distance between the name and a code not billed for that patient.
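The distance comparison described above can be illustrated with a minimal sketch. The abstract does not state which distance measure the authors used, so cosine distance is an assumption here, and the three-dimensional vectors, the token `jane_doe`, and the code labels are all hypothetical toy values rather than data from the study.

```python
import math

def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical toy embeddings (assumed for illustration, not from the paper).
embeddings = {
    "jane_doe": [0.9, 0.1, 0.2],          # a name token left in the corpus
    "code_billed": [0.8, 0.2, 0.1],       # a code billed for this patient
    "code_not_billed": [-0.1, 0.9, 0.3],  # a code never billed for this patient
}

def rank_codes(name, codes, table):
    """Sort candidate codes by cosine distance to the name vector, nearest first."""
    return sorted(codes, key=lambda c: cosine_distance(table[name], table[c]))

ranking = rank_codes("jane_doe", ["code_not_billed", "code_billed"], embeddings)
print(ranking)
```

In this toy setup the billed code sits closer to the name than the unbilled one, which is the kind of signal the abstract reports as informative. Real embeddings have hundreds of dimensions, and the difference would need a statistical test rather than a single comparison.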

CONCLUSIONS

Special care must be taken when sharing word embeddings created from clinical texts, as current approaches may compromise patient privacy. If PHI removal is used for anonymization before traditional word embeddings are trained, it is possible to attribute sensitive information to patients who have not been fully deidentified by the (necessarily imperfect) removal algorithms. A promising alternative (ie, anonymization by PHI replacement) may avoid these flaws. Our results are timely and critical, as an increasing number of researchers are pushing for publicly available health data.
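The contrast between PHI removal and PHI replacement can be sketched on a single toy sentence. This is not the authors' pipeline: the detector output (`DETECTED_NAMES`), the surrogate pool (`SURROGATES`), and the example note are hypothetical, and the detector is assumed to have missed one name, standing in for the "necessarily imperfect" removal step.

```python
# Hypothetical detector output: the only name the de-identifier found.
DETECTED_NAMES = ["John Smith"]
# Hypothetical pool of surrogate names used for replacement.
SURROGATES = ["Alex Morgan", "Sam Lee"]

def remove_phi(text, detected):
    """Anonymize by removal: swap each detected name for a placeholder tag."""
    for name in detected:
        text = text.replace(name, "[NAME]")
    return text

def replace_phi(text, detected, surrogates):
    """Anonymize by replacement: swap each detected name for a surrogate name."""
    for i, name in enumerate(detected):
        text = text.replace(name, surrogates[i % len(surrogates)])
    return text

# "Mary Jones" is deliberately absent from DETECTED_NAMES (a missed name).
note = "John Smith seen with Mary Jones regarding diabetes."
removed = remove_phi(note, DETECTED_NAMES)
replaced = replace_phi(note, DETECTED_NAMES, SURROGATES)
print(removed)
print(replaced)
```

After removal, the missed name is the only name left in the text, so any name token that survives into the embedding vocabulary is likely real. After replacement, the missed name sits among surrogates, which is one plausible reading of why the abstract calls replacement a promising alternative.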


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5d1c/7391163/a051453108e2/jmir_v22i7e18055_fig1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5d1c/7391163/4ba2c39c7d5c/jmir_v22i7e18055_fig2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5d1c/7391163/538d34fd8e48/jmir_v22i7e18055_fig3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5d1c/7391163/01ae03ab102c/jmir_v22i7e18055_fig4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5d1c/7391163/b34d53a3b2cb/jmir_v22i7e18055_fig5.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5d1c/7391163/c4dac43d57c4/jmir_v22i7e18055_fig6.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5d1c/7391163/d8987ff67953/jmir_v22i7e18055_fig7.jpg


