Mohamed Abdalla, Moustafa Abdalla, Graeme Hirst, Frank Rudzicz
Department of Computer Science, University of Toronto, Toronto, ON, Canada.
The Vector Institute for Artificial Intelligence, Toronto, ON, Canada.
J Med Internet Res. 2020 Jul 15;22(7):e18055. doi: 10.2196/18055.
Word embeddings are dense numeric vectors used to represent language in neural networks. Until recently, there had been no publicly released embeddings trained on clinical data. Our work is the first to study the privacy implications of releasing these models.
This paper aims to demonstrate that traditional word embeddings created on clinical corpora that have been deidentified by removing personal health information (PHI) can nonetheless be exploited to reveal sensitive patient information.
We used embeddings created from 400,000 doctor-written consultation notes and experimented with 3 common word embedding methods to explore the privacy-preserving properties of each.
We found that if publicly released embeddings are trained on a corpus anonymized by PHI removal, it is possible to reconstruct up to 68.5% (n=411/600) of the full names that remain in the deidentified corpus and to link sensitive information to specific patients in the corpus from which the embeddings were created. We also found that the distance between the word vector representation of a patient's name and that of a diagnostic billing code is informative: it differs significantly from the distance between the name and a code not billed for that patient.
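The distance signal underlying this finding can be illustrated with a minimal sketch. The vectors below are toy 4-dimensional stand-ins (the paper's clinical embeddings are not public, and these values are purely hypothetical); the point is only that a name co-occurring with a billed code in the training corpus tends to end up closer to that code's vector than to an unrelated code's vector.

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two embedding vectors."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy vectors (hypothetical values, not from the paper's embeddings).
name_vec = np.array([1.0, 0.5, 0.0, 0.0])           # patient name
billed_code_vec = np.array([0.9, 0.6, 0.1, 0.0])    # code billed for that patient
unbilled_code_vec = np.array([0.0, 0.1, 1.0, 0.8])  # unrelated code

d_billed = cosine_distance(name_vec, billed_code_vec)
d_unbilled = cosine_distance(name_vec, unbilled_code_vec)

# Names that co-occur with a code during training lie closer to it,
# so comparing these two distances leaks whether the code was billed.
print(d_billed < d_unbilled)
```

An attacker with the released embeddings can run exactly this comparison for every (name, code) pair, which is why the abstract flags distance itself as the leaked signal.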
Special care must be taken when sharing word embeddings created from clinical texts, as current approaches may compromise patient privacy. If PHI removal is used for anonymization before traditional word embeddings are trained, it is possible to attribute sensitive information to patients who have not been fully deidentified by the (necessarily imperfect) removal algorithms. A promising alternative (ie, anonymization by PHI replacement) may avoid these flaws. Our results are timely and critical, as an increasing number of researchers are pushing for publicly available health data.