Department of Computer Science, University of Manitoba, Winnipeg, MB, R3T 2N2, Canada.
Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, USA.
BMC Med Inform Decis Mak. 2019 Sep 7;19(1):183. doi: 10.1186/s12911-019-0867-z.
Sharing medical data is a major challenge in biomedicine and often hinders collaborative research. Because of privacy concerns, clinical notes cannot be shared directly. Considerable effort has been devoted to de-identifying clinical notes, but accurately locating and scrubbing all sensitive elements automatically remains very difficult. An alternative approach is to remove sentences that might contain sensitive terms related to personal information.
A previous study introduced a frequency-based filtering approach that removes sentences containing low-frequency bigrams to improve privacy protection without significantly decreasing utility. Our work extends this method to clinical notes held by distributed sources, taking security and privacy into account. We developed a novel secure protocol based on private set intersection and secure thresholding to identify uncommon and low-frequency terms, which can then be used to guide sentence filtering.
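The frequency-based filtering idea can be illustrated with a minimal, non-secure sketch on a single site's data. The tokenization (lowercased whitespace splitting), the `min_count` threshold, and the function names are assumptions made for illustration; they are not taken from the study, and the secure multi-party steps are not modeled here.

```python
from collections import Counter


def bigrams(sentence):
    """Return the list of adjacent token pairs in a sentence."""
    tokens = sentence.lower().split()
    return list(zip(tokens, tokens[1:]))


def filter_sentences(sentences, min_count=2):
    """Keep only sentences whose bigrams all occur at least min_count times.

    Counts are raw occurrences over the whole collection (an assumption).
    Sentences with fewer than two tokens have no bigrams and are kept.
    """
    counts = Counter(bg for s in sentences for bg in bigrams(s))
    return [
        s for s in sentences
        if all(counts[bg] >= min_count for bg in bigrams(s))
    ]


notes = [
    "patient reports chest pain",
    "patient reports chest pain",
    "patient reports mild headache",
    "john smith reports chest pain",
]

# Sentences containing a rare bigram such as ("john", "smith") are dropped.
kept = filter_sentences(notes, min_count=2)
```

In this toy input, only the two repeated sentences survive, because every other sentence contains at least one bigram seen only once.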
As the computational cost of our proposed framework mostly depends on the cardinality of the intersection of the sets and the number of data owners, we evaluated the framework in terms of these two factors. Experimental results demonstrate that our proposed method is scalable in various experimental settings. In addition, we evaluated our framework in terms of data utility. This evaluation shows that the proposed method is able to retain enough information for data analysis.
This work demonstrates the feasibility of using homomorphic encryption to develop a secure and efficient multi-party protocol.
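To show how homomorphic encryption allows term counts from several data owners to be combined without revealing individual counts, here is a toy sketch using a Paillier-style additively homomorphic scheme. The abstract does not name the cryptosystem, so the choice of Paillier, the tiny hard-coded primes, and all function names are assumptions; the parameters are far too small to be secure, and the sketch decrypts the total directly rather than performing a secure threshold comparison.

```python
import math
import secrets


def keygen(p=1009, q=1013):
    """Toy key generation with small fixed primes (insecure, illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    # With generator g = n + 1, L(g^lam mod n^2) equals lam mod n,
    # so mu is simply the inverse of lam modulo n.
    mu = pow(lam, -1, n)
    return n, (lam, mu)


def encrypt(n, m):
    """Encrypt integer m (0 <= m < n) under public key n."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    # (1 + n)^m mod n^2 simplifies to 1 + m*n.
    return ((1 + m * n) * pow(r, n, n2)) % n2


def add(n, c1, c2):
    """Homomorphic addition: multiplying ciphertexts adds the plaintexts."""
    return (c1 * c2) % (n * n)


def decrypt(n, private_key, c):
    """Recover the plaintext from ciphertext c."""
    lam, mu = private_key
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n


# Hypothetical per-owner counts of one bigram at three sites.
owner_counts = [3, 0, 5]
n, private_key = keygen()
ciphertexts = [encrypt(n, count) for count in owner_counts]

# An aggregator combines ciphertexts without seeing any individual count.
encrypted_total = ciphertexts[0]
for c in ciphertexts[1:]:
    encrypted_total = add(n, encrypted_total, c)

total = decrypt(n, private_key, encrypted_total)
```

Only the key holder learns `total`; a real deployment would need proper key sizes and a protocol that compares the total against the threshold without revealing it.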