Li Xiao-Bai, Qin Jialun
Department of Operations and Information Systems, Manning School of Business, University of Massachusetts Lowell, Lowell, Massachusetts 01854.
Inf Syst Res. 2017;28(2):332-352. doi: 10.1287/isre.2016.0676. Epub 2017 Apr 12.
Health information technology has increased accessibility of health and medical data and benefited medical research and healthcare management. However, there are rising concerns about patient privacy in sharing medical and healthcare data. Much of this data is in free-text form. Existing techniques for privacy-preserving data sharing deal largely with structured data. Current privacy approaches for medical text data focus on detection and removal of patient identifiers from the data, which may be inadequate for protecting privacy or preserving data quality. We propose a new systematic approach to extract, cluster, and anonymize medical text records. Our approach integrates methods developed in both the data privacy and health informatics fields. The key novel elements of our approach include a recursive partitioning method to cluster medical text records based on the similarity of the health and medical information and a value-enumeration method to anonymize potentially identifying information in the text data. An experimental study is conducted using real-world medical documents. The results of the experiments demonstrate the effectiveness of the proposed approach.
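The two elements named in the abstract, recursive partitioning of records by similarity and value enumeration of potentially identifying information, can be illustrated with a minimal sketch. The abstract does not give the algorithms, so everything below is an assumption made for illustration: the record layout (`id`, `terms`, `age`, `city`), the Jaccard similarity measure, the seed-based splitting rule, the minimum cluster size `k`, and the reading of value enumeration as replacing each value with the set of values observed in its cluster. It should not be taken as the authors' method.

```python
from typing import Any, Dict, List, Set

Record = Dict[str, Any]  # hypothetical layout: id, terms (set of str), age, city


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Similarity of two term sets (an assumed measure, not from the paper)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def partition(records: List[Record], k: int) -> List[List[Record]]:
    """Recursively split records so that every cluster keeps at least k records."""
    if len(records) < 2 * k:
        return [records]  # too small to split into two clusters of size >= k
    # Choose two dissimilar seed records.
    seed_a = min(records, key=lambda r: jaccard(records[0]["terms"], r["terms"]))
    seed_b = min(records, key=lambda r: jaccard(seed_a["terms"], r["terms"]))

    def score(r: Record) -> float:
        # Negative when r is closer to seed_a, positive when closer to seed_b.
        return jaccard(seed_b["terms"], r["terms"]) - jaccard(seed_a["terms"], r["terms"])

    ranked = sorted(records, key=score)
    closer_to_a = sum(1 for r in ranked if score(r) < 0)
    # Clamp the cut point so both sides hold at least k records.
    cut = max(k, min(len(ranked) - k, closer_to_a))
    return partition(ranked[:cut], k) + partition(ranked[cut:], k)


def enumerate_values(cluster: List[Record], fields: List[str]) -> List[Record]:
    """Replace each listed field with the set of values seen in the cluster."""
    merged = {
        f: "{" + ", ".join(sorted({str(r[f]) for r in cluster})) + "}"
        for f in fields
    }
    return [{**r, **merged} for r in cluster]


# Toy records, invented for this example only.
records: List[Record] = [
    {"id": 0, "terms": {"diabetes", "insulin"}, "age": "34", "city": "Lowell"},
    {"id": 1, "terms": {"diabetes", "metformin"}, "age": "36", "city": "Boston"},
    {"id": 2, "terms": {"fracture", "cast"}, "age": "61", "city": "Lowell"},
    {"id": 3, "terms": {"fracture", "surgery"}, "age": "58", "city": "Salem"},
]

clusters = partition(records, k=2)
released = [enumerate_values(c, ["age", "city"]) for c in clusters]

for cluster in released:
    for rec in cluster:
        print(rec["id"], rec["age"], rec["city"])
```

In this sketch, records in the same cluster share identical enumerated values for the chosen fields, so those fields alone cannot single out one record within the cluster, while the medical terms used for clustering are left unchanged.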