Henriksson Aron, Kvist Maria, Dalianis Hercules
Department of Computer and Systems Sciences, Stockholm University, Sweden.
Stud Health Technol Inform. 2017;235:216-220.
Obscuring protected health information (PHI) in the clinical text of health records facilitates the secondary use of healthcare data in a privacy-preserving manner. Although automatic de-identification of clinical text using machine learning holds much promise, little is known about the relative prevalence of PHI in different types of clinical text, or about whether domain adaptation is needed when predictive models are learned from one domain and applied to another. In this study, we address these questions by training a predictive model and using it to estimate the prevalence of PHI in clinical text written (1) in different clinical specialties, (2) in different types of notes (i.e., under different headings), and (3) by persons in different professional roles. We find that the overall PHI density is 1.57%; however, substantial differences exist across domains.
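The abstract does not define how PHI density is computed. As a minimal sketch, assuming density means the share of tokens that a de-identification model tags as PHI, a per-domain estimate could be calculated as below. The function names, the `"O"` label for non-PHI tokens, and the sample data are all hypothetical and are not taken from the study.

```python
from collections import defaultdict


def phi_density_by_domain(tagged_notes):
    """Return {domain: fraction of tokens tagged as PHI}.

    tagged_notes: iterable of (domain, labels) pairs, where labels is a
    list with one predicted label per token and "O" marks a non-PHI token
    (an assumed labelling convention, not one stated in the abstract).
    """
    phi_tokens = defaultdict(int)
    all_tokens = defaultdict(int)
    for domain, labels in tagged_notes:
        all_tokens[domain] += len(labels)
        phi_tokens[domain] += sum(1 for label in labels if label != "O")
    # Skip domains without tokens to avoid division by zero.
    return {d: phi_tokens[d] / n for d, n in all_tokens.items() if n > 0}


def overall_phi_density(tagged_notes):
    """Return the fraction of PHI tokens pooled over all notes."""
    phi = total = 0
    for _, labels in tagged_notes:
        total += len(labels)
        phi += sum(1 for label in labels if label != "O")
    return phi / total if total else 0.0


# Hypothetical toy input: two domains, three notes.
notes = [
    ("cardiology", ["O", "O", "NAME", "O"]),
    ("cardiology", ["O", "DATE"]),
    ("psychiatry", ["O"] * 10),
]
by_domain = phi_density_by_domain(notes)
overall = overall_phi_density(notes)
```

The same grouping key could stand for a clinical specialty, a note heading, or a professional role, which are the three kinds of domain the abstract compares.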