School of Medicine and Health Management, Tongji Medical College, Huazhong University of Science and Technology, Hubei, China.
Dassault Systems, 175 Wyman St. Waltham, MA, 02451, USA.
Int J Med Inform. 2018 Aug;116:24-32. doi: 10.1016/j.ijmedinf.2018.05.010. Epub 2018 May 22.
With the growing use of electronic health records (EHRs) worldwide, protecting private information in clinical text has drawn extensive attention from healthcare providers and researchers alike. De-identification, the process of identifying and removing protected health information (PHI) from clinical text, has been central to the discourse on medical privacy since 2006. While de-identification is becoming the global norm for handling medical records, few studies have examined its application to Chinese clinical text. Without efficient and effective privacy protection algorithms in place, the use of indispensable clinical information would remain restricted.
We aimed to (i) describe the current process for handling PHI in China, (ii) propose a machine-learning-based approach to identify PHI in Chinese clinical text, and (iii) validate the effectiveness of that approach for de-identifying Chinese clinical text.
Using 14,719 discharge summaries from regional health centers in Ya'an City, Sichuan Province, China, we built a conditional random fields (CRF) model to identify PHI in clinical text, and then used regular expressions to improve recognition of the PHI categories with fewer samples.
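The two-stage idea described above (a statistical tagger followed by rule-based patterns for sparse categories) can be sketched roughly as follows. This is a minimal illustration, not the study's implementation: the patterns, the category labels, the helper names `regex_spans`, `merge_spans`, and `redact`, and the sample sentence are all assumptions, and the CRF output is represented only by a hand-written list of spans.

```python
import re

# Hypothetical patterns for PHI categories that have few training samples.
# These are illustrative assumptions, not the patterns used in the study.
PATTERNS = {
    # 11-digit mainland China mobile number
    "PHONE": re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)"),
    # 18-character resident identity number (last character may be X)
    "ID_NUMBER": re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)"),
}


def regex_spans(text):
    """Return sorted (start, end, label) spans matched by the rule patterns."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


def merge_spans(crf_spans, rule_spans):
    """Keep every CRF span; add rule spans that overlap no CRF span."""
    merged = list(crf_spans)
    for start, end, label in rule_spans:
        if all(end <= s or start >= e for s, e, _ in crf_spans):
            merged.append((start, end, label))
    return sorted(merged)


def redact(text, spans):
    """Replace each span with a category placeholder, working right to left
    so that earlier offsets stay valid."""
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text


if __name__ == "__main__":
    sample = "患者张三,联系电话13812345678,于2017年入院。"
    crf_output = [(2, 4, "NAME")]  # stand-in for spans predicted by a CRF model
    print(redact(sample, merge_spans(crf_output, regex_spans(sample))))
```

Giving the CRF spans priority and adding only non-overlapping rule matches is one possible merging policy; the abstract does not state how the two sources were combined.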
We constructed a Chinese clinical text corpus with PHI tags through substantial manual annotation; descriptive statistics of the corpus showed that PHI spans a wide range of diverse categories. In evaluation, our CRF-based model achieved an F-measure of 0.9878, indicating good performance in identifying PHI in Chinese clinical text.
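For readers unfamiliar with the metric, the F-measure is the harmonic mean of precision and recall. The sketch below computes it over exact-match entity spans; whether the study scored at the entity or token level is not stated in the abstract, so the exact-match criterion, the function name `f_measure`, and the sample spans are assumptions made for illustration.

```python
def f_measure(gold, predicted):
    """Precision, recall, and F-measure over exact-match (start, end, label) spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical annotations: two of three predicted spans match the gold set.
    gold = [(2, 4, "NAME"), (9, 20, "PHONE"), (30, 34, "DATE")]
    predicted = [(2, 4, "NAME"), (9, 20, "PHONE"), (40, 42, "NAME")]
    print(f_measure(gold, predicted))
```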
The rapid adoption of EHRs in the health sector has created an urgent need for tools that can parse patient-specific information from Chinese clinical text. Our application of the CRF algorithm to de-identification shows the potential to meet this need by offering a highly accurate and flexible solution for analyzing Chinese clinical text.