Intelligent System Lab, College of Electrical Engineering and Computer Science, Department of Electrical Engineering, National Kaohsiung University of Science and Technology, Kaohsiung, Taiwan R.O.C.
Department of Psychiatry, National Taiwan University Hospital, Taipei, Taiwan R.O.C.
Stud Health Technol Inform. 2022 Jun 6;290:627-631. doi: 10.3233/SHTI220153.
Electronic health records (EHRs) at medical institutions are a valuable source for clinical and biomedical research. However, before such records can be used for research, protected health information (PHI) mentioned in the unstructured text must be removed. In Taiwan's EHR systems, unstructured text is usually written in a mixture of English and Chinese, which makes de-identification challenging. This paper presents what is, to the best of our knowledge, the first study to construct a code-mixed EHR de-identification corpus and to evaluate several established entity recognition methods on the code-mixed PHI recognition task.