Pantazos Kostas, Lauesen Soren, Lippert Soren
Software Development Group, IT-University of Copenhagen, Denmark.
Stud Health Technol Inform. 2011;169:862-6.
Electronic health records (EHR) contain a large amount of structured data and free text. Exploring and sharing clinical data can improve healthcare and facilitate the development of medical software. However, revealing confidential information is against ethical principles and laws. We de-identified a Danish EHR database with 437,164 patients. The goal was to generate a version with real medical records, but related to artificial persons. We developed a de-identification algorithm that uses lists of named entities, simple language analysis, and special rules. Our algorithm consists of 3 steps: collect lists of identifiers from the database and external resources, define a replacement for each identifier, and replace identifiers in structured data and free text. Some patient records could not be safely de-identified, so the de-identified database has 323,122 patient records with an acceptable degree of anonymity, readability and correctness (F-measure of 95%). The algorithm has to be adjusted for each culture, language and database.
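The three steps named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: it covers only list-based matching and leaves out the language analysis and special rules the abstract mentions. The record fields (`first_name`, `last_name`, `note`), the function names, and the pool of artificial names are all hypothetical, chosen for illustration.

```python
import re


def collect_identifiers(rows, external_names):
    """Step 1: gather identifiers from structured fields and external lists."""
    identifiers = set(external_names)
    for row in rows:
        identifiers.add(row["first_name"])
        identifiers.add(row["last_name"])
    return identifiers


def define_replacements(identifiers, pool):
    """Step 2: assign an artificial replacement to every identifier.

    Assumption: replacements are drawn from a fixed pool in sorted order,
    cycling if the pool is shorter than the identifier list.
    """
    return {
        ident: pool[i % len(pool)]
        for i, ident in enumerate(sorted(identifiers))
    }


def replace_in_text(text, mapping):
    """Step 3 (free text): replace whole-word occurrences in a single pass."""
    if not mapping:
        return text
    # Longest identifiers first, so a longer name wins over a shorter prefix.
    alternatives = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(a) for a in alternatives) + r")\b"
    )
    return pattern.sub(lambda m: mapping[m.group(1)], text)


def deidentify(rows, external_names, pool):
    """Run all three steps over structured fields and free-text notes."""
    identifiers = collect_identifiers(rows, external_names)
    mapping = define_replacements(identifiers, pool)
    result = [
        {
            "first_name": mapping[row["first_name"]],
            "last_name": mapping[row["last_name"]],
            "note": replace_in_text(row["note"], mapping),
        }
        for row in rows
    ]
    return result, mapping


# Hypothetical example record and name lists, not taken from the study.
rows = [
    {
        "first_name": "Anna",
        "last_name": "Hansen",
        "note": "Anna Hansen was seen by Dr. Jensen.",
    }
]
deidentified, mapping = deidentify(rows, ["Jensen"], ["Alfa", "Bravo", "Charlie"])
print(deidentified[0]["note"])
```

Replacing in a single regular-expression pass avoids chained substitutions, where a replacement value would itself be matched and replaced again. Whole-word matching keeps a name such as "Anna" from being replaced inside "Annabel", but it would still miss misspellings and inflected forms, which is presumably where language analysis and special rules become necessary.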