Analytics Department & Data Factory, Institut de Cancérologie de l'Ouest, F-44805 Nantes-Angers, France.
Oncology Department, Institut de Cancérologie de l'Ouest, F-44805 Nantes-Angers, France.
Int J Environ Res Public Health. 2022 Apr 2;19(7):4272. doi: 10.3390/ijerph19074272.
Electronic Medical Records (EMR) and Electronic Health Records (EHR) often lack critical information about a patient's death, even though vital status is essential in oncology research for assessing survival outcomes, particularly when evaluating the efficacy of new therapeutic approaches. We used French open government data covering deaths from 1970 to September 2021 to identify deceased patients and match them with patient data collected between January 2015 and November 2021 in the data warehouse of the Institut de Cancérologie de l'Ouest (ICO; Integrated Center of Oncology, the third-largest cancer center in France). To this end, we evaluated two algorithms for deterministic record linkage: an exact matching algorithm and a fuzzy matching algorithm. Because no reference data were available, we assessed the algorithms by estimating, from the same open dataset of deceased persons in France, the number of homonyms that could lead to false links. The exact matching algorithm doubled the number of dates of death in the ICO data warehouse, and the fuzzy matching algorithm tripled it. The homonym analysis indicated a low risk of misidentification, with precision values of 99.96% for exact matching and 99.68% for fuzzy matching. Estimating the number of false negatives, however, proved more difficult than anticipated. Nevertheless, open government data can substantially improve the completeness of the date-of-death variable for oncology patients in data warehouses.
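The difference between the two linkage strategies can be illustrated with a minimal Python sketch. The abstract does not specify the matching keys, the normalization rules, or the similarity measure, so everything below is an assumption made for illustration: the field names (`last_name`, `first_name`, `birth_date`), the accent-stripping normalization, the use of `difflib.SequenceMatcher` as the similarity measure, and the 0.85 threshold are hypothetical and are not the authors' implementation.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Strip accents, uppercase, and keep letters only (illustrative rule)."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return "".join(c for c in no_accents.upper() if c.isalpha())


def exact_match(patient: dict, death: dict) -> bool:
    """Link only if normalized names and birth date are all identical."""
    return (
        normalize(patient["last_name"]) == normalize(death["last_name"])
        and normalize(patient["first_name"]) == normalize(death["first_name"])
        and patient["birth_date"] == death["birth_date"]
    )


def fuzzy_match(patient: dict, death: dict, threshold: float = 0.85) -> bool:
    """Require the same birth date, but tolerate small spelling differences.

    The threshold is a hypothetical value, not one taken from the study.
    """
    if patient["birth_date"] != death["birth_date"]:
        return False
    full_patient = normalize(patient["last_name"]) + normalize(patient["first_name"])
    full_death = normalize(death["last_name"]) + normalize(death["first_name"])
    return SequenceMatcher(None, full_patient, full_death).ratio() >= threshold


# Hypothetical records, invented for this example.
patient = {"last_name": "Lefèvre", "first_name": "Hélène", "birth_date": "1950-03-12"}
same_spelling = {"last_name": "LEFEVRE", "first_name": "HELENE", "birth_date": "1950-03-12"}
variant_spelling = {"last_name": "LEFEBVRE", "first_name": "HELENE", "birth_date": "1950-03-12"}

print(exact_match(patient, same_spelling))     # accents and case are normalized away
print(exact_match(patient, variant_spelling))  # one extra letter defeats exact matching
print(fuzzy_match(patient, variant_spelling))  # similarity stays above the threshold
```

In this sketch the fuzzy rule recovers a link that the exact rule misses, which is consistent with the higher yield reported for fuzzy matching. The cost is a greater exposure to homonyms, in line with its slightly lower reported precision.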