Department of Information Medicine, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Korea.
Department of Biomedical Systems Informatics, Yonsei University College of Medicine, Seoul, Korea.
J Korean Med Sci. 2022 Jul 4;37(26):e205. doi: 10.3346/jkms.2022.37.e205.
The advancement of information technology has immensely increased the quality and volume of health data. This has led to an increase in observational studies, as well as to a growing threat of privacy invasion. Recently, a distributed research network based on the common data model (CDM) has emerged, enabling collaborative international medical research without sharing patient-level data. Although the CDM database for each institution is built inside a firewall, the risk of re-identification still requires management. Hence, this study aims to elucidate the perceptions CDM users have of CDM and of risk management for re-identification.
The survey, designed to answer specific in-depth questions on CDM, was conducted from October to November 2020. We targeted experienced researchers who actively use CDM. Basic statistics (counts and percentages) were computed for all covariates.
There were 33 valid respondents. Of these, 43.8% considered additional anonymization unnecessary beyond the "minimum cell count" policy, which obscures any cell with a value lower than a certain number (usually 5) in shared results to minimize the risk of re-identification due to rare conditions. During extract-transform-load processes, 81.8% of respondents assumed that the re-identification risk of structured data was under control. However, respondents noted that dates of birth and death were highly re-identifiable information. The majority of respondents (n = 22, 66.7%) acknowledged that unstructured data in the table could contain identifiers.
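The "minimum cell count" policy described above can be illustrated with a minimal sketch. It assumes aggregate results are held as a simple mapping from a label to a count; the function name `mask_small_cells`, the labels, and the `"<5"` placeholder are hypothetical and are not taken from the study or from any CDM tool.

```python
# Assumed threshold: the abstract says the cutoff is "usually 5".
MIN_CELL_COUNT = 5


def mask_small_cells(counts, threshold=MIN_CELL_COUNT):
    """Return a copy of `counts` with every value below `threshold` obscured."""
    return {
        # Cells at or above the threshold are shared as-is; smaller cells are
        # replaced by a placeholder so that rare conditions are not exposed.
        label: (n if n >= threshold else f"<{threshold}")
        for label, n in counts.items()
    }


# Hypothetical aggregate counts prepared for sharing outside the institution.
shared = mask_small_cells({"condition_a": 120, "condition_b": 3, "condition_c": 5})
print(shared)  # {'condition_a': 120, 'condition_b': '<5', 'condition_c': 5}
```

In this sketch a cell equal to the threshold is kept, because the policy as described obscures only values lower than the threshold.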
Overall, CDM users generally regarded the intrinsic nature of CDM as providing highly reliable privacy protection. There was little demand for additional de-identification methods. However, unstructured data in the CDM were suspected of carrying risks. Respondents emphasized the need for a coordinating consortium to define and manage the re-identification risk of CDM.