Mawji Alishah, Longstaff Holly, Trawin Jessica, Dunsmuir Dustin, Komugisha Clare, Novakowski Stefanie K, Wiens Matthew O, Akech Samuel, Tagoola Abner, Kissoon Niranjan, Ansermino J Mark
Department of Anesthesiology, Pharmacology & Therapeutics, University of British Columbia, Vancouver, British Columbia, Canada.
Centre for International Child Health, BC Children's Hospital Research Institute, Vancouver, British Columbia, Canada.
PLOS Digit Health. 2022 Aug 24;1(8):e0000027. doi: 10.1371/journal.pdig.0000027. eCollection 2022 Aug.
Data sharing has enormous potential to accelerate and improve the accuracy of research, strengthen collaborations, and restore trust in the clinical research enterprise. Nevertheless, there remains reluctance to openly share raw data sets, in part due to concerns regarding research participant confidentiality and privacy. Statistical data de-identification is an approach that can be used to preserve privacy and facilitate open data sharing. We have proposed a standardized framework for the de-identification of data generated from cohort studies in children in a low- and middle-income country. We applied this standardized de-identification framework to a data set comprising 241 health-related variables collected from a cohort of 1750 children with acute infections from Jinja Regional Referral Hospital in Eastern Uganda. Variables were labeled as direct identifiers or quasi-identifiers based on conditions of replicability, distinguishability, and knowability, with consensus from two independent evaluators. Direct identifiers were removed from the data set, while a statistical risk-based de-identification approach using the k-anonymity model was applied to quasi-identifiers. Qualitative assessment of the level of privacy invasion associated with data set disclosure was used to determine an acceptable re-identification risk threshold and the corresponding k-anonymity requirement. A de-identification model using generalization followed by suppression was applied in a logical stepwise approach to achieve k-anonymity. The utility of the de-identified data was demonstrated using a typical clinical regression example. The de-identified data set was published on the Pediatric Sepsis Data CoLaboratory Dataverse, which provides moderated data access. Researchers are faced with many challenges when providing access to clinical data. We provide a standardized de-identification framework that can be adapted and refined based on specific context and risks.
This process will be combined with moderated access to foster coordination and collaboration in the clinical research community.
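The generalization-then-suppression route to k-anonymity described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The column names, the age bands, the 0.2 risk threshold, and the choice to suppress whole records (rather than individual cells) are all assumptions made for illustration. The sketch relies on the standard relationship that, under k-anonymity, the worst-case re-identification risk for any record is 1/k.

```python
import math
from collections import Counter


def required_k(max_risk):
    """Smallest k whose worst-case re-identification risk (1/k) is <= max_risk."""
    return math.ceil(1 / max_risk)


def generalize_age(age_months):
    """Hypothetical generalization rule: exact age in months -> coarse age band."""
    if age_months < 12:
        return "<1y"
    if age_months < 60:
        return "1-4y"
    return "5y+"


def k_anonymize(records, quasi_ids, generalizers, k):
    """Generalize quasi-identifiers, then suppress records in small classes.

    records      -- list of dicts, one per participant (direct identifiers
                    are assumed to have been removed already)
    quasi_ids    -- names of the quasi-identifier columns
    generalizers -- dict mapping a column name to a generalization function
    k            -- minimum equivalence-class size required
    """
    # Step 1: generalization. Replace precise values with coarser categories.
    generalized = []
    for rec in records:
        new_rec = dict(rec)
        for col, fn in generalizers.items():
            new_rec[col] = fn(rec[col])
        generalized.append(new_rec)

    # Step 2: count how many records share each quasi-identifier combination.
    def class_key(rec):
        return tuple(rec[col] for col in quasi_ids)

    class_sizes = Counter(class_key(rec) for rec in generalized)

    # Step 3: suppression. Drop records whose class is still smaller than k.
    return [rec for rec in generalized if class_sizes[class_key(rec)] >= k]


# Illustrative toy data only; these values are not from the study.
records = [
    {"age_months": 3, "sex": "F", "admitted": 1},
    {"age_months": 8, "sex": "F", "admitted": 0},
    {"age_months": 30, "sex": "M", "admitted": 1},
    {"age_months": 40, "sex": "M", "admitted": 0},
    {"age_months": 70, "sex": "F", "admitted": 1},
]

released = k_anonymize(
    records,
    quasi_ids=["age_months", "sex"],
    generalizers={"age_months": generalize_age},
    k=2,
)
print(len(released), "of", len(records), "records retained")
```

Generalizing first and suppressing only what remains below the threshold tends to retain more records than suppression alone, at the cost of coarser values. An assumed risk threshold maps to a class size through `required_k`; for example, a threshold of 0.2 corresponds to k = 5.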