Chen Fangyi, Cato Kenrick, Gürsoy Gamze, Dykes Patricia C, Lowenthal Graham, Rossetti Sarah
Department of Biomedical Informatics, Columbia University, New York, NY, United States.
School of Nursing, University of Pennsylvania, Philadelphia, PA, United States.
AMIA Annu Symp Proc. 2025 May 22;2024:262-270. eCollection 2024.
Making clinical datasets openly available is critical to promoting the reproducibility and transparency of scientific research. Currently, few such datasets are accessible to the public. To support the open science initiative, we plan to release the structured clinical datasets from the CONCERN study. In this paper, we present our de-identification approaches for structured data, taking into account the future inclusion of de-identified narrative notes and re-identification risks in the LLM era. Through literature review and collaborative consensus sessions, our team made informed decisions regarding the dataset release, weighing the pros and cons of each choice and outlining the limitations and biases introduced by the de-identification algorithm. To the best of our knowledge, this is the first study to describe the rationale behind de-identification decisions in the LLM era and to delineate the consequent problems that should be considered when using our dataset. We advocate for transparent disclosure of de-identification decisions and their associated limitations and biases with all openly available datasets.