Toward Identifying New Risk Aversions and Subsequent Limitations and Biases When Making De-identified Structured Data Sets Openly Available in a Post-LLM world.

Author information

Chen Fangyi, Cato Kenrick, Gürsoy Gamze, Dykes Patricia C, Lowenthal Graham, Rossetti Sarah

Affiliations

Department of Biomedical Informatics, Columbia University, New York, NY, United States.

School of Nursing, University of Pennsylvania, Philadelphia, PA, United States.

Publication information

AMIA Annu Symp Proc. 2025 May 22;2024:262-270. eCollection 2024.

PMID: 40417480
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12099381/

Abstract

Making clinical datasets openly available is critical to promoting the reproducibility and transparency of scientific research. Currently, few datasets are accessible to the public. To support the open science initiative, we plan to release the structured clinical datasets from the CONCERN study. In this paper, we present our de-identification approaches for structured data, considering the future inclusion of de-identified narrative notes and re-identification risks in the LLM era. Through literature review and collaborative consensus sessions, our team made informed decisions regarding dataset release, weighing the pros and cons of each choice and outlining the limitations and biases introduced by the de-identification algorithm. To the best of our knowledge, this is the first study describing the rationale for de-identification decisions in the LLM era and delineating the consequent problems that should be considered when using our dataset. We advocate for transparent disclosure of de-identification decisions and associated limitations and biases with all openly available datasets.
