Estimating the success of re-identifications in incomplete datasets using generative models.

Affiliations

Information and Communication Technologies, Electronics and Applied Mathematics (ICTEAM), Université catholique de Louvain, B-1348, Louvain-la-Neuve, Belgium.

Department of Computing, Imperial College London, London, SW7 2AZ, UK.

Publication information

Nat Commun. 2019 Jul 23;10(1):3069. doi: 10.1038/s41467-019-10933-3.

DOI: 10.1038/s41467-019-10933-3
PMID: 31337762
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6650473/

Abstract

While rich medical, behavioral, and socio-demographic data are key to modern data-driven research, their collection and use raise legitimate privacy concerns. Anonymizing datasets through de-identification and sampling before sharing them has been the main tool used to address those concerns. We here propose a generative copula-based method that can accurately estimate the likelihood of a specific person to be correctly re-identified, even in a heavily incomplete dataset. On 210 populations, our method obtains AUC scores for predicting individual uniqueness ranging from 0.84 to 0.97, with low false-discovery rate. Using our model, we find that 99.98% of Americans would be correctly re-identified in any dataset using 15 demographic attributes. Our results suggest that even heavily sampled anonymized datasets are unlikely to satisfy the modern standards for anonymization set forth by GDPR and seriously challenge the technical and legal adequacy of the de-identification release-and-forget model.
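As a rough illustration of what "individual uniqueness" means in the abstract above, the sketch below counts how many records in a synthetic population are unique on a growing set of attributes, and evaluates the elementary probability that a record with population frequency p is shared by none of the other n - 1 people when individuals are drawn independently. This is not the paper's copula-based model: the attribute names, value ranges, population size, and the independence assumption are all hypothetical choices made only for this example.

```python
import random
from collections import Counter


def empirical_uniqueness(records):
    """Share of records whose attribute tuple occurs exactly once."""
    counts = Counter(records)
    return sum(1 for r in records if counts[r] == 1) / len(records)


def uniqueness_likelihood(p_x, n):
    """Chance that none of the other n - 1 i.i.d. people share record x."""
    return (1.0 - p_x) ** (n - 1)


def synthetic_population(n, seed=0):
    """Hypothetical people: (birth-year offset, sex code, area code)."""
    rng = random.Random(seed)
    return [(rng.randrange(80), rng.randrange(2), rng.randrange(500))
            for _ in range(n)]


population = synthetic_population(20_000)

# Uniqueness can only grow as more attributes are revealed, because a
# record that is unique on a subset stays unique on any superset.
for k in (1, 2, 3):
    projected = [person[:k] for person in population]
    print(k, "attributes:", round(empirical_uniqueness(projected), 3))

# Assumed example: a record held by 1 in 80,000 people, population of 20,000.
print(round(uniqueness_likelihood(1 / 80_000, 20_000), 3))
```

Counting unique records this way only works when the whole population is observed. The abstract's point is that a generative model is needed to make the same kind of estimate from a heavily sampled, incomplete dataset.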

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b32c/6650473/a68521cd5d29/41467_2019_10933_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b32c/6650473/a9c435480c08/41467_2019_10933_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b32c/6650473/1f947bb292b2/41467_2019_10933_Fig3_HTML.jpg
