Reidentification of Participants in Shared Clinical Data Sets: Experimental Study.

Authors

Wiepert Daniela, Malin Bradley A, Duffy Joseph R, Utianski Rene L, Stricker John L, Jones David T, Botha Hugo

Affiliations

Department of Neurology, Mayo Clinic, Rochester, MN, United States.

Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States.

Publication information

JMIR AI. 2024 Mar 15;3:e52054. doi: 10.2196/52054.

DOI:10.2196/52054
PMID:38875581
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11041495/
Abstract

BACKGROUND

Large curated data sets are required to leverage speech-based tools in health care. These are costly to produce, resulting in increased interest in data sharing. As speech can potentially identify speakers (ie, voiceprints), sharing recordings raises privacy concerns. This is especially relevant when working with patient data protected under the Health Insurance Portability and Accountability Act.

OBJECTIVE

We aimed to determine the reidentification risk for speech recordings, without reference to demographics or metadata, in clinical data sets considering both the size of the search space (ie, the number of comparisons that must be considered when reidentifying) and the nature of the speech recording (ie, the type of speech task).

METHODS

Using a state-of-the-art speaker identification model, we modeled an adversarial attack scenario in which an adversary uses a large data set of identified speech (hereafter, the known set) to reidentify as many unknown speakers in a shared data set (hereafter, the unknown set) as possible. We first considered the effect of search space size by attempting reidentification with various sizes of known and unknown sets using VoxCeleb, a data set with recordings of natural, connected speech from >7000 healthy speakers. We then repeated these tests with different types of recordings in each set to examine whether the nature of a speech recording influences reidentification risk. For these tests, we used our clinical data set composed of recordings of elicited speech tasks from 941 speakers.
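
The attack scenario described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the toy embedding vectors, the cosine-similarity scoring, the threshold value, and the names `cosine` and `count_acceptances` are all assumptions made for this example. The point it shows is that every unknown recording is compared with every known speaker, so the search space is the product of the two set sizes.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def count_acceptances(known, unknown, threshold):
    """Compare every unknown embedding with every known embedding.

    known, unknown: dicts mapping a speaker ID to an embedding (list of floats).
    Returns (true_acceptances, false_acceptances, comparisons).
    """
    ta = fa = comparisons = 0
    for unk_id, unk_vec in unknown.items():
        for known_id, known_vec in known.items():
            comparisons += 1
            if cosine(unk_vec, known_vec) >= threshold:
                if unk_id == known_id:
                    ta += 1  # accepted pair really is the same speaker
                else:
                    fa += 1  # accepted pair is two different speakers
    return ta, fa, comparisons


# Toy embeddings (hypothetical values, for illustration only).
known = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [0.9, 0.1]}
unknown = {"A": [0.95, 0.05], "B": [0.1, 0.9]}

ta, fa, n = count_acceptances(known, unknown, threshold=0.98)
print(ta, fa, n)
```

In this toy setup, speaker C sits close to speaker A in embedding space, so enlarging the known set adds a chance of a false acceptance without adding any new true matches.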

RESULTS

We found that the risk was inversely related to the number of comparisons an adversary must consider (ie, the search space), with a positive linear correlation between the number of false acceptances (FAs) and the number of comparisons (r=0.69; P<.001). The true acceptances (TAs) stayed relatively stable, and the ratio between FAs and TAs rose from 0.02 at 1 × 10^5 comparisons to 1.41 at 6 × 10^6 comparisons, with a near 1:1 ratio at the midpoint of 3 × 10^6 comparisons. In effect, risk was high for a small search space but dropped as the search space grew. We also found that the nature of a speech recording influenced reidentification risk, with nonconnected speech (eg, vowel prolongation: FA/TA=98.5; alternating motion rate: FA/TA=8) being harder to identify than connected speech (eg, sentence repetition: FA/TA=0.54) in cross-task conditions. The inverse was mostly true in within-task conditions, with the FA/TA ratio for vowel prolongation and alternating motion rate dropping to 0.39 and 1.17, respectively.
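
One way to read these ratios, offered as a rough sketch rather than the authors' analysis: if each non-matching comparison carries a small, fixed chance of a false acceptance, the expected FA count grows in proportion to the number of comparisons, while the TA count is capped by the number of unknown speakers who actually appear in the known set. The false-accept rate and TA count below are hypothetical values chosen for illustration, not figures from the study, and the function names are assumptions.

```python
def fa_ta_ratio(comparisons, false_accept_rate, true_acceptances):
    """Expected FA/TA ratio under a constant per-comparison false-accept rate."""
    expected_fa = comparisons * false_accept_rate
    return expected_fa / true_acceptances


def precision(ratio):
    """Share of accepted matches that are correct, given an FA/TA ratio.

    precision = TA / (TA + FA) = 1 / (1 + FA/TA)
    """
    return 1.0 / (1.0 + ratio)


# Hypothetical inputs: 50 true acceptances and one false acceptance
# per 100,000 comparisons.
for n in (100_000, 1_000_000, 5_000_000):
    r = fa_ta_ratio(n, 1e-5, 50)
    print(n, round(r, 2), round(precision(r), 2))
```

Applying the same arithmetic to the reported ratios, an FA/TA of 0.02 means about 98% of accepted matches are correct, whereas an FA/TA of 1.41 means only about 41% are.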

CONCLUSIONS

Our findings suggest that speaker identification models can be used to reidentify participants in specific circumstances, but in practice, the reidentification risk appears small. The variation in risk due to search space size and type of speech task provides actionable recommendations to further increase participant privacy and considerations for policy regarding public release of speech recordings.
