Reidentification of Participants in Shared Clinical Data Sets: Experimental Study.

Authors

Wiepert Daniela, Malin Bradley A, Duffy Joseph R, Utianski Rene L, Stricker John L, Jones David T, Botha Hugo

Affiliations

Department of Neurology, Mayo Clinic, Rochester, MN, United States.

Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States.

Publication information

JMIR AI. 2024 Mar 15;3:e52054. doi: 10.2196/52054.

Abstract

BACKGROUND

Large curated data sets are required to leverage speech-based tools in health care. These are costly to produce, resulting in increased interest in data sharing. As speech can potentially identify speakers (ie, voiceprints), sharing recordings raises privacy concerns. This is especially relevant when working with patient data protected under the Health Insurance Portability and Accountability Act.

OBJECTIVE

We aimed to determine the reidentification risk for speech recordings, without reference to demographics or metadata, in clinical data sets considering both the size of the search space (ie, the number of comparisons that must be considered when reidentifying) and the nature of the speech recording (ie, the type of speech task).

METHODS

Using a state-of-the-art speaker identification model, we modeled an adversarial attack scenario in which an adversary uses a large data set of identified speech (hereafter, the known set) to reidentify as many unknown speakers in a shared data set (hereafter, the unknown set) as possible. We first considered the effect of search space size by attempting reidentification with various sizes of known and unknown sets using VoxCeleb, a data set with recordings of natural, connected speech from >7000 healthy speakers. We then repeated these tests with different types of recordings in each set to examine whether the nature of a speech recording influences reidentification risk. For these tests, we used our clinical data set composed of recordings of elicited speech tasks from 941 speakers.
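The attack scenario described above can be sketched in a few lines. This is a minimal illustration, not the study's pipeline: the abstract does not name the speaker identification model or its scoring rule, so the cosine similarity score, the fixed acceptance threshold, the two-dimensional toy embeddings, and the function names `cosine` and `attack` are all assumptions made for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def attack(known, unknown, threshold):
    """Compare every unknown speaker with every known speaker.

    known, unknown: dict mapping speaker id -> embedding (toy data here).
    Returns (comparisons, true_accepts, false_accepts). The unknown ids are
    only used to score the attack; an adversary would not have them.
    """
    comparisons = true_accepts = false_accepts = 0
    for u_id, u_vec in unknown.items():
        for k_id, k_vec in known.items():
            comparisons += 1
            if cosine(u_vec, k_vec) >= threshold:  # assumed acceptance rule
                if u_id == k_id:
                    true_accepts += 1
                else:
                    false_accepts += 1
    return comparisons, true_accepts, false_accepts

# Hypothetical embeddings: "carol" sounds like "alice"; "dave" is not in the known set.
known = {"alice": [1.0, 0.0], "bob": [0.0, 1.0], "carol": [0.9, 0.1]}
unknown = {"alice": [0.98, 0.05], "dave": [0.1, 0.95]}
result = attack(known, unknown, threshold=0.9)
```

The search space in this sketch is the product of the two set sizes (3 known × 2 unknown = 6 comparisons), which is why enlarging either set gives the adversary more candidate matches to sift through.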

RESULTS

We found that the risk was inversely related to the number of comparisons an adversary must consider (ie, the search space), with a positive linear correlation between the number of false acceptances (FAs) and the number of comparisons (r=0.69; P<.001). The true acceptances (TAs) stayed relatively stable, and the ratio between FAs and TAs rose from 0.02 at 1 × 10 comparisons to 1.41 at 6 × 10 comparisons, with a near 1:1 ratio at the midpoint of 3 × 10 comparisons. In effect, risk was high for a small search space but dropped as the search space grew. We also found that the nature of a speech recording influenced reidentification risk, with nonconnected speech (eg, vowel prolongation: FA/TA=98.5; alternating motion rate: FA/TA=8) being harder to identify than connected speech (eg, sentence repetition: FA/TA=0.54) in cross-task conditions. The inverse was mostly true in within-task conditions, with the FA/TA ratio for vowel prolongation and alternating motion rate dropping to 0.39 and 1.17, respectively.
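One way to see why the FA/TA ratio climbs with the search space is a simple expected-count model. This is a sketch under stated assumptions, not the authors' analysis: it assumes a constant per-comparison false acceptance rate (`far`) and true acceptance rate (`tar`), and the set sizes and rates below are hypothetical values chosen only for illustration.

```python
def fa_ta_ratio(false_accepts, true_accepts):
    """Ratio of false to true acceptances; infinite when nothing is truly matched."""
    if true_accepts == 0:
        return float("inf")
    return false_accepts / true_accepts

def expected_counts(n_known, n_unknown, n_overlap, far, tar):
    """Expected (comparisons, FAs, TAs) for an exhaustive known-vs-unknown search.

    n_overlap: speakers present in both sets (the only possible true matches).
    far, tar:  assumed per-comparison false and true acceptance rates.
    """
    comparisons = n_known * n_unknown
    impostor_pairs = comparisons - n_overlap
    return comparisons, far * impostor_pairs, tar * n_overlap

# Hypothetical setting: 100 unknown speakers, all of whom are also in the known set.
ratios = []
for n_known in (1_000, 10_000, 60_000):
    _, fa, ta = expected_counts(n_known, 100, n_overlap=100, far=1e-5, tar=0.9)
    ratios.append(fa_ta_ratio(fa, ta))
```

In this model the expected true acceptances are capped by the number of overlapping speakers, while the expected false acceptances scale with the number of impostor pairs. The ratio therefore grows roughly linearly with the number of comparisons, and an adversary facing a larger search space has a harder time telling true matches from false ones.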

CONCLUSIONS

Our findings suggest that speaker identification models can be used to reidentify participants in specific circumstances, but in practice, the reidentification risk appears small. The variation in risk due to search space size and type of speech task provides actionable recommendations to further increase participant privacy and considerations for policy regarding public release of speech recordings.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/65ba/11041495/2358b51bc8eb/ai_v3i1e52054_fig1.jpg
