Tournier Arnaud J, de Montjoye Yves-Alexandre
Department of Computing, Imperial College London, London, UK.
Data Science Institute, Imperial College London, London, UK.
Sci Adv. 2022 Aug 19;8(33):eabl6464. doi: 10.1126/sciadv.abl6464.
Behavioral data, collected from our daily interactions with technology, have driven scientific advances. Yet the collection and sharing of these data raise legitimate privacy concerns, as individuals can often be reidentified. Current identification attacks, however, require auxiliary information to roughly match the information available in the dataset, limiting their applicability. Here, we propose an entropy-based profiling model to learn time-persistent profiles. Using auxiliary information about a single target collected over a nonoverlapping time period, we show that individuals are correctly identified 79% of the time in a large location dataset of 0.5 million individuals and 65.2% of the time in a grocery shopping dataset of 85,000 individuals. We further show that accuracy decreases only slowly over time and that the model is robust to state-of-the-art noise addition. Our results show that much more auxiliary information than previously believed can be used to identify individuals, challenging deidentification practices and what currently constitutes legally anonymous data.