Utility-driven assessment of anonymized data via clustering.

Affiliations

Universidade da Beira Interior, Covilha, Portugal and CEMAPRE, Lisboa, Portugal.

Universidade da Beira Interior and Instituto de Telecomunicações (IT-UBI), Covilha, Portugal.

Publication information

Sci Data. 2022 Jul 30;9(1):456. doi: 10.1038/s41597-022-01561-6.

DOI:10.1038/s41597-022-01561-6

PMID:35907927

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC9339002/

Abstract

In this study, clustering is conceived as an auxiliary tool for identifying groups of special interest. The approach was applied to a real dataset covering an entire Portuguese cohort of higher education Law students. Several anonymized clustering scenarios were compared against the original cluster solution. Clustering techniques were explored as data utility models in the context of data anonymization, with k-anonymity and (ε, δ)-differential privacy as privacy models. The purpose was to assess the utility of anonymized data through standard metrics, the characteristics of the groups obtained, and the relative risk (a relevant metric in social sciences research). For self-containment, we present an overview of anonymization and clustering methods. We used a partitional clustering algorithm and analyzed several clustering validity indices to understand to what extent the data structure is preserved after anonymization. The results suggest that, for datasets of low dimensionality and cardinality, the anonymization procedure easily jeopardizes the clustering endeavor. In addition, there is evidence that relevant field-of-study estimates obtained from anonymized data are biased.
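As a rough illustration of two notions named in the abstract, the sketch below checks whether a small table satisfies k-anonymity over a set of quasi-identifiers and computes a relative risk between two groups. It is not the authors' code or data: the records, the field names (`age_band`, `region`, `dropout`), and both helper functions are hypothetical and chosen only for the example.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    is shared by at least k records."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(c >= k for c in counts.values())


def relative_risk(exposed_events, exposed_total, unexposed_events, unexposed_total):
    """Risk of the outcome in the exposed group divided by the risk
    in the unexposed group."""
    return (exposed_events / exposed_total) / (unexposed_events / unexposed_total)


# Hypothetical, already generalized toy records (not the paper's dataset).
records = [
    {"age_band": "18-20", "region": "North", "dropout": 1},
    {"age_band": "18-20", "region": "North", "dropout": 0},
    {"age_band": "21-25", "region": "South", "dropout": 1},
    {"age_band": "21-25", "region": "South", "dropout": 1},
]

quasi_identifiers = ["age_band", "region"]
k2 = is_k_anonymous(records, quasi_identifiers, 2)  # each group has 2 records
k3 = is_k_anonymous(records, quasi_identifiers, 3)  # no group has 3 records

# Treat the older age band as the "exposed" group for this illustration.
exposed = [r for r in records if r["age_band"] == "21-25"]
unexposed = [r for r in records if r["age_band"] == "18-20"]
rr = relative_risk(
    sum(r["dropout"] for r in exposed), len(exposed),
    sum(r["dropout"] for r in unexposed), len(unexposed),
)

print(k2, k3, rr)
```

Computing the same relative risk on the original and on the anonymized version of a table is one simple way to see whether anonymization has shifted an estimate, which is the kind of bias the abstract reports.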
