Manrique-Vallier Daniel, Reiter Jerome P
Postdoctoral Associate at the Social Science Research Institute and the Department of Statistical Science, Duke University, Durham, NC 27708-0251.
Mrs. Alexander Hehmeyer Associate Professor of Statistical Science, Duke University, Durham, NC 27708-0251.
J Am Stat Assoc. 2012 Dec 1;107(500):1385-1394. doi: 10.1080/01621459.2012.710508.
Statistical agencies and other organizations that disseminate data are obligated to protect data subjects' confidentiality. For example, ill-intentioned individuals might link data subjects to records in other databases by matching on common characteristics (keys). Successful links are particularly problematic for data subjects with combinations of keys that are unique in the population. Hence, as part of their assessments of disclosure risks, many data stewards estimate the probabilities that sample uniques on sets of discrete keys are also population uniques on those keys. This is typically done using log-linear modeling on the keys. However, log-linear models can yield biased estimates of cell probabilities for sparse contingency tables with many zero counts, which often occurs in databases with many keys. This bias can result in unreliable estimates of probabilities of uniqueness and, hence, misrepresentations of disclosure risks. We propose an alternative to log-linear models for datasets with sparse keys based on a Bayesian version of grade of membership (GoM) models. We present a Bayesian GoM model for multinomial variables and offer an MCMC algorithm for fitting the model. We evaluate the approach by treating data from a recent US Census Bureau public use microdata sample as a population, taking simple random samples from that population, and benchmarking estimated probabilities of uniqueness against population values. Compared to log-linear models, GoM models provide more accurate estimates of the total number of uniques in the samples. Additionally, they offer record-level predictions of uniqueness that dominate those based on log-linear models.
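The benchmark quantity described above, the number of sample uniques that are also population uniques, can be illustrated with a minimal sketch. It covers only the counting step used for evaluation, not the GoM or log-linear models themselves. The toy population, the key values, and the function name `count_uniques` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def count_uniques(population, sample_indices):
    """Count sample uniques and how many of them are also population uniques.

    population     : list of tuples, each tuple holding one record's discrete keys
    sample_indices : indices of the records drawn into the sample
    Returns (number of sample uniques, number that are also population uniques).
    """
    pop_counts = Counter(population)
    sample_counts = Counter(population[i] for i in sample_indices)

    # A key combination is a sample unique if it appears exactly once in the sample.
    sample_unique_keys = [k for k, c in sample_counts.items() if c == 1]

    # It is also a population unique if it appears exactly once in the population.
    also_pop_unique = sum(1 for k in sample_unique_keys if pop_counts[k] == 1)
    return len(sample_unique_keys), also_pop_unique


# Hypothetical toy population: (sex, age band, state) for six records.
population = [
    ("F", "30-39", "NC"),
    ("F", "30-39", "NC"),
    ("M", "30-39", "NC"),
    ("M", "40-49", "VA"),
    ("F", "40-49", "VA"),
    ("F", "40-49", "VA"),
]

# A fixed sample of three records, standing in for a simple random sample.
n_sample_unique, n_also_pop_unique = count_uniques(population, [0, 2, 4])
print(n_sample_unique, n_also_pop_unique)
```

In this toy case all three sampled records are unique in the sample, but only one of them is unique in the population. In practice the population counts are unknown, which is why the probability of population uniqueness has to be estimated from a model fitted to the sample.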