Children's Hospital of Eastern Ontario Research Institute, 401 Smyth Road, Ottawa, Ontario K1J 8L1, Canada.
BMC Med Inform Decis Mak. 2010 Apr 2;10:18. doi: 10.1186/1472-6947-10-18.
A common disclosure control practice for health datasets is to identify small geographic areas and either suppress records from these small areas or aggregate them into larger ones. A recent study provided a method for deciding when an area is too small based on the uniqueness criterion. The uniqueness criterion stipulates that an area is no longer too small when the proportion of individuals who are unique on the relevant variables (the quasi-identifiers) approaches zero. However, a uniqueness value of zero is a stringent threshold, and is only suitable when the risks from data disclosure are quite high. Other uniqueness thresholds that have been proposed for health data are 5% and 20%.
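The uniqueness criterion can be illustrated with a minimal sketch. It assumes records are held as dictionaries, and the function names, field names, and sample values are hypothetical and chosen only for illustration; it is not the estimation procedure used in the study.

```python
from collections import Counter


def uniqueness(records, quasi_identifiers):
    """Proportion of records that are unique on the quasi-identifiers."""
    if not records:
        return 0.0
    # One key per record: its combination of quasi-identifier values.
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(records)


def is_too_small(records, quasi_identifiers, threshold):
    """Flag an area as too small when uniqueness exceeds the threshold."""
    return uniqueness(records, quasi_identifiers) > threshold


# Hypothetical records for a single area.
area = [
    {"sex": "F", "age_group": "30-39"},
    {"sex": "F", "age_group": "30-39"},
    {"sex": "M", "age_group": "30-39"},
    {"sex": "M", "age_group": "40-49"},
    {"sex": "F", "age_group": "30-39"},
]
u = uniqueness(area, ["sex", "age_group"])
```

In this made-up area two of the five records are unique, so the area would be flagged under the 0%, 5%, and 20% thresholds alike.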
We estimated uniqueness for urban Forward Sortation Areas (FSAs) by using the 2001 long form Canadian census data representing 20% of the population. We then constructed two logistic regression models to predict when the uniqueness is greater than the 5% and 20% thresholds, and validated their predictive accuracy using 10-fold cross-validation. Predictor variables included the population size of the FSA and the maximum number of possible values on the quasi-identifiers (the number of equivalence classes).
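Two pieces of this setup can be sketched in a few lines. The sketch assumes that the number of equivalence classes is the product of the number of possible values of each quasi-identifier, which is an interpretation rather than something the abstract states, and it shows one simple way to form 10 folds. The function names and the example level counts are hypothetical.

```python
from math import prod


def max_equivalence_classes(levels):
    """Upper bound on distinct quasi-identifier combinations.

    `levels` maps each quasi-identifier to its number of possible values.
    Assumption: the bound is the product of those counts.
    """
    return prod(levels.values())


def k_fold_indices(n, k=10):
    """Yield (train, test) index lists for k-fold cross-validation."""
    # Deal indices round-robin into k folds; each fold is the test set once.
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test


classes = max_equivalence_classes({"sex": 2, "age_group": 10})
splits = list(k_fold_indices(25, k=10))
```

In practice the observations would usually be shuffled before being dealt into folds; that step is omitted here to keep the sketch deterministic.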
All model parameters were significant and the models had very high prediction accuracy, with specificity above 0.9, and sensitivity at 0.87 and 0.74 for the 5% and 20% threshold models respectively. The application of the models was illustrated with an analysis of the Ontario newborn registry and an emergency department dataset. At the 5% and 20% thresholds, considerably fewer records than at the 0% threshold would be considered to be in small areas and would therefore undergo disclosure control actions. We have also included concrete guidance for data custodians in deciding which of the three uniqueness thresholds to use (0%, 5%, 20%), depending on the mitigating controls that the data recipients have in place, the potential invasion of privacy if the data are disclosed, and the motives and capacity of the data recipient to re-identify the data.
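For readers less familiar with the two accuracy measures, the following sketch shows how sensitivity and specificity are computed from paired actual and predicted labels. The labels are invented for illustration and are not the study's data; here `True` stands for "uniqueness exceeds the threshold".

```python
def sensitivity_specificity(actual, predicted):
    """Return (sensitivity, specificity) for paired boolean labels."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    tn = sum(1 for a, p in zip(actual, predicted) if not a and not p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    # Sensitivity: share of truly small areas that are flagged.
    # Specificity: share of not-small areas that are correctly left alone.
    return tp / (tp + fn), tn / (tn + fp)


# Invented labels for eight areas.
actual = [True, True, True, False, False, False, False, False]
predicted = [True, True, False, False, False, False, False, True]
sens, spec = sensitivity_specificity(actual, predicted)
```

A lower sensitivity, as reported for the 20% model, means that more areas exceeding the threshold would be missed by the prediction.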
The models we developed can be used to manage the re-identification risk from small geographic areas. Being able to choose among three possible thresholds, a data custodian can adjust the definition of "small geographic area" to the nature of the data and recipient.