Department of Human Development and Family Studies, The Pennsylvania State University, University Park, PA 16802;
Department of Public Health, University of Texas at San Antonio, San Antonio, TX 78249.
Proc Natl Acad Sci U S A. 2020 Jun 16;117(24):13405-13412. doi: 10.1073/pnas.2003714117. Epub 2020 May 28.
The application of a currently proposed differential privacy algorithm to the 2020 United States Census data and additional data products may affect the usefulness of these data, the accuracy of estimates and rates derived from them, and critical knowledge about social phenomena such as health disparities. We test the ramifications of applying differential privacy to released data by studying estimates of US mortality rates for the overall population and three major racial/ethnic groups. We ask how changes in the denominators of these vital rates due to the implementation of differential privacy can lead to biased estimates. We identify where these changes are most likely to matter by disaggregating the biases by population size, degree of urbanization, and adjacency to a metropolitan area. Our results suggest that differential privacy will affect mortality rate estimates for non-Hispanic blacks and Hispanics more strongly than estimates for non-Hispanic whites. We also find significant changes in estimated mortality rates for less populous areas, with more pronounced changes when the rates are stratified by race/ethnicity. We find larger changes in estimated mortality rates for areas with lower levels of urbanization or adjacency to metropolitan areas, and these changes are greater for non-Hispanic blacks and Hispanics. These findings highlight the consequences of implementing differential privacy, as proposed, for research examining population composition, particularly mortality disparities across racial/ethnic groups and along the urban/rural continuum. Overall, they demonstrate the challenges of using data products derived from the proposed disclosure avoidance methods and highlight critical instances where scientific understanding may be harmed.
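The denominator effect described in the abstract can be illustrated with a minimal sketch. It is not the Census Bureau's proposed algorithm and not the authors' method: it assumes plain Laplace noise added to a population count, with a hypothetical noise scale of 10 persons, and it leaves the death count untouched. The function names, populations, and the 1% death rate are all illustrative assumptions.

```python
import random
import statistics


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponential draws with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def median_abs_rate_error(population, deaths, scale, trials=5000, seed=0):
    """Median absolute percent error in deaths / population when only the
    denominator (the population count) is perturbed."""
    rng = random.Random(seed)
    true_rate = deaths / population
    errors = []
    for _ in range(trials):
        # Assumption: noisy counts are floored at 1 to keep the rate defined.
        noisy_pop = max(1.0, population + laplace_noise(scale, rng))
        noisy_rate = deaths / noisy_pop
        errors.append(abs(noisy_rate - true_rate) / true_rate * 100)
    return statistics.median(errors)


if __name__ == "__main__":
    # Hypothetical areas that share the same true mortality rate (1%).
    for pop in (200, 2_000, 200_000):
        err = median_abs_rate_error(pop, deaths=pop * 0.01, scale=10.0)
        print(f"population {pop:>7}: median error {err:.3f}%")
```

Because the noise scale in this sketch does not depend on the size of the count, the same absolute perturbation is a much larger share of a small denominator. That is one plausible reading of why the abstract reports larger changes for less populous areas and for rates stratified by race/ethnicity, where each cell holds fewer people.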