Dong Evan, Schein Aaron, Wang Yixin, Garg Nikhil
Department of Computer Science, Cornell University, Ithaca, NY 14853, USA.
Department of Statistics and Data Science Institute, University of Chicago, Chicago, IL 60637, USA.
PNAS Nexus. 2025 Jan 30;4(2):pgaf027. doi: 10.1093/pnasnexus/pgaf027. eCollection 2025 Feb.
Racial and other demographic imputation is necessary for many applications, especially in auditing disparities and outreach targeting in political campaigns. The canonical approach is to construct continuous predictions (e.g. based on name and geography) and then, often, to discretize the predictions by selecting the most likely class (argmax), potentially with a minimum threshold (thresholding). We study how this practice produces discretization bias. For example, we show that argmax labeling, as used by a prominent commercial voter file vendor to impute race/ethnicity, results in a substantial under-count of Black voters, e.g. by 28.2 percentage points in North Carolina. This bias can have substantial implications in downstream tasks that use such labels. We then introduce an approach, along with a tractable heuristic, that can eliminate this bias with negligible individual-level accuracy loss. Finally, we theoretically analyze discretization bias and show that calibrated continuous models are insufficient to eliminate it and that an approach such as ours is necessary. Broadly, we warn researchers and practitioners against discretizing continuous demographic predictions without considering downstream consequences.
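The under-count described in the abstract can be illustrated with a small numerical sketch. It is not the authors' method or data: the score buckets are made-up values, the two-class setup (group B versus everyone else) is a simplifying assumption, and `count_matched_labels` is a hypothetical count-matching rule included only to show that labels can be chosen to agree with the aggregate implied by the scores. With two classes, argmax is the same as labeling B whenever the score exceeds 0.5.

```python
# Made-up calibrated scores: P(person belongs to group B).
# Each pair is (probability, number of people with that probability).
SCORE_BUCKETS = [(0.1, 600), (0.4, 300), (0.9, 100)]


def expand(buckets):
    """Turn (probability, count) pairs into one score per person."""
    return [p for p, n in buckets for _ in range(n)]


def expected_share(scores):
    """Share of group B implied by the scores, assuming they are calibrated."""
    return sum(scores) / len(scores)


def argmax_share(scores, threshold=0.5):
    """Share labeled B when each person gets the most likely of two classes."""
    return sum(1 for p in scores if p > threshold) / len(scores)


def count_matched_labels(scores):
    """Hypothetical rule: label the k highest-scoring people as B, where k is
    the expected number of B members. Ties are broken by list order."""
    k = round(sum(scores))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    labels = [False] * len(scores)
    for i in order[:k]:
        labels[i] = True
    return labels


scores = expand(SCORE_BUCKETS)
matched = count_matched_labels(scores)

print(f"expected share of B:      {expected_share(scores):.2f}")
print(f"argmax-labeled share:     {argmax_share(scores):.2f}")
print(f"count-matched share of B: {sum(matched) / len(matched):.2f}")
```

In this toy population the scores imply that about 27% of people belong to group B, yet argmax labels only the 10% whose score exceeds 0.5, because the many people with moderate scores are all assigned to the other class. The count-matched rule recovers the 27% aggregate, at the cost of labeling some people whose individual score is below 0.5.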