Ruhnke Simon A, Wilson Fernando A, Stimpson Jim P
Berliner Institut für empirische Integrations- und Migrationsforschung/BIM, Berlin, Germany.
University of Utah, Matheson Center for Health Care Studies, Salt Lake City, UT.
MethodsX. 2022 Sep 8;9:101848. doi: 10.1016/j.mex.2022.101848. eCollection 2022.
We describe a novel machine learning method for imputing the legal status of immigrants using nationally representative survey data from the Survey of Income and Program Participation (SIPP) and the National Health Interview Survey (NHIS). Two machine learning classifiers, K-nearest neighbor (KNN) and Random Forest (RF), are presented as novel imputation methods and compared with established regression-based imputation. When the imputation methods were validated using sensitivity, specificity, positive predictive value (PPV), and accuracy, the Random Forest algorithm identified undocumented immigrants more accurately than regression-based imputation and KNN. It also minimized bias, relative to those two methods, both in the socio-demographic variables included in the imputation and in unobserved health variables.
•We developed a new machine learning method of imputing legal status for immigrants that can be used with nationally representative, publicly available data.
•Our findings indicate that using machine learning to impute the legal status of immigrants, specifically the Random Forest algorithm, was more accurate in identifying undocumented immigrants and minimized bias relative to other imputation methods.
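The four validation statistics named in the abstract can all be derived from a two-by-two confusion table of observed versus imputed status. The sketch below is illustrative only and is not the authors' code: the function name `validation_metrics`, the `ValidationMetrics` container, and the toy labels are assumptions, with 1 standing for "undocumented" and 0 for any other status.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ValidationMetrics:
    sensitivity: float  # share of truly undocumented cases imputed as undocumented
    specificity: float  # share of other cases imputed as not undocumented
    ppv: float          # share of imputed-undocumented cases that are truly undocumented
    accuracy: float     # share of all cases imputed correctly


def _ratio(numerator: int, denominator: int) -> float:
    # Return NaN rather than raising when a category is empty.
    return numerator / denominator if denominator else float("nan")


def validation_metrics(observed: Sequence[int], imputed: Sequence[int]) -> ValidationMetrics:
    """Compare imputed status (1 = undocumented, 0 = other) with observed status."""
    if len(observed) != len(imputed):
        raise ValueError("observed and imputed must have the same length")

    pairs = list(zip(observed, imputed))
    tp = sum(1 for o, i in pairs if o == 1 and i == 1)  # true positives
    fn = sum(1 for o, i in pairs if o == 1 and i == 0)  # false negatives
    fp = sum(1 for o, i in pairs if o == 0 and i == 1)  # false positives
    tn = sum(1 for o, i in pairs if o == 0 and i == 0)  # true negatives

    return ValidationMetrics(
        sensitivity=_ratio(tp, tp + fn),
        specificity=_ratio(tn, tn + fp),
        ppv=_ratio(tp, tp + fp),
        accuracy=_ratio(tp + tn, len(pairs)),
    )


# Toy labels (hypothetical, not survey data): 4 undocumented, 6 other.
observed_status = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
imputed_status = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = validation_metrics(observed_status, imputed_status)
print(metrics)
```

In a comparison like the one described, the same function would be applied to the output of each imputation method on a sample where legal status is observed, so that the methods can be ranked on identical statistics.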