Instituto de Medicina Social, Universidade do Estado do Rio de Janeiro, Rio de Janeiro, Brasil.
Instituto de Saúde Coletiva, Universidade Federal da Bahia, Salvador, Brasil.
Cad Saude Publica. 2021 Jul 28;37(7):e00039321. doi: 10.1590/0102-311X00039321. eCollection 2021.
Strategies for improving geocoded data often rely on interactive manual processes that are time-consuming and impractical for large-scale projects. In this study, we evaluated automated strategies for improving address quality and geocoding match rates using a large dataset of addresses from death records in Rio de Janeiro, Brazil. The mortality data comprised 132,863 records with address information in a structured format. We applied regular expressions and dictionary-based methods to standardize and enrich the addresses. All records were linked by postal code or street name to the Brazilian National Address Directory (DNE), obtained from Brazil's Postal Service. Residential addresses were geocoded using Google Maps. A record was considered a geocoding match when its address data were validated down to the street level and the returned location type was rooftop, range interpolated, or geometric center. Overall performance was assessed by manually reviewing a sample of addresses. Of the original 132,863 records, 85.7% (n = 113,876) were geocoded and validated, of which 83.8% were matched at the rooftop level (high accuracy). Overall sensitivity and specificity were 87% (95%CI: 86-88) and 98% (95%CI: 96-99), respectively. Our results indicate that address quality and geocoding completeness can be reliably improved with an automated geocoding process. R scripts and instructions to reproduce all the analyses are available at https://github.com/reprotc/geocoding.
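The standardization step and the match rule described above can be illustrated with a minimal Python sketch. It is not the authors' code (their R scripts are in the repository cited above), and it rests on several assumptions: the abbreviation dictionary `STREET_TYPES` holds a few illustrative entries only, the function names `standardize_address` and `is_geocoding_match` are hypothetical, and the location type strings follow the naming used by the Google Geocoding API. The street-level validation against the DNE is represented here only as a boolean input.

```python
import re
import unicodedata

# Hypothetical abbreviation dictionary: illustrative entries, not the study's dictionary.
STREET_TYPES = {
    "R": "RUA",
    "AV": "AVENIDA",
    "TRAV": "TRAVESSA",
    "PCA": "PRACA",
    "ESTR": "ESTRADA",
}

# Location types counted as a match in the abstract, written with the
# string values used by the Google Geocoding API (assumed naming).
ACCEPTED_LOCATION_TYPES = {"ROOFTOP", "RANGE_INTERPOLATED", "GEOMETRIC_CENTER"}


def standardize_address(raw: str) -> str:
    """Normalize a free-text street address with regex and dictionary rules."""
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.upper()
    # Replace punctuation with spaces, then collapse repeated whitespace.
    text = re.sub(r"[^A-Z0-9 ]+", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    # Expand a leading street-type abbreviation using the dictionary.
    tokens = text.split(" ")
    if tokens[0] in STREET_TYPES:
        tokens[0] = STREET_TYPES[tokens[0]]
    return " ".join(tokens)


def is_geocoding_match(location_type: str, validated_to_street: bool) -> bool:
    """Apply the match rule: street-level validation plus an accepted location type."""
    return validated_to_street and location_type in ACCEPTED_LOCATION_TYPES


if __name__ == "__main__":
    print(standardize_address("R. São Clemente, 123"))
    print(is_geocoding_match("ROOFTOP", True))
    print(is_geocoding_match("APPROXIMATE", True))
```

A real pipeline would likely need a much larger dictionary (titles, ordinal numbers, common misspellings) and would expand abbreviations beyond the first token; the sketch keeps only the leading street type to stay short.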