Department of Computer Science and Engineering, Tandon School of Engineering, Brooklyn, NY 11201.
Department of Biostatistics, School of Global Public Health, New York, NY 10003.
Proc Natl Acad Sci U S A. 2024 Sep 24;121(39):e2402387121. doi: 10.1073/pnas.2402387121. Epub 2024 Sep 17.
New data sources and AI methods for extracting information are increasingly abundant and relevant to decision-making across societal applications. A notable example is street view imagery, available in over 100 countries, and purported to inform built environment interventions (e.g., adding sidewalks) for community health outcomes. However, biases can arise when decision-making does not account for data robustness or relies on spurious correlations. To investigate this risk, we analyzed 2.02 million Google Street View (GSV) images alongside health, demographic, and socioeconomic data from New York City. Findings demonstrate robustness challenges; built environment characteristics inferred from GSV labels at the intracity level often do not align with ground truth. Moreover, as average individual-level behavior of physical inactivity significantly mediates the impact of built environment features by census tract, intervention on features measured by GSV would be misestimated without proper model specification and consideration of this mediation mechanism. Using a causal framework accounting for these mediators, we determined that intervening by improving 10% of samples in the two lowest tertiles of physical inactivity would lead to a 4.17 (95% CI 3.84-4.55) or 17.2 (95% CI 14.4-21.3) times greater decrease in the prevalence of obesity or diabetes, respectively, compared to the same proportional intervention on the number of crosswalks by census tract. This study highlights critical issues of robustness and model specification in using emergent data sources, showing the data may not measure what is intended, and ignoring mediators can result in biased intervention effect estimates.
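The mediation argument above can be illustrated with a minimal sketch. It is not the authors' causal framework or data: it uses the simple linear product-of-coefficients decomposition on synthetic data, and every name and coefficient in it (`crosswalks`, `inactivity`, `obesity`, `mediation_effects`, the simulated effect sizes) is a hypothetical chosen for illustration. It shows how an exposure's total effect splits into a direct part and a part that passes through a mediator.

```python
import numpy as np


def ols(y, *columns):
    """Least-squares coefficients for y ~ 1 + columns (intercept first)."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def mediation_effects(x, m, y):
    """Linear mediation decomposition (product-of-coefficients method)."""
    a = ols(m, x)[1]              # exposure -> mediator
    total = ols(y, x)[1]          # exposure -> outcome, mediator omitted
    _, direct, b = ols(y, x, m)   # exposure and mediator -> outcome
    return {"total": total, "direct": direct, "indirect": a * b}


# Synthetic, standardized tract-level variables (assumed, not from the study).
rng = np.random.default_rng(0)
n = 5000
crosswalks = rng.normal(size=n)
# Assumption: more crosswalks slightly lowers physical inactivity.
inactivity = -0.5 * crosswalks + rng.normal(size=n)
# Assumption: the outcome depends mostly on inactivity, weakly on crosswalks.
obesity = 0.8 * inactivity - 0.05 * crosswalks + rng.normal(size=n)

effects = mediation_effects(crosswalks, inactivity, obesity)
for name, value in effects.items():
    print(f"{name:>8}: {value:+.3f}")
```

In this simulated setting most of the association between `crosswalks` and `obesity` runs through `inactivity`, so a model that omits the mediator attributes the whole total effect to the built-environment feature. For ordinary least squares on a single sample, the total effect equals the direct effect plus the indirect effect exactly.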