New York University, New York, NY, United States of America.
Microsoft Research, Redmond, WA, United States of America.
PLoS One. 2021 Jun 9;16(6):e0252383. doi: 10.1371/journal.pone.0252383. eCollection 2021.
Estimating disease prevalence at the sub-city neighborhood scale enables early, targeted interventions that can help save lives and reduce public health burdens. However, the prohibitive cost of highly localized data collection and the sparsity of representative signals have made it challenging to identify neighborhood-scale disease prevalence. To overcome this challenge, we use alternative data sources that are both less sparse and representative of localized disease prevalence: using query data from a large commercial search engine, we identify the prevalence of respiratory illness in the United States at census-tract granularity. Focusing on asthma and chronic obstructive pulmonary disease (COPD), we construct a set of features based on searches for symptoms, medications, and disease-related information, and use these to identify illness rates in more than 23,000 tracts in 500 cities across the United States. Out-of-sample model estimates from search data alone correlate with ground-truth illness rate estimates from the CDC at 0.69 to 0.76, with simple additions to these models raising those correlations to as high as 0.84. We then show that, in practice, search query data can be combined with other relevant data, such as census or land cover data, to boost results: models that incorporate all data sources correlate with ground-truth data at 0.91 for asthma and 0.88 for COPD.
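The evaluation reported in the abstract, an out-of-sample correlation between model estimates and ground-truth tract rates, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the data are synthetic, the names `pearson`, `fit_line`, and `out_of_sample_r` are hypothetical, and a single-feature least-squares fit stands in for the study's models, which the abstract does not specify.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx


def out_of_sample_r(features, rates, test_fraction=0.2, seed=0):
    """Fit on a random subset of tracts, then correlate predictions with
    the true rates on the held-out tracts only."""
    rng = random.Random(seed)
    idx = list(range(len(features)))
    rng.shuffle(idx)
    n_test = max(2, int(len(idx) * test_fraction))
    test, train = idx[:n_test], idx[n_test:]
    slope, intercept = fit_line(
        [features[i] for i in train], [rates[i] for i in train]
    )
    preds = [slope * features[i] + intercept for i in test]
    return pearson(preds, [rates[i] for i in test])


# Synthetic stand-in data (assumption): one search-derived feature per tract
# and a noisy illness rate that depends on it linearly.
rng = random.Random(42)
search_rate = [rng.uniform(0, 1) for _ in range(500)]
asthma_rate = [8 + 4 * s + rng.gauss(0, 1) for s in search_rate]

r = out_of_sample_r(search_rate, asthma_rate)
print(f"out-of-sample Pearson r = {r:.2f}")
```

Holding out tracts before fitting is what makes the correlation out-of-sample: the held-out tracts play no part in estimating the slope and intercept, so the reported value reflects how well the fit generalizes rather than how well it memorizes.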