Memon Shahan Ali, Razak Saquib, Weber Ingmar
Language Technologies Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, United States.
Carnegie Mellon University, Doha, Qatar.
J Med Internet Res. 2020 Jan 27;22(1):e13347. doi: 10.2196/13347.
Because official health statistics for lifestyle diseases are slow to produce, researchers have explored using Web search data as a proxy for lifestyle disease surveillance. Existing studies, however, are prone to at least one of the following issues: ad hoc keyword selection, overfitting, insufficient predictive evaluation, lack of generalization, and failure to compare against trivial baselines.
The aims of this study were to (1) employ a corrective approach improving previous methods; (2) study the key limitations in using Google Trends for lifestyle disease surveillance; and (3) test the generalizability of our methodology to other countries beyond the United States.
For each of the target variables (diabetes, obesity, and exercise), prevalence rates were collected. After a rigorous keyword selection process, data from Google Trends were collected. These data were denormalized to form spatio-temporal indices. L1-regularized regression models were trained to predict prevalence rates from the denormalized Google Trends indices. Models were tested on a held-out set and compared against baselines from the literature as well as a trivial "last year equals this year" baseline. A similar analysis was done using a multivariate spatio-temporal model in which the previous year's prevalence was included as a covariate. This model was modified to create a time-lagged regression analysis framework. Finally, a hierarchical time-lagged multivariate spatio-temporal model was created to account for subnational trends in the data. The model trained on US data was then applied to Canada in a transfer learning framework.
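The comparison described above, an L1-regularized regression evaluated against the "last year equals this year" baseline on held-out data, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic, the feature layout (previous year's prevalence plus two search indices), the penalty `alpha=0.05`, and all function names are hypothetical, and the study's denormalization scheme and hierarchical model are not reproduced.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the L1 proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_lasso(X, y, alpha, n_iter=200):
    """Coordinate descent for (1/2n)*||y - b0 - Xw||^2 + alpha*||w||_1.

    Features are centered internally, so the intercept b0 is not penalized.
    """
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]
    y_mean = sum(y) / n
    resid = [yi - y_mean for yi in y]  # residual while all weights are zero
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            col = [row[j] for row in Xc]
            z = sum(c * c for c in col) / n
            if z == 0.0:
                continue  # constant feature carries no information
            # Correlate feature j with the residual that excludes feature j.
            rho = sum(c * (r + w[j] * c) for c, r in zip(col, resid)) / n
            new_w = soft_threshold(rho, alpha) / z
            resid = [r - (new_w - w[j]) * c for c, r in zip(col, resid)]
            w[j] = new_w
    b0 = y_mean - sum(wj * mj for wj, mj in zip(w, means))
    return b0, w


def predict(b0, w, X):
    return [b0 + sum(wk * xk for wk, xk in zip(w, row)) for row in X]


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


def make_rows(n):
    """Synthetic (made-up) rows: [previous prevalence, index A, index B]."""
    X, y = [], []
    for _ in range(n):
        prev = random.uniform(6.0, 12.0)   # last year's prevalence (%)
        informative = random.gauss(0.0, 1.0)
        uninformative = random.gauss(0.0, 1.0)
        X.append([prev, informative, uninformative])
        y.append(prev + 0.5 * informative + random.gauss(0.0, 0.2))
    return X, y


random.seed(0)
X_train, y_train = make_rows(60)
X_test, y_test = make_rows(20)  # held-out set

b0, w = fit_lasso(X_train, y_train, alpha=0.05)
model_mae = mae(y_test, predict(b0, w, X_test))
# Trivial baseline: predict this year's prevalence as last year's value.
baseline_mae = mae(y_test, [row[0] for row in X_test])
print("weights:", w)
print("model MAE:", model_mae, "baseline MAE:", baseline_mae)
```

The L1 penalty drives weights of uninformative search indices toward exactly zero, which is one way to limit overfitting when many candidate keywords are available. Reporting the baseline MAE next to the model MAE makes it visible whether the search data add anything beyond last year's figure.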
In the US context, our proposed models outperformed both prior work and the trivial baselines. In terms of the mean absolute error (MAE), the best of our proposed models yielded a 24% improvement (from 0.72 to 0.55; P<.001) for diabetes, an 18% improvement (from 1.20 to 0.99; P=.001) for obesity, and a 34% improvement (from 2.89 to 1.95; P<.001) for exercise. Our proposed across-country transfer learning framework also showed promising results, with average Spearman and Pearson correlations of 0.70 each for diabetes and of 0.90 and 0.91, respectively, for obesity.
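The two correlation measures used to assess the transfer learning results can be computed as in the sketch below. It is an illustration only: the predicted and observed values are made up, the function names are hypothetical, and nothing here reproduces the study's data or pipeline.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation: linear association between two sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up example: model predictions vs. observed prevalence (%) by region.
predicted = [7.1, 8.4, 6.2, 9.0, 7.8]
observed = [6.8, 8.9, 6.5, 9.4, 7.2]
print("Pearson:", pearson(predicted, observed))
print("Spearman:", spearman(predicted, observed))
```

Spearman correlation only asks whether regions are ranked in the right order, while Pearson correlation also reflects how linearly the predictions track the observed values. Neither measure penalizes a constant offset, so a transferred model can correlate well with observed rates and still be miscalibrated in absolute terms.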
Although our proposed models beat the baselines, we find the modeling of lifestyle diseases to be a challenging problem, one that requires an abundance of data as well as creative modeling strategies. Overall, this study shows low-to-moderate validity of Google Trends for lifestyle disease surveillance, even when applying novel corrective approaches, including a proposed denormalization scheme. We envision qualitative analyses as a more practical use of Google Trends in this context. For quantitative analyses, Google Trends is most useful in transfer learning, where low-resource countries could benefit from high-resource countries by using proxy models.