Chew Alvin Wei Ze, Pan Yue, Wang Ying, Zhang Limao
Bentley Systems Research Office, 1 Harbourfront Pl, HarbourFront Tower One, Singapore 098633, Singapore.
Shanghai Key Laboratory for Digital Maintenance of Buildings and Infrastructure, Department of Civil Engineering, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai, China.
Knowl Based Syst. 2021 Dec 5;233:107417. doi: 10.1016/j.knosys.2021.107417. Epub 2021 Aug 24.
In this study, a hybrid deep-learning model termed ODANN, built upon neural networks (NN) coupled with data assimilation and natural language processing (NLP) feature-extraction methods, was constructed to concurrently process daily COVID-19 time-series records and large volumes of COVID-19-related Twitter data, taken as representative of the global community's aggregated emotional responses to the pandemic, in order to model the growth rate in the number of confirmed COVID-19 cases globally via a proposed G parameter. ODANN's development comprised three key components: (i) data hydration and pre-processing were performed on COVID-19-related Twitter data posted between 23 January 2020 and 10 May 2020, amounting to over 100 million Tweets written in English; (ii) multiple NLP feature-extraction methods were then used to encode the hydrated Twitter data into semantic word vectors for training ODANN under an optimal set of hyperparameters; and (iii) historical time-series data of defined characteristics were assimilated into ODANN's selected hidden layer(s) to model the G parameter daily with a lead time of 1 day. Our experimental results showed that adopting a rolling time window of 5 days, defined with respect to the number of historical time-series records used for assimilating different data features, enabled ODANN to outperform other traditional time-series models and recent studies in terms of the RMSE and MAE scores computed in the model's testing step. Overall, these results demonstrate ODANN's competitive edge in modelling and forecasting the growth rate in the number of COVID-19 cases globally.
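The rolling-window setup and the two error scores mentioned in the abstract can be illustrated with a minimal sketch. It is not ODANN: the function names `make_windows`, `rmse`, and `mae` are hypothetical, the series values are made up for illustration, and the window-mean predictor is only a placeholder for a trained model. It assumes that a window of 5 days with a lead time of 1 day means each target is the value on the day immediately after its 5-day history.

```python
import math


def make_windows(series, window=5, lead=1):
    """Pair each run of `window` consecutive values with the value `lead` steps after it."""
    pairs = []
    for start in range(len(series) - window - lead + 1):
        history = series[start:start + window]
        target = series[start + window + lead - 1]
        pairs.append((history, target))
    return pairs


def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)


# Made-up daily values standing in for a growth-rate series (not data from the study).
series = [0.10, 0.12, 0.15, 0.14, 0.13, 0.11, 0.10, 0.09]

# Each pair holds 5 days of history and the value observed 1 day later.
pairs = make_windows(series, window=5, lead=1)

# Placeholder predictor (assumption): the mean of the window. A trained model would go here.
actual = [target for _, target in pairs]
predicted = [sum(history) / len(history) for history, _ in pairs]

print("windows:", len(pairs))
print("RMSE:", rmse(actual, predicted))
print("MAE:", mae(actual, predicted))
```

Changing `window` in this sketch changes how many past records feed each prediction, which is the quantity the abstract reports as working best at 5 days.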