Hossen Md Khalid, Peng Yan-Tsung, Chen Meng Chang
Social Networks and Human-Centered Computing, Taiwan International Graduate Program, Academia Sinica, Taipei, Taiwan.
Department of Computer Science, National Chengchi University, Taipei, Taiwan.
PLoS One. 2025 Feb 11;20(2):e0314327. doi: 10.1371/journal.pone.0314327. eCollection 2025.
Many deep learning tasks assume that all training data are sampled from the same distribution. This assumption may not hold for data collected in different contexts or over different periods; for instance, the temperatures in a city can vary from year to year for reasons that are not well understood. In this paper, we use three distinct statistical techniques to analyze annual data drift at various stations. Each technique computes a P value for every station by comparing data across five years (2014-2018), which measures the drift at specific locations and identifies the stations with the highest drift relative to previous years' datasets. The analysis uses meteorological air quality and weather data. We propose two models that take the characteristics of data drift into account for PM2.5 prediction, the Front-loaded connection (FLC) model and the Back-loaded connection (BLC) model, and compare them with various deep learning models, such as Long Short-Term Memory (LSTM) and its variants, for predictions from the next hour to the 64th hour. The proposed models significantly outperform traditional neural networks. We also introduce a wrapped loss function that is incorporated into a model and yields more accurate results than the original loss function alone; predictions are evaluated with the RMSE, MAE, and MAPE metrics. The FLC and BLC models address the data drift issue, and the wrapped loss function further alleviates it during training, helping neural network models achieve more accurate results.
Experimental results show that, in hourly PM2.5 prediction, the proposed model improves on the baseline BiLSTM model by 24.1%-16% at 1h-24h and by 12%-8.3% at 32h-64h, and on the CNN model by 24.6%-11.8% at 1h-24h and by 10%-10.2% at 32h-64h.
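The three evaluation metrics named in the abstract (RMSE, MAE, and MAPE) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the sample values, and the choice to skip zero-valued observations in MAPE are all assumptions made for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more heavily."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mae(y_true, y_pred):
    """Mean absolute error: average magnitude of the errors."""
    absolute = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(absolute) / len(absolute)


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent.

    Assumption: observations equal to zero are skipped to avoid
    division by zero; the paper may handle this case differently.
    """
    ratios = [abs((t - p) / t) for t, p in zip(y_true, y_pred) if t != 0]
    return 100.0 * sum(ratios) / len(ratios)


# Hypothetical hourly PM2.5 values, for illustration only.
observed = [10.0, 20.0, 40.0]
predicted = [12.0, 18.0, 44.0]

print(rmse(observed, predicted))
print(mae(observed, predicted))
print(mape(observed, predicted))
```

RMSE and MAE are expressed in the same unit as the predicted quantity, while MAPE is scale-free, which is one reason the three are often reported together.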