Efficient use of binned data for imputing univariate time series data.

Author information

Darji Jay, Biswas Nupur, Padul Vijay, Gill Jaya, Kesari Santosh, Ashili Shashaanka

Affiliations

Rhenix Lifesciences, Hyderabad, Telangana, India.

CureScience, San Diego, CA, United States.

Publication information

Front Big Data. 2024 Aug 21;7:1422650. doi: 10.3389/fdata.2024.1422650. eCollection 2024.

Abstract

Time series data are recorded in many sectors, producing large volumes of data. However, the continuity of these data is often interrupted, leaving periods of missing data. Several algorithms are used to impute the missing data, and their performance varies widely. Apart from the choice of algorithm, effective imputation depends on the nature of the missing and available data. We conducted extensive studies using different types of time series data, specifically heart rate data and power consumption data. We generated missing data for different time spans and imputed them using different algorithms with binned data of different sizes. Performance was evaluated using the root mean square error (RMSE) metric. We observed a reduction in RMSE when using binned data compared to the entire dataset, particularly in the case of the expectation-maximization (EM) algorithm. RMSE was reduced when using binned data for 1-, 5-, and 15-min missing data, with a greater reduction observed for 15-min missing data. We also observed the effect of data fluctuation. We conclude that the usefulness of binned data depends on the span of missing data, the sampling frequency of the data, and the fluctuation within the data. Depending on the inherent characteristics, quality, and quantity of the missing and available data, binned data can be used to impute a wide variety of data, including biological heart rate data derived from an Internet of Things (IoT) smartwatch and non-biological data such as household power consumption data.
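The workflow summarized in the abstract (mask a span of data, impute it from either the entire dataset or a bin, and score the result with RMSE) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the synthetic heart-rate-like series, the one-sample-per-second rate, the gap position, the 300-sample bin, and the function names `rmse`, `mask_gap`, and `impute_mean`. Plain mean imputation is swapped in as a simple stand-in; it is not the EM algorithm or any other method evaluated in the paper.

```python
import numpy as np


def rmse(true, pred):
    """Root mean square error between two equal-length sequences."""
    true = np.asarray(true, dtype=float)
    pred = np.asarray(pred, dtype=float)
    return float(np.sqrt(np.mean((true - pred) ** 2)))


def mask_gap(series, start, length):
    """Return a copy of the series with one contiguous span set to NaN."""
    out = np.array(series, dtype=float)
    out[start:start + length] = np.nan
    return out


def impute_mean(series, start, length, bin_size=None):
    """Fill the gap with the mean of the observed values.

    bin_size=None uses the entire series; otherwise only a bin of
    bin_size samples roughly centred on the gap is used.
    """
    series = np.asarray(series, dtype=float)
    if bin_size is None:
        lo, hi = 0, len(series)
    else:
        centre = start + length // 2
        lo = max(0, centre - bin_size // 2)
        hi = min(len(series), lo + bin_size)
    window = series[lo:hi]
    if np.all(np.isnan(window)):
        raise ValueError("bin contains no observed values")
    out = series.copy()
    out[start:start + length] = np.nanmean(window)
    return out


# Synthetic, slowly drifting signal (assumed: one sample per second).
rng = np.random.default_rng(0)
t = np.arange(3600)
series = 70 + 15 * np.sin(2 * np.pi * t / 3600) + rng.normal(0, 1, t.size)

# Hypothetical 1-min gap starting at sample 1000.
start, length = 1000, 60
gap = slice(start, start + length)
observed = mask_gap(series, start, length)

full_fill = impute_mean(observed, start, length)                # entire dataset
bin_fill = impute_mean(observed, start, length, bin_size=300)   # 5-min bin

# RMSE is scored on the masked span only.
rmse_full = rmse(series[gap], full_fill[gap])
rmse_bin = rmse(series[gap], bin_fill[gap])
print(f"RMSE, entire dataset: {rmse_full:.2f}")
print(f"RMSE, 300-sample bin: {rmse_bin:.2f}")
```

On this synthetic drifting signal the bin mean tracks the local level more closely than the global mean, so the binned RMSE comes out lower. This only illustrates the mechanism and does not reproduce the paper's results. With a flat or rapidly fluctuating signal the difference can shrink or vanish, which is consistent with the abstract's point that the benefit depends on the gap span, the sampling frequency, and the fluctuation within the data.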

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4e94/11371617/1bb73b8ecaab/fdata-07-1422650-g0001.jpg
