Efficient use of binned data for imputing univariate time series data.

Author information

Darji Jay, Biswas Nupur, Padul Vijay, Gill Jaya, Kesari Santosh, Ashili Shashaanka

Affiliations

Rhenix Lifesciences, Hyderabad, Telangana, India.

CureScience, San Diego, CA, United States.

Publication information

Front Big Data. 2024 Aug 21;7:1422650. doi: 10.3389/fdata.2024.1422650. eCollection 2024.

DOI:10.3389/fdata.2024.1422650

PMID:39234189

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11371617/

Abstract

Time series data are recorded in many sectors, producing large volumes of data. However, the continuity of these data is often interrupted, leaving periods of missing data. Several algorithms are used to impute the missing data, and their performance varies widely. Apart from the choice of algorithm, effective imputation depends on the nature of the missing and available data. We conducted extensive studies using different types of time series data, specifically heart rate data and power consumption data. We generated missing data for different time spans and imputed them using different algorithms with binned data of different sizes. Performance was evaluated using the root mean square error (RMSE) metric. We observed a reduction in RMSE when using binned data compared to the entire dataset, particularly for the expectation-maximization (EM) algorithm. RMSE was reduced when using binned data for 1-, 5-, and 15-min missing data, with a greater reduction observed for 15-min missing data. We also observed an effect of data fluctuation. We conclude that the usefulness of binned data depends on the span of missing data, the sampling frequency of the data, and the fluctuation within the data. Depending on the inherent characteristics, quality, and quantity of the missing and available data, binned data can be used to impute a wide variety of data, including biological heart rate data from an Internet of Things (IoT) smartwatch and non-biological data such as household power consumption data.
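The evaluation procedure described in the abstract (create an artificial gap, impute it from either the whole series or a bin of nearby data, then score with RMSE) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes that a "bin" means a window of observations on either side of the gap, and it uses plain mean imputation as a stand-in for the EM algorithm named in the abstract. The synthetic series, the `mean_impute` helper, and the `bin_size` parameter are all hypothetical.

```python
import numpy as np


def rmse(truth, estimate):
    """Root mean square error between two equal-length sequences."""
    truth = np.asarray(truth, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    return float(np.sqrt(np.mean((truth - estimate) ** 2)))


def mean_impute(series, gap, bin_size=None):
    """Fill series[start:stop] with the mean of the observed values.

    If bin_size is None, the mean is taken over the entire series.
    Otherwise only bin_size samples on each side of the gap are used
    (our assumed reading of "binned data").
    Mean imputation is a simple stand-in here, not the EM algorithm.
    """
    start, stop = gap
    values = np.asarray(series, dtype=float).copy()
    values[start:stop] = np.nan  # simulate the missing period
    if bin_size is None:
        context = values
    else:
        lo = max(0, start - bin_size)
        hi = min(len(values), stop + bin_size)
        context = values[lo:hi]
    values[start:stop] = np.nanmean(context)  # NaNs in the gap are ignored
    return values


# Hypothetical series: a slow drift plus a small oscillation,
# loosely resembling heart rate sampled once per minute.
t = np.arange(600)
series = 60 + 0.1 * t + 2 * np.sin(t / 10)

gap = (100, 115)  # a 15-sample missing span
truth = series[gap[0]:gap[1]]

full = mean_impute(series, gap)                  # whole dataset as context
binned = mean_impute(series, gap, bin_size=30)   # 30 samples on each side

rmse_full = rmse(truth, full[gap[0]:gap[1]])
rmse_binned = rmse(truth, binned[gap[0]:gap[1]])
print(f"RMSE (entire dataset): {rmse_full:.2f}")
print(f"RMSE (binned):         {rmse_binned:.2f}")
```

On a drifting series like this one, the local window tracks the level near the gap more closely than the global mean does, which is the intuition behind the reduction in RMSE reported above. How large the effect is on real data would depend on the gap length, sampling frequency, and fluctuation, as the abstract notes.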

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4e94/11371617/1bb73b8ecaab/fdata-07-1422650-g0001.jpg
