Binned Data Provide Better Imputation of Missing Time Series Data from Wearables.

Affiliations

Rhenix Lifesciences, Hyderabad 500038, India.

Department of BioSciences and BioEngineering, Indian Institute of Technology, Guwahati 781039, India.

Publication information

Sensors (Basel). 2023 Jan 28;23(3):1454. doi: 10.3390/s23031454.

Abstract

Missing values are a common and well-known problem in time-series datasets. Various statistical and machine learning methods have been developed to fill in the missing values. However, the performance of these methods varies widely and depends strongly on the type of data and the correlations within the data. In our study, we applied several well-known imputation methods (expectation maximization, k-nearest neighbor, iterative imputer, random forest, and simple imputer) to impute missing data obtained from smart, wearable health trackers. In this manuscript, we proposed the use of data binning for imputation. We showed that using data binned around the missing time interval provides better imputation than using the whole dataset. Imputation was performed for 15 min and 1 h of continuous missing data. We used different bin sizes, such as 15 min, 30 min, 45 min, and 1 h, and evaluated the results using root mean square error (RMSE) values. We observed that the expectation maximization algorithm worked best with binned data, followed by the simple imputer, iterative imputer, and k-nearest neighbor, whereas data binning had no effect on the random forest method during imputation. Moreover, the smallest bin sizes, 15 min and 1 h, provided the lowest RMSE values for the majority of the time frames when imputing 15 min and 1 h of missing data, respectively. Although demonstrated on digital health data, we think that this method will also find applicability in other domains.
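The comparison described in the abstract, imputing a gap from data binned around it versus from the whole dataset and scoring both by RMSE, can be illustrated with a small sketch. This is not the authors' code. It assumes a synthetic 1-minute heart-rate trace, and it uses plain mean filling as a stand-in for a simple imputer. The names `mean_impute`, `rmse`, and `heart_rate` are hypothetical.

```python
import math


def rmse(truth, pred):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(truth, pred)) / len(truth))


def mean_impute(series, gap_start, gap_len, bin_size=None):
    """Fill a contiguous gap with the mean of the observed samples.

    bin_size=None uses every sample outside the gap (whole dataset).
    Otherwise only the `bin_size` samples immediately before and after
    the gap are used (data binned around the missing interval).
    """
    gap_end = gap_start + gap_len
    if bin_size is None:
        context = series[:gap_start] + series[gap_end:]
    else:
        before = series[max(0, gap_start - bin_size):gap_start]
        after = series[gap_end:gap_end + bin_size]
        context = before + after
    fill = sum(context) / len(context)
    return [fill] * gap_len


# Assumption: a synthetic 24 h trace sampled once per minute, with a slow
# daily cycle. Real wearable data would be noisier and multivariate.
heart_rate = [70 + 15 * math.sin(2 * math.pi * m / 1440) for m in range(1440)]

# Hide 15 consecutive minutes and keep the true values for scoring.
gap_start, gap_len = 300, 15
truth = heart_rate[gap_start:gap_start + gap_len]

whole_rmse = rmse(truth, mean_impute(heart_rate, gap_start, gap_len))
binned_rmse = {
    b: rmse(truth, mean_impute(heart_rate, gap_start, gap_len, bin_size=b))
    for b in (15, 30, 45, 60)  # bin sizes in minutes, as in the abstract
}

print(f"whole dataset: RMSE = {whole_rmse:.2f}")
for b, err in binned_rmse.items():
    print(f"bin of {b} min: RMSE = {err:.2f}")
```

On this smooth synthetic signal, the binned variants track the local level of the series, while the whole-dataset mean is pulled toward the daily average. The study's own ranking of methods and bin sizes comes from real tracker data and cannot be inferred from this toy example.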

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2565/9919790/42fd4120e536/sensors-23-01454-g001.jpg
