Kim Ki-Hun, Kim Kwang-Jae
Faculty of Industrial Design Engineering, Delft University of Technology, Delft, Netherlands.
Department of Industrial Engineering, Ulsan National Institute of Science and Technology, Ulsan, Republic of Korea.
JMIR Med Inform. 2020 Dec 17;8(12):e20597. doi: 10.2196/20597.
A lifelogs-based wellness index (LWI) is a function for calculating wellness scores based on health behavior lifelogs (eg, daily walking steps and sleep times collected via a smartwatch). A wellness score intuitively shows the users of smart wellness services the overall condition of their health behaviors. LWI development includes estimation (ie, estimating coefficients in LWI with data). A panel data set comprising health behavior lifelogs allows LWI estimation to control for unobserved variables, thereby resulting in less bias. However, these data sets typically have missing data due to events that occur in daily life (eg, smart devices stop collecting data when batteries are depleted), which can introduce biases into LWI coefficients. Choosing an appropriate method to handle missing data is therefore important for reducing biases in LWI estimation with panel data, yet little research has addressed this choice.
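To illustrate how panel data can control for unobserved variables, the following is a minimal sketch of a fixed-effects (within) estimator for a linear index. The choice of a within estimator, the function name `within_estimate`, and the toy data (two hypothetical behaviors and a person-specific effect) are assumptions made for illustration; the abstract does not state which panel estimator the study used.

```python
import numpy as np

def within_estimate(X, y, person_ids):
    """Fixed-effects (within) estimate of linear coefficients.

    Subtracting each person's own mean removes any time-invariant,
    person-specific term (observed or not) before least squares.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    person_ids = np.asarray(person_ids)
    X_demeaned = X.copy()
    y_demeaned = y.copy()
    for pid in np.unique(person_ids):
        rows = person_ids == pid
        X_demeaned[rows] -= X[rows].mean(axis=0)
        y_demeaned[rows] -= y[rows].mean()
    beta, *_ = np.linalg.lstsq(X_demeaned, y_demeaned, rcond=None)
    return beta

# Toy panel (hypothetical): 5 people x 20 days, 2 behaviors.
rng = np.random.default_rng(0)
n_people, n_days = 5, 20
person_ids = np.repeat(np.arange(n_people), n_days)
# Unobserved person-level effect that also shifts the behaviors.
person_effect = rng.normal(size=n_people)[person_ids]
X = rng.normal(size=(n_people * n_days, 2)) + person_effect[:, None]
true_beta = np.array([0.5, -0.3])
y = X @ true_beta + 3.0 * person_effect  # noise-free for clarity

beta_hat = within_estimate(X, y, person_ids)
print(beta_hat)
```

Because the person-level effect is constant within each person, demeaning removes it entirely, so the coefficients are recovered even though the effect is correlated with the behaviors. A pooled regression that ignores the panel structure would not have this property.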
This study aims to identify a suitable missing-data handling method for LWI estimation with panel data.
Listwise deletion, mean imputation, expectation maximization-based multiple imputation, predictive-mean matching-based multiple imputation, k-nearest neighbors-based imputation, and low-rank approximation-based imputation were comparatively evaluated by simulating an existing case of LWI development. A panel data set comprising health behavior lifelogs of 41 college students over 4 weeks was transformed into a reference data set without any missing data. Then, 200 simulated data sets were generated by randomly introducing missing data at proportions from 1% to 80%. The missing-data handling methods were each applied to transform the simulated data sets into complete data sets, and coefficients in a linear LWI were estimated for each complete data set. For each method at each proportion, a bias measure was calculated by comparing the estimated coefficient values with values estimated from the reference data set.
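The evaluation procedure described above can be sketched roughly as follows. This is not the study's code: the synthetic reference data, the two methods shown (mean imputation and an iterative truncated-SVD imputation standing in for low-rank approximation-based imputation), the chosen rank, the number of repetitions, and the bias measure (mean absolute difference from the reference coefficients) are all assumptions for illustration.

```python
import numpy as np

def introduce_missing(data, proportion, rng):
    """Return a copy with `proportion` of entries set to NaN at random."""
    out = data.astype(float).copy()
    n_missing = int(round(proportion * out.size))
    idx = rng.choice(out.size, size=n_missing, replace=False)
    out.flat[idx] = np.nan
    return out

def mean_impute(data):
    """Replace each missing entry with the mean of its column."""
    col_means = np.nanmean(data, axis=0)
    return np.where(np.isnan(data), col_means, data)

def low_rank_impute(data, rank, n_iter=50):
    """Fill missing entries by repeatedly projecting onto a rank-`rank` SVD."""
    mask = np.isnan(data)
    filled = mean_impute(data)  # starting guess
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        filled[mask] = approx[mask]  # observed entries stay untouched
    return filled

def fit_coefficients(X, y):
    """Ordinary least squares with an intercept; returns slopes only."""
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[1:]

def mean_abs_bias(estimates, reference):
    """Assumed bias measure: mean absolute gap to reference coefficients."""
    return float(np.mean(np.abs(np.asarray(estimates) - reference)))

# Synthetic reference data (assumption): approximately rank-2 behaviors.
rng = np.random.default_rng(42)
n_rows, n_behaviors = 200, 4
latent = rng.normal(size=(n_rows, 2))
loadings = rng.normal(size=(2, n_behaviors))
X_ref = latent @ loadings + 0.3 * rng.normal(size=(n_rows, n_behaviors))
y = X_ref @ np.array([0.4, -0.2, 0.3, 0.1]) + 0.1 * rng.normal(size=n_rows)
beta_ref = fit_coefficients(X_ref, y)

methods = {
    "mean": mean_impute,
    "low_rank": lambda d: low_rank_impute(d, rank=2),
}
bias = {name: {} for name in methods}
for proportion in (0.1, 0.3, 0.6):
    for name, impute in methods.items():
        estimates = [
            fit_coefficients(impute(introduce_missing(X_ref, proportion, rng)), y)
            for _ in range(20)  # repetitions per proportion (assumed)
        ]
        bias[name][proportion] = mean_abs_bias(estimates, beta_ref)

for name, by_proportion in bias.items():
    print(name, by_proportion)
```

In this sketch, missing values are introduced only into the behavior columns and the outcome is left complete, which is a simplification. The other four methods in the comparison would slot into the `methods` dictionary in the same way.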
Methods performed differently depending on the proportion of missing data. For proportions of 1% to 30%, low-rank approximation-based imputation, predictive-mean matching-based multiple imputation, and expectation maximization-based multiple imputation were superior. For proportions of 31% to 60%, low-rank approximation-based imputation and predictive-mean matching-based multiple imputation performed best. For proportions above 60%, only low-rank approximation-based imputation performed acceptably.
Low-rank approximation-based imputation was the best of the 6 missing-data handling methods regardless of the proportion of missing data. This superiority is generalizable to other panel data sets comprising health behavior lifelogs: such data sets have a verified low-rank nature, and low-rank approximation-based imputation is known to perform effectively on low-rank data. This result will guide missing-data handling to reduce coefficient biases in new development cases of linear LWIs with panel data.
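Since the generalization rests on the data being low rank, one informal way to check this for a new data set is to inspect how quickly its singular values decay. The sketch below is an assumption-laden illustration: the function name `rank_for_energy`, the 99% energy threshold, and the synthetic matrix are hypothetical, and the abstract does not say how low-rank nature was verified.

```python
import numpy as np

def rank_for_energy(matrix, threshold=0.99):
    """Smallest k whose top-k singular values hold `threshold` of the
    total squared singular-value energy (matrix used as-is, not centered)."""
    s = np.linalg.svd(np.asarray(matrix, dtype=float), compute_uv=False)
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(energy, threshold) + 1)

# Synthetic example (assumption): a 50 x 10 matrix of exact rank 2.
rng = np.random.default_rng(7)
low_rank_matrix = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 10))
k = rank_for_energy(low_rank_matrix)
print(k)
```

A value of `k` that is small relative to the number of columns suggests that a low-rank approximation captures most of the structure in the data. Real lifelog data contain noise, so the threshold would need to be chosen with that in mind.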