Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA.
J Am Med Inform Assoc. 2021 Jul 30;28(8):1777-1784. doi: 10.1093/jamia/ocab069.
We propose a bidirectional GPS imputation method that can recover real-world mobility trajectories even when a substantial proportion of the data are missing. The time complexity of our online method is linear in the sample size, and it provides accurate estimates of daily or hourly summary statistics such as time spent at home and distance traveled.
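As an illustration of the two summary statistics named above, the following is a minimal sketch of how distance traveled and time spent at home might be computed from a complete (observed or imputed) trajectory. It is not the authors' implementation. The function names, the `(t_seconds, lat, lon)` track layout, and the 200 m home radius are assumptions made for this example only.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))


def distance_traveled_m(track):
    """Sum of great-circle steps along a time-sorted list of (t_seconds, lat, lon)."""
    return sum(
        haversine_m(a[1], a[2], b[1], b[2]) for a, b in zip(track, track[1:])
    )


def time_at_home_s(track, home, radius_m=200.0):
    """Seconds in intervals whose two endpoints both lie within radius_m of home.

    `home` is a (lat, lon) pair; the radius is a hypothetical choice.
    """
    total = 0.0
    for a, b in zip(track, track[1:]):
        a_home = haversine_m(a[1], a[2], home[0], home[1]) <= radius_m
        b_home = haversine_m(b[1], b[2], home[0], home[1]) <= radius_m
        if a_home and b_home:
            total += b[0] - a[0]
    return total


if __name__ == "__main__":
    # Ten minutes stationary at home, then a move 0.01 degrees north.
    track = [(0, 42.0, -71.0), (600, 42.0, -71.0), (1200, 42.01, -71.0)]
    print(round(distance_traveled_m(track)), "m traveled")
    print(time_at_home_s(track, home=(42.0, -71.0)), "s at home")
```

Using the great-circle distance rather than a planar approximation keeps the example consistent with the spherical geometry of GPS coordinates.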
To preserve a smartphone's battery, GPS may be sampled for only a small portion of the time, frequently <10%, which leads to a substantial missing data problem. We developed an algorithm that simulates an individual's trajectory from observed GPS location traces using a sparse online Gaussian process, which addresses the high computational complexity of the existing method. The method also retains the spherical geometry of the problem and imputes the missing trajectory bidirectionally, with multiple condition checks to improve accuracy.
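To make the scale of the missing data problem concrete, here is a small sketch of an on/off sampling cycle and the fraction of time it observes. The cycle lengths (60 s on, 600 s off) are hypothetical values chosen only to give an observed fraction below 10%. They are not taken from the study, and the function names are assumptions for this example.

```python
def observed_fraction(on_s, off_s):
    """Fraction of time GPS is recorded under a repeating on/off cycle."""
    if on_s < 0 or off_s < 0 or on_s + off_s == 0:
        raise ValueError("cycle lengths must be non-negative and not both zero")
    return on_s / (on_s + off_s)


def observed_mask(duration_s, on_s, off_s):
    """Per-second booleans: True where GPS is on, starting with an on-period."""
    cycle = on_s + off_s
    return [(t % cycle) < on_s for t in range(duration_s)]


if __name__ == "__main__":
    frac = observed_fraction(60, 600)  # hypothetical 1 min on, 10 min off
    print(f"observed: {frac:.1%}, missing: {1 - frac:.1%}")
    mask = observed_mask(24 * 3600, 60, 600)
    print(sum(mask), "observed seconds in one day")
```

Under such a schedule, more than nine tenths of each day's trajectory must be imputed before summary statistics can be computed.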
We demonstrated that (1) the imputed trajectories mimic real-world trajectories, (2) the confidence intervals of summary statistics cover the ground truth in most cases, and (3) our algorithm is much faster than existing methods when more than 3 months of observations are available. We also provide guidelines on optimal sampling strategies.
Our approach outperformed existing methods and was significantly faster. It can be used in settings in which data need to be analyzed and acted on continuously, for example, to detect behavioral anomalies that might affect treatment adherence, or to learn about colocations of individuals during an epidemic.