Biomedical Informatics Center, School of Medicine and Health Sciences, George Washington University, Washington, D.C., USA.
Int J Med Inform. 2021 Mar;147:104368. doi: 10.1016/j.ijmedinf.2020.104368. Epub 2020 Dec 16.
The data quality of electronic health records (EHRs) has been a topic of increasing interest to clinical and health services researchers. One indicator of possible errors in data is a large change in the frequency of observations in chronic illnesses. In this study, we built a stacked multivariate long short-term memory (LSTM) model and demonstrated its utility for predicting an acceptable range for the frequency of observations.
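As a minimal sketch of the range check described above, the following flags a week when its observed count falls outside a band around the predicted count. The function name `flag_aberrant` and the symmetric relative `tolerance` are illustrative assumptions; the abstract does not state how the acceptable range was derived from the LSTM output.

```python
def flag_aberrant(observed, predicted, tolerance=0.2):
    """Flag each week whose observed count lies outside the acceptable range.

    The range is assumed here to be predicted * (1 - tolerance) up to
    predicted * (1 + tolerance); this band is a hypothetical choice,
    not the definition used in the study.
    """
    flags = []
    for obs, pred in zip(observed, predicted):
        lower = pred * (1 - tolerance)
        upper = pred * (1 + tolerance)
        # A week is aberrant when the observation leaves the band.
        flags.append(not (lower <= obs <= upper))
    return flags


weekly_observed = [100, 50, 130]
weekly_predicted = [100, 100, 100]
flags = flag_aberrant(weekly_observed, weekly_predicted)
```

In this toy input the second and third weeks fall outside the assumed 20% band and would be flagged.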
We applied the LSTM approach to a large EHR dataset with over 400 million total encounters. We computed the sensitivity and specificity of predicting whether the frequency of an observation in a given week is an aberrant signal.
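For reference, sensitivity and specificity over weekly aberrant/normal labels can be computed as in the sketch below. The function name and the boolean-list input format are assumptions made for illustration, and the example labels are toy values, not data from the study.

```python
def sensitivity_specificity(actual, predicted):
    """Return (sensitivity, specificity) for paired boolean labels.

    actual[i] is True when week i is truly aberrant;
    predicted[i] is True when the detector flagged week i.
    """
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    tn = sum(1 for a, p in zip(actual, predicted) if not a and not p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    # Guard against empty classes; NaN signals "undefined".
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


actual = [True, True, False, False, False]
predicted = [True, False, False, False, True]
sens, spec = sensitivity_specificity(actual, predicted)
```

Sensitivity is the share of truly aberrant weeks that were flagged; specificity is the share of normal weeks that were left unflagged.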
Compared with the simple frequency monitoring approach, our proposed multivariate LSTM approach increased the sensitivity of finding aberrant signals in 6 randomly selected diagnostic codes from 75% to 88% and the specificity from 68% to 91%. We also experimented with two different LSTM algorithms, namely direct multi-step and recursive multi-step. Both models were able to detect the aberrant signals, and the recursive multi-step algorithm performed better.
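The difference between the two multi-step strategies can be sketched generically, as below. The forecasting models are stand-in callables (any function mapping a window of past values to a number), not LSTMs, and all names are hypothetical. The direct variant follows the classical formulation with one model per horizon step; the study's exact configuration is not given in the abstract.

```python
def recursive_forecast(step_model, history, horizon, window):
    """Recursive multi-step: one one-step model, fed its own predictions."""
    buf = list(history[-window:])
    out = []
    for _ in range(horizon):
        nxt = step_model(buf)
        out.append(nxt)
        # Slide the window forward, appending the prediction as if observed.
        buf = buf[1:] + [nxt]
    return out


def direct_forecast(horizon_models, history, window):
    """Direct multi-step: a separate model per horizon step, same input window."""
    inp = list(history[-window:])
    return [model(inp) for model in horizon_models]


# Toy stand-in models (assumptions for illustration only).
def mean_model(w):
    return sum(w) / len(w)


# One linear-extrapolation model per step h = 1, 2, 3.
trend_models = [lambda w, h=h: w[-1] + h * (w[-1] - w[-2]) for h in (1, 2, 3)]

history = [1.0, 2.0, 3.0, 4.0]
rec = recursive_forecast(mean_model, history, horizon=3, window=2)
direct = direct_forecast(trend_models, history, window=2)
```

The recursive strategy reuses a single model but lets prediction errors feed into later steps, whereas the direct strategy avoids that feedback at the cost of training one model per step.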
Simply monitoring the frequency trend, as is common practice in systems that do monitor data quality, cannot distinguish between fluctuations caused by seasonal disease changes, seasonal patient visits, or a change in data sources. Our study demonstrated the ability of stacked multivariate LSTM models to distinguish true data quality issues from fluctuations with other causes, including seasonal changes and outbreaks.