Homayouni Hajar, Ray Indrakshi, Ghosh Sudipto, Gondalia Shlok, Kahn Michael G
Computer Science Department, Colorado State University, Fort Collins, CO 80523 USA.
Anschutz Medical Campus, University of Colorado Denver, Aurora, CO 80045 USA.
SN Comput Sci. 2021;2(4):279. doi: 10.1007/s42979-021-00658-w. Epub 2021 May 19.
Anomaly detection and explanation in large volumes of real-world medical data, such as those pertaining to COVID-19, pose two challenges. First, we are dealing with time-series data. Typical time-series data describe the behavior of a single object over time. In medical data, the time series belong to multiple entities. Thus, there may be multiple subsets of records such that the records in each subset, which belong to a single entity, are temporally dependent, but the records in different subsets are unrelated. Moreover, the records in a subset contain different types of attributes, some of which must be grouped in a particular manner to make the analysis meaningful. Anomaly detection techniques need to be customized for time-series data belonging to multiple entities. Second, anomaly detection techniques fail to explain the cause of outliers to the experts. This is critical for new diseases and pandemics where current knowledge is insufficient. We propose to address these issues by extending our existing work called IDEAL, an LSTM-autoencoder based approach for data quality testing of sequential records that explains constraint violations in a manner understandable to end-users. The extension (1) uses a novel two-level reshaping technique that splits COVID-19 data sets into multiple temporally-dependent subsequences and (2) adds a data visualization plot to further explain the anomalies and evaluate the level of abnormality of subsequences detected by IDEAL. We performed two systematic evaluation studies of our anomalous subsequence detection. One study uses aggregate data, including the numbers of cases, deaths, and recoveries and the hospitalization rate, collected from a COVID tracking project, New York Times, and Johns Hopkins for the same time period. The other study uses COVID-19 patient medical records obtained from the Anschutz Medical Center health data warehouse.
The results are promising and indicate that our techniques can be used to detect anomalies in large volumes of real-world unlabeled data whose accuracy or validity is unknown.
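The abstract does not spell out how the two-level reshaping works. One plausible reading, offered only as a minimal sketch, is that records are first grouped by entity and each entity's time-ordered records are then cut into fixed-length overlapping windows, so that no subsequence mixes unrelated entities. The function name `reshape_two_level`, the field names, the window length, and the sample values below are all hypothetical and are not taken from IDEAL.

```python
from collections import defaultdict


def reshape_two_level(records, entity_key, time_key, feature_keys, window):
    """Hypothetical sketch of a two-level reshaping step.

    Level 1: group records by the entity they belong to.
    Level 2: sort each group by time and cut it into overlapping
             windows of `window` consecutive records.
    Returns a list of (entity, subsequence) pairs, where each
    subsequence is a list of feature rows.
    """
    # Level 1: one bucket of records per entity.
    by_entity = defaultdict(list)
    for rec in records:
        by_entity[rec[entity_key]].append(rec)

    subsequences = []
    for entity, recs in by_entity.items():
        # Level 2: order by time so each window is temporally dependent.
        ordered = sorted(recs, key=lambda r: r[time_key])
        rows = [[r[k] for k in feature_keys] for r in ordered]
        # Entities with fewer than `window` records yield no windows.
        for start in range(len(rows) - window + 1):
            subsequences.append((entity, rows[start:start + window]))
    return subsequences


# Made-up aggregate records for two entities, deliberately out of order.
records = [
    {"state": "CO", "day": 2, "cases": 12, "deaths": 1},
    {"state": "NY", "day": 1, "cases": 100, "deaths": 5},
    {"state": "CO", "day": 1, "cases": 10, "deaths": 0},
    {"state": "NY", "day": 2, "cases": 130, "deaths": 7},
    {"state": "CO", "day": 3, "cases": 15, "deaths": 1},
    {"state": "NY", "day": 3, "cases": 170, "deaths": 9},
    {"state": "CO", "day": 4, "cases": 19, "deaths": 2},
]

subsequences = reshape_two_level(
    records, "state", "day", ["cases", "deaths"], window=3
)
for entity, rows in subsequences:
    print(entity, rows)
```

Stacking the resulting windows gives an array of shape (subsequences, time steps, features), the layout an LSTM autoencoder typically expects as input. Whether IDEAL uses overlapping or disjoint windows is not stated in the abstract.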