University of Virginia, Charlottesville, VA, United States.
JMIR Public Health Surveill. 2020 Sep 4;6(3):e12842. doi: 10.2196/12842.
Agencies such as the Centers for Disease Control and Prevention (CDC) currently release influenza-like illness incidence data, along with descriptive summaries of simple spatio-temporal patterns and trends. However, public health researchers, government agencies, and the general public are often interested in deeper patterns and insights into how the disease is spreading, with additional context. Deriving such insights from incidence data requires analysis by domain experts.
Our goal was to develop an automated approach for finding interesting spatio-temporal patterns in the spread of a disease over a large region. Examples include regions with specific characteristics (eg, high incidence in a particular week, or a sudden change in incidence) and regions whose incidence differs significantly from that of earlier seasons.
We developed techniques from the area of transactional data mining for characterizing and finding interesting spatio-temporal patterns in disease spread in an automated manner. A key part of our approach involved using the principle of minimum description length to represent a given target set in terms of combinations of attributes (referred to as clauses). We considered both positive and negative clauses, as well as relaxed descriptions that approximately represent the set, and used integer programming to find such descriptions. Finally, we designed an automated approach that examines a large space of sets corresponding to different spatio-temporal patterns and ranks them by the ratio of their size to their description length (referred to as the compression ratio).
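The idea of describing a target set by clauses and scoring it by its compression ratio can be illustrated with a minimal sketch. This is not the authors' implementation: it replaces integer programming with exhaustive search over small clause combinations, uses positive clauses only, and assumes that description length is the total number of attribute=value literals. The toy data, attribute names, and function names (`ELEMENTS`, `shortest_description`, `compression_ratio`) are all hypothetical.

```python
from itertools import combinations

# Hypothetical toy data: (state, week) elements with made-up attributes.
ELEMENTS = {
    ("VA", 1): {"region": "South", "activity": "high"},
    ("NC", 1): {"region": "South", "activity": "high"},
    ("GA", 1): {"region": "South", "activity": "low"},
    ("NY", 1): {"region": "Northeast", "activity": "high"},
    ("MA", 1): {"region": "Northeast", "activity": "low"},
}


def candidate_clauses(elements):
    """Enumerate positive clauses: conjunctions of one or two literals."""
    literals = sorted({(a, v) for attrs in elements.values() for a, v in attrs.items()})
    clauses = [frozenset([lit]) for lit in literals]
    for l1, l2 in combinations(literals, 2):
        if l1[0] != l2[0]:  # a clause uses each attribute at most once
            clauses.append(frozenset([l1, l2]))
    return clauses


def covered(clause, elements):
    """Return the elements that satisfy every literal in the clause."""
    return frozenset(
        e for e, attrs in elements.items()
        if all(attrs.get(a) == v for a, v in clause)
    )


def shortest_description(target, elements, max_error=0, max_clauses=3):
    """Brute-force stand-in for the integer program (assumption).

    Searches unions of up to max_clauses clauses and returns
    (length, clauses) with the fewest literals such that the described
    set differs from the target in at most max_error elements.
    Returns None if no such description exists.
    """
    target = frozenset(target)
    cover = {c: covered(c, elements) for c in candidate_clauses(elements)}
    cover = {c: s for c, s in cover.items() if s}  # drop empty clauses
    best = None
    for k in range(1, max_clauses + 1):
        for combo in combinations(cover, k):
            described = frozenset().union(*(cover[c] for c in combo))
            if len(described ^ target) > max_error:
                continue
            length = sum(len(c) for c in combo)
            if best is None or length < best[0]:
                best = (length, combo)
    return best


def compression_ratio(target, description_length):
    """Size of the target set divided by its description length."""
    return len(target) / description_length


high_week1 = {("VA", 1), ("NC", 1), ("NY", 1)}
south_high = {("VA", 1), ("NC", 1)}

exact_high = shortest_description(high_week1, ELEMENTS)
exact_south = shortest_description(south_high, ELEMENTS)
relaxed_south = shortest_description(south_high, ELEMENTS, max_error=1)

for name, tgt, res in [
    ("high activity", high_week1, exact_high),
    ("South and high (exact)", south_high, exact_south),
    ("South and high (1 error allowed)", south_high, relaxed_south),
]:
    print(name, "length:", res[0], "ratio:", compression_ratio(tgt, res[0]))
```

In this toy setting, allowing one misclassified element shortens the description of the second set, which mirrors the role of relaxed descriptions in the text. Ranking many candidate sets would amount to sorting them by the value returned from `compression_ratio`.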
We applied our minimum description length methods to find spatio-temporal patterns in the spread of seasonal influenza in the United States, using state-level influenza-like illness activity indicator data from the CDC. We observed that the compression ratios were over 2.5 for 50% of the chosen sets when approximate descriptions and negative clauses were allowed. Sets with high compression ratios (eg, over 2.5) corresponded to interesting patterns in the spatio-temporal dynamics of influenza-like illness. Our approach also outperformed description by solution in terms of the compression ratio.
Our approach, which is an unsupervised machine learning method, can provide new insights into patterns and trends in disease spread in an automated manner. Our results show that description complexity is an effective way to characterize sets of interest, and that the approach can be easily extended to other diseases and regions beyond influenza in the United States. Our approach can also be easily adapted for automated generation of narratives.