IEEE Trans Vis Comput Graph. 2020 Jan;26(1):429-439. doi: 10.1109/TVCG.2019.2934209. Epub 2019 Aug 20.
The collection of large, complex datasets has become common across a wide variety of domains. Visual analytics tools increasingly play a key role in exploring and answering complex questions about these large datasets. However, many visualizations are not designed to concurrently visualize the large number of dimensions present in complex datasets (e.g., tens of thousands of distinct codes in an electronic health record system). This limitation, combined with the ability of many visual analytics systems to enable rapid, ad hoc specification of groups, or cohorts, of individuals based on a small subset of visualized dimensions, creates the possibility of introducing selection bias: when the user creates a cohort based on a specified set of dimensions, differences across many other unseen dimensions may also be introduced. These unintended side effects may result in the cohort no longer being representative of the larger population intended to be studied, which can negatively affect the validity of subsequent analyses. We present techniques for selection bias tracking and visualization that can be incorporated into high-dimensional exploratory visual analytics systems, with a focus on medical data with existing data hierarchies. These techniques include: (1) tree-based cohort provenance and visualization, including a user-specified baseline cohort against which all other cohorts are compared, and visual encoding of cohort "drift", which indicates where selection bias may have occurred, and (2) a set of visualizations, including a novel icicle-plot-based visualization, to compare in detail the per-dimension differences between the baseline and a user-specified focus cohort. These techniques are integrated into a medical temporal event sequence visual analytics tool. We present example use cases and report findings from domain expert user interviews.
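As a rough illustration of the per-dimension comparison between a baseline and a focus cohort, the sketch below scores how far each dimension's value distribution has shifted after a filter is applied. The abstract does not say which measure the authors use; the Hellinger distance, the function names, and the toy records here are all assumptions chosen for illustration only.

```python
from collections import Counter
from math import sqrt

def distribution(cohort, dim):
    """Relative frequency of each value of `dim` in a cohort (a list of dicts)."""
    counts = Counter(person[dim] for person in cohort)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}

def hellinger(p, q):
    """Hellinger distance between two discrete distributions.

    0 means identical distributions; 1 means they share no values.
    (Assumed measure; the abstract does not name one.)
    """
    values = set(p) | set(q)
    squared = sum((sqrt(p.get(v, 0.0)) - sqrt(q.get(v, 0.0))) ** 2 for v in values)
    return sqrt(squared / 2)

def drift_by_dimension(baseline, focus, dims):
    """Per-dimension shift of the focus cohort relative to the baseline cohort."""
    return {
        d: hellinger(distribution(baseline, d), distribution(focus, d))
        for d in dims
    }

# Hypothetical toy records, not data from the paper.
baseline = [
    {"sex": "F", "diabetes": True},
    {"sex": "F", "diabetes": True},
    {"sex": "M", "diabetes": True},
    {"sex": "M", "diabetes": False},
]

# The user filters on one visible dimension only...
focus = [person for person in baseline if person["diabetes"]]

# ...yet the distribution of an unfiltered dimension ("sex") shifts as well.
drift = drift_by_dimension(baseline, focus, ["sex", "diabetes"])
```

In this toy case the filter on `diabetes` also changes the sex balance of the cohort from 2:2 to 2:1, so `drift["sex"]` is non-zero even though the user never selected on that dimension. That is the kind of unseen side effect the abstract describes.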