Wootton Dylan, Fox Amy Rae, Peck Evan, Satyanarayan Arvind
IEEE Trans Vis Comput Graph. 2025 Jan;31(1):1191-1201. doi: 10.1109/TVCG.2024.3456217. Epub 2024 Nov 28.
Interactive visualizations are powerful tools for Exploratory Data Analysis (EDA), but how do they affect the observations analysts make about their data? We conducted a qualitative experiment in which 13 professional data scientists analyzed two datasets in Jupyter notebooks, and we collected a rich dataset of interaction traces and think-aloud utterances. By qualitatively coding participant utterances, we introduce a formalism that describes EDA as a sequence of analysis states, where each state consists of either a representation an analyst constructs (e.g., the output of a data frame, an interactive visualization, etc.) or an observation the analyst makes (e.g., about missing data, the relationship between variables, etc.). By applying our formalism to our dataset, we find that interactive visualizations, on average, lead to earlier and more complex insights about relationships between dataset attributes than static visualizations do. Moreover, by calculating metrics such as revisit count and representational diversity, we find that some representations serve more as "planning aids" during EDA than as tools strictly for hypothesis-answering. We show how these measures help identify other patterns of analysis behavior, such as the "80-20 rule", where a small subset of representations drove the majority of observations. Based on these findings, we offer design guidelines for interactive exploratory analysis tooling and reflect on future directions for studying the role that visualizations play in EDA.
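To make the formalism concrete, the following is a minimal Python sketch of how a trace of analysis states might be encoded and measured. The abstract does not give the exact definitions of revisit count or representational diversity, so the `State` class, its fields, the three metric functions, and the example trace are all hypothetical illustrations under stated assumptions, not the authors' implementation.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class State:
    """One analysis state (hypothetical encoding, not the paper's schema)."""
    kind: str                       # "representation" or "observation"
    rep_id: str                     # representation constructed, or observed from
    rep_type: Optional[str] = None  # e.g. "dataframe"; set for representations only
    note: str = ""                  # free-text description of an observation


def revisit_counts(trace):
    """Assumed definition: a revisit is a representation that reappears
    after the analyst has moved on to a different representation."""
    counts, seen, last = Counter(), set(), None
    for s in trace:
        if s.kind != "representation":
            continue
        if s.rep_id in seen and s.rep_id != last:
            counts[s.rep_id] += 1
        seen.add(s.rep_id)
        last = s.rep_id
    return counts


def representational_diversity(trace):
    """Assumed definition: number of distinct representation types used."""
    return len({s.rep_type for s in trace if s.kind == "representation"})


def top_share(trace, coverage=0.8):
    """Smallest fraction of representations accounting for `coverage`
    of all observations (one way to probe an "80-20" pattern)."""
    obs = Counter(s.rep_id for s in trace if s.kind == "observation")
    reps = {s.rep_id for s in trace if s.kind == "representation"}
    total = sum(obs.values())
    if total == 0 or not reps:
        return 0.0
    covered = 0
    for n, (_, count) in enumerate(obs.most_common(), start=1):
        covered += count
        if covered >= coverage * total:
            return n / len(reps)
    return 1.0


# Made-up trace for illustration only; it is not data from the study.
trace = [
    State("representation", "df_head", "dataframe"),
    State("observation", "df_head", note="missing values in one column"),
    State("representation", "scatter", "interactive_chart"),
    *([State("observation", "scatter", note="relationship between variables")] * 5),
    State("representation", "hist", "static_chart"),
    State("representation", "df_head", "dataframe"),
    State("representation", "scatter", "interactive_chart"),
]

print(revisit_counts(trace))
print(representational_diversity(trace))
print(top_share(trace))
```

In this toy trace, a representation such as `df_head` that is revisited but yields few observations would be a candidate "planning aid" in the abstract's sense, while `scatter` alone accounts for most of the observations.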