Domanska Diana, Vodák Daniel, Lund-Andersen Christin, Salvatore Stefania, Hovig Eivind, Sandve Geir Kjetil
Department of Informatics, University of Oslo, Oslo, Norway.
Department of Tumor Biology, Institute for Cancer Research, Oslo University Hospital, Oslo, Norway.
BMC Bioinformatics. 2017 May 18;18(1):264. doi: 10.1186/s12859-017-1679-8.
A visualization referred to as the rainfall plot has recently gained popularity in genome data analysis. The plot is mostly used to illustrate the distribution of somatic cancer mutations along a reference genome, typically with the aim of identifying mutation hotspots. In general terms, the rainfall plot is a scatter plot showing the location of events on the x-axis versus the distance between consecutive events on the y-axis. Despite its frequent use, the motivation for applying this particular visualization and the appropriateness of its usage have never been critically addressed in detail.
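The construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `rainfall_coordinates` is hypothetical, and the choice to pair each event with the distance to its preceding event (dropping the first event, which has no predecessor) is an assumption, since other conventions exist.

```python
def rainfall_coordinates(positions):
    """Return (x, y) pairs for a rainfall plot of events on one chromosome.

    x is the genomic position of an event; y is the distance to the
    preceding event. The first event has no predecessor and is skipped
    (an assumption of this sketch).
    """
    ordered = sorted(positions)
    return [(cur, cur - prev) for prev, cur in zip(ordered, ordered[1:])]


# Hypothetical event positions: three closely spaced events between two isolated ones.
events = [100, 5000, 5010, 5025, 90000]
print(rainfall_coordinates(events))
# → [(5000, 4900), (5010, 10), (5025, 15), (90000, 84975)]
```

The closely spaced events produce small y values (10 and 15), so on a plot they would fall well below the isolated events, which is what makes a candidate hotspot stand out.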
We show that the rainfall plot allows visual detection of events even when they occur at high frequency over very short distances. In addition, event clustering at multiple scales may be detected as distinct horizontal bands in rainfall plots. At the same time, due to the limited size of standard figures, rainfall plots might suffer from an inability to distinguish overlapping events, especially when multiple datasets are plotted in the same figure. We demonstrate the consequences of plot congestion, which results in obscured visual data interpretations.
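The congestion effect can be illustrated by counting how many distinct pixel cells a set of points would occupy. The sketch below is an assumption-laden illustration rather than anything taken from the paper: the function name `pixel_bins`, the 800 by 400 pixel canvas, the 100 Mb axis ranges, and the base-10 logarithmic y-axis are all hypothetical choices.

```python
import math


def pixel_bins(points, x_max, y_max, width_px=800, height_px=400):
    """Map (position, distance) points to pixel cells, using a log10 y-axis.

    Returns the set of distinct (column, row) cells that are occupied.
    Canvas size and axis ranges are illustrative assumptions.
    """
    cells = set()
    log_y_max = math.log10(y_max)
    for x, y in points:
        col = min(int(x / x_max * width_px), width_px - 1)
        # Distances below 1 are clamped to 1 so the logarithm is defined.
        row = min(int(math.log10(max(y, 1)) / log_y_max * height_px), height_px - 1)
        cells.add((col, row))
    return cells


# Hypothetical cluster: 999 events spaced 10 bp apart, plotted on a 100 Mb axis.
cluster = [(1_000_000 + 10 * i, 10) for i in range(1, 1000)]
occupied = pixel_bins(cluster, x_max=100_000_000, y_max=100_000_000)
print(len(cluster), "points occupy", len(occupied), "pixel cell(s)")
```

Under these assumptions the whole cluster collapses into a single cell, so the plot alone cannot convey whether that spot holds one event or hundreds.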
This work provides the first comprehensive survey of the characteristics and proper usage of rainfall plots. We find that the rainfall plot is able to convey a large amount of information without any need for parameterization or tuning. However, we also demonstrate how plot congestion and the use of a logarithmic y-axis may result in obscured visual data interpretations. To aid the productive utilization of rainfall plots, we demonstrate their characteristics and potential pitfalls using both simulated and real data, and provide a set of practical guidelines for their proper interpretation and usage.