Genome Informatics Section, Center for Genomics and Data Science Research, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, United States.
Department of Computer Science, Johns Hopkins University, Baltimore, MD 21211, United States.
Bioinformatics. 2024 Aug 2;40(8). doi: 10.1093/bioinformatics/btae493.
A common method for analyzing genomic repeats is to produce a sequence similarity matrix visualized via a dot plot. Innovative approaches such as StainedGlass have improved upon this classic visualization by rendering dot plots as a heatmap of sequence identity, enabling researchers to better visualize multi-megabase tandem repeat arrays within centromeres and other heterochromatic regions of the genome. However, computing the similarity estimates for these heatmaps is computationally expensive and can come at the cost of reduced accuracy.
In this work, we introduce ModDotPlot, an interactive and alignment-free dot plot viewer. By approximating average nucleotide identity via a k-mer-based containment index, ModDotPlot produces accurate plots orders of magnitude faster than StainedGlass. We accomplish this through the use of a hierarchical modimizer scheme that can visualize the full 128 Mb genome of Arabidopsis thaliana in under 5 min on a laptop. ModDotPlot is bundled with a graphical user interface supporting real-time interactive navigation of entire chromosomes.
ModDotPlot is available at https://github.com/marbl/ModDotPlot.
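The k-mer containment idea described in the abstract can be illustrated with a minimal sketch. This is not ModDotPlot's implementation: the function names, the choice of hash, and the parameters `k`, `s`, and `window` are all assumptions made for illustration. It assumes the commonly used relation in which the containment index C of k-mer sets is converted to an identity estimate as C^(1/k), and it takes a modimizer to be a k-mer whose hash is divisible by a sparsity value `s`. It uses a single sparsity level rather than a hierarchy and ignores reverse complements.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all k-mers in seq (forward strand only)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def stable_hash(kmer):
    """Deterministic 64-bit hash of a k-mer (hash choice is an assumption)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def modimizers(seq, k, s):
    """Keep only k-mers whose hash is 0 mod s; s=1 keeps every k-mer."""
    return {km for km in kmers(seq, k) if stable_hash(km) % s == 0}


def containment_identity(a, b, k, s=1):
    """Estimate identity of a against b from the containment index.

    C = |A intersect B| / |A|, identity is approximated as C ** (1 / k).
    """
    set_a = modimizers(a, k, s)
    set_b = modimizers(b, k, s)
    if not set_a:
        return 0.0
    containment = len(set_a & set_b) / len(set_a)
    return containment ** (1.0 / k)


def identity_matrix(seq, window, k, s=1):
    """Split seq into non-overlapping windows and compare all pairs."""
    wins = [seq[i:i + window] for i in range(0, len(seq) - window + 1, window)]
    return [[containment_identity(a, b, k, s) for b in wins] for a in wins]


if __name__ == "__main__":
    # Toy sequence: two unrelated 8-base windows.
    for row in identity_matrix("ACGTACGT" + "TTTTTTTT", window=8, k=4):
        print(row)
```

Each cell of the resulting matrix corresponds to one pixel of a heatmap-style dot plot. Raising `s` shrinks the k-mer sets that must be intersected, which is where the speed-up of a modimizer-based approach would come from in a sketch like this, at the cost of a noisier estimate for short windows.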