Department of Statistics, University of Chicago, Chicago, 60637, USA.
Department of Human Genetics, University of Chicago, Chicago, 60637, USA.
BMC Bioinformatics. 2018 Dec 10;19(1):473. doi: 10.1186/s12859-018-2489-3.
Sequence logo plots have become a standard graphical tool for visualizing sequence motifs in DNA, RNA or protein sequences. However, standard logo plots primarily highlight enrichment of symbols, and may fail to highlight interesting depletions. Current alternatives that try to highlight depletion often produce visually cluttered logos.
We introduce a new sequence logo plot, the EDLogo plot, that highlights both enrichment and depletion while minimizing visual clutter. We provide an easy-to-use and highly customizable R package, Logolas, to produce a range of logo plots, including EDLogo plots. The software also allows elements in the logo plot to be strings of characters rather than single characters, extending the range of applications beyond the usual DNA, RNA or protein sequences. In addition, the software includes new Empirical Bayes methods to stabilize estimates of enrichment and depletion, and thus better highlight the most significant patterns in data. We illustrate our methods and software on applications to transcription factor binding site motifs, protein sequence alignments and cancer mutation signature profiles.
Our new EDLogo plots and flexible software implementation can help data analysts visualize both enrichment and depletion of characters (DNA sequence bases, amino acids, etc.) across a wide range of applications.
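To make the idea of plotting enrichment and depletion together more concrete, the following is a minimal Python sketch of one plausible way to score symbols at each position: take the log ratio of the observed frequency to a background frequency, then center each position at its median so that typical symbols score near zero. The abstract does not give the scoring formula, so the function name `ed_scores`, the median centering, and the fixed pseudocount are all assumptions made for illustration. This is not the Logolas API, and the pseudocount is only a crude stand-in for the Empirical Bayes stabilization mentioned above.

```python
import math
from statistics import median


def ed_scores(freqs, background, pseudo=1e-3):
    """Return per-position enrichment/depletion scores (hypothetical helper).

    freqs: list of dicts mapping symbol -> observed frequency at a position.
    background: dict mapping symbol -> background frequency.
    pseudo: small constant that avoids log(0); an assumption of this sketch.
    """
    scores = []
    for column in freqs:
        # Log ratio of observed to background frequency for every symbol.
        log_ratios = {
            s: math.log2((column.get(s, 0.0) + pseudo) / (background[s] + pseudo))
            for s in background
        }
        # Center at the median so enriched symbols are positive and
        # depleted symbols are negative.
        center = median(log_ratios.values())
        scores.append({s: r - center for s, r in log_ratios.items()})
    return scores


# Toy example: position 1 is enriched for A, position 2 is depleted for G.
pwm = [
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    {"A": 0.32, "C": 0.32, "G": 0.04, "T": 0.32},
]
uniform = {s: 0.25 for s in "ACGT"}
scores = ed_scores(pwm, uniform)
```

In this toy input, A receives a positive score at the first position and G receives a negative score at the second, while the remaining symbols sit at zero. A plot built from such scores would draw positive symbols above the axis and negative ones below it.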