Human Biology Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA, Molecular and Cellular Biology Program, University of Washington, Seattle, Washington, 98105, USA, Clinical Research Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA, Department of Pediatrics, School of Medicine, Department of Neurology, School of Medicine, University of Washington, Seattle, Washington, 98105, USA, Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA, Department of Computer Science and Engineering, Department of Genome Sciences, University of Washington, Seattle, Washington, 98105, USA and Bioinformatics and Computational Biology, Genentech, South San Francisco, CA 94080, USA.
Bioinformatics. 2014 Mar 15;30(6):775-83. doi: 10.1093/bioinformatics/btt615. Epub 2013 Oct 25.
High-throughput ChIP-seq studies typically identify thousands of peaks for a single transcription factor (TF). It is common for traditional motif discovery tools to predict motifs that are statistically significant against a naïve background distribution but are of questionable biological relevance.
We describe a simple yet effective algorithm for discovering differential motifs between two sequence datasets that eliminates systematic biases and scales to large datasets. Tested on 207 ENCODE ChIP-seq datasets, our method identifies correct motifs in 78% of the datasets with known motifs, demonstrating improvement in both accuracy and efficiency compared with DREME, another state-of-the-art discriminative motif discovery tool. More interestingly, on the remaining, more challenging datasets, we identify common technical or biological factors that compromise the motif search results and use advanced features of our tool to control for these factors. We also present case studies demonstrating the ability of our method to detect single-base-pair differences in the DNA specificity of two similar TFs. Lastly, we demonstrate discovery of key TF motifs involved in tissue specification by examining high-throughput DNase accessibility data.
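To make the idea of a differential (discriminative) motif search concrete, the following is a minimal Python sketch, not the motifRG implementation. It assumes a deliberately simple scoring scheme: each k-mer is scored by a two-proportion z-statistic comparing the fraction of foreground sequences that contain it with the fraction of background sequences that contain it. The function names `kmer_presence` and `differential_kmers` are hypothetical, and the sketch ignores reverse complements, degenerate positions, and motif refinement.

```python
from collections import Counter
from math import sqrt


def kmer_presence(seqs, k):
    """Count, for each k-mer, how many sequences contain it at least once."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        # A set ensures each sequence contributes at most once per k-mer.
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts


def differential_kmers(foreground, background, k=6, top=5):
    """Rank k-mers by enrichment in foreground relative to background.

    Illustrative scoring only: a two-proportion z-statistic on the
    fraction of sequences containing each k-mer.
    """
    fg_counts = kmer_presence(foreground, k)
    bg_counts = kmer_presence(background, k)
    n_fg, n_bg = len(foreground), len(background)

    scored = []
    for kmer, c_fg in fg_counts.items():
        c_bg = bg_counts.get(kmer, 0)
        pooled = (c_fg + c_bg) / (n_fg + n_bg)
        se = sqrt(pooled * (1 - pooled) * (1 / n_fg + 1 / n_bg))
        if se == 0:
            # The k-mer occurs in every sequence of both sets: uninformative.
            continue
        z = (c_fg / n_fg - c_bg / n_bg) / se
        scored.append((z, kmer))

    scored.sort(reverse=True)
    return [(kmer, z) for z, kmer in scored[:top]]
```

Because the background is an explicit second dataset rather than a naïve base-composition model, k-mers that are common to both sets (for example, sequence biases shared by all peaks) score near zero in this sketch, which is the general behaviour a discriminative search aims for.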
The motifRG package is publicly available via the Bioconductor repository.
Supplementary data are available at Bioinformatics online.