Karnik Rahul, Beer Michael A
Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, United States of America.
Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, United States of America; McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University, Baltimore, MD, United States of America.
PLoS One. 2015 Oct 14;10(10):e0140557. doi: 10.1371/journal.pone.0140557. eCollection 2015.
The generation of genomic binding or accessibility data from massively parallel sequencing technologies such as ChIP-seq and DNase-seq continues to accelerate. Yet state-of-the-art computational approaches for the identification of DNA binding motifs often yield motifs of weak predictive power. Here we present a novel computational algorithm called MotifSpec, designed to find predictive motifs, in contrast to over-represented sequence elements. The key distinguishing feature of this algorithm is that it uses a dynamic search space and a learned threshold to find discriminative motifs, and that it models motifs with a full position weight matrix (PWM) rather than k-mer words or regular expressions. We demonstrate that our approach finds motifs corresponding to known binding specificities in several mammalian ChIP-seq datasets, and that our PWMs classify the ChIP-seq signals with accuracy comparable to, or marginally better than, motifs from the best existing algorithms. In other datasets, our algorithm identifies novel motifs where other methods fail. Finally, we apply this algorithm to detect motifs from expression datasets in C. elegans using a dynamic expression similarity metric rather than fixed expression clusters, and find novel predictive motifs.
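To illustrate what scoring sequences with a full PWM and a learned threshold can look like, the following is a minimal sketch and not the MotifSpec implementation. The function names, the log-odds scoring with a uniform background, the forward-strand-only scan, and the choice of classification accuracy as the objective are all assumptions made for illustration; the abstract does not specify these details.

```python
import math

BASES = "ACGT"


def log_odds_pwm(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into a log-odds PWM.

    `counts` is a list of dicts, one per motif position, mapping each
    base to its observed count. A uniform background is assumed.
    """
    pwm = []
    for column in counts:
        total = sum(column[b] for b in BASES) + 4 * pseudocount
        pwm.append({
            b: math.log2(((column[b] + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm


def best_site_score(pwm, seq):
    """Return the highest PWM score over all windows of `seq`.

    Only the forward strand is scanned, and `seq` is assumed to
    contain only the characters A, C, G and T.
    """
    width = len(pwm)
    scores = [
        sum(pwm[i][seq[start + i]] for i in range(width))
        for start in range(len(seq) - width + 1)
    ]
    return max(scores) if scores else float("-inf")


def learn_threshold(pwm, positives, negatives):
    """Pick the score cutoff that maximizes classification accuracy.

    A sequence is called "bound" when its best site score is at or
    above the cutoff. Accuracy is a stand-in objective (an assumption).
    """
    pos = [best_site_score(pwm, s) for s in positives]
    neg = [best_site_score(pwm, s) for s in negatives]
    best_t, best_acc = None, -1.0
    for t in sorted(set(pos + neg)):
        correct = sum(p >= t for p in pos) + sum(n < t for n in neg)
        acc = correct / (len(pos) + len(neg))
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc


# Toy data: a hypothetical 4-bp motif "TGAC" and made-up sequences.
toy_counts = [
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 10, "G": 0, "T": 0},
]
toy_pwm = log_odds_pwm(toy_counts)
toy_positives = ["AATGACAA", "TGACGGGG"]
toy_negatives = ["AAAAAAAA", "CCCCGGGG"]
threshold, accuracy = learn_threshold(toy_pwm, toy_positives, toy_negatives)
print(f"threshold={threshold:.2f} accuracy={accuracy:.2f}")
```

In this sketch the threshold is tuned for one fixed PWM; a discriminative motif finder would typically also adjust the PWM itself, which is outside the scope of this example.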