Heller David, Krestel Ralf, Ohler Uwe, Vingron Martin, Marsico Annalisa
Max Planck Institute for Molecular Genetics, Ihnestr. 63-73 14195 Berlin, Germany.
Hasso Plattner Institute, Prof.-Dr.-Helmert-Str. 2-3 14482 Potsdam, Germany.
Nucleic Acids Res. 2017 Nov 2;45(19):11004-11018. doi: 10.1093/nar/gkx756.
RNA-binding proteins (RBPs) play an important role in post-transcriptional RNA regulation and recognize target RNAs via sequence-structure motifs. The extent to which RNA structure influences protein binding in the presence or absence of a sequence motif is still poorly understood. Existing RNA motif finders either take the structure of the RNA only partially into account or employ models that are not directly interpretable as sequence-structure motifs. We developed ssHMM, an RNA motif finder based on a hidden Markov model (HMM) and Gibbs sampling that fully captures the relationship between RNA sequence and the secondary structure preference of a given RBP. Unlike previous methods, which output separate logos for sequence and structure, it directly produces a combined sequence-structure motif when trained on a large set of sequences. ssHMM's model is visualized intuitively as a graph, which facilitates biological interpretation. ssHMM can be used to find novel bona fide sequence-structure motifs of uncharacterized RBPs, such as the one presented here for the YY1 protein. ssHMM reaches a high motif recovery rate on synthetic data, recovers known RBP motifs from CLIP-Seq data, and scales linearly with input size; on large datasets it is considerably faster than MEMERIS and RNAcontext and on par with GraphProt. It is freely available on GitHub and as a Docker image.
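To make the idea of a combined sequence-structure motif more concrete, the following is a minimal sketch of Gibbs sampling over a joint alphabet in which every symbol is a (nucleotide, structural context) pair. It is not ssHMM's implementation: it swaps the HMM for a plain position-specific profile, and the function names, the single-letter context labels (`S` for stem, `H` for hairpin loop) and the toy records are all assumptions made for illustration.

```python
import random


def joint_symbols(seq, struct):
    """Pair each nucleotide with its structural context, e.g. ('G', 'H')."""
    if len(seq) != len(struct):
        raise ValueError("sequence and structure must have equal length")
    return list(zip(seq, struct))


def build_profile(windows, width, alphabet, pseudocount=1.0):
    """Per-position probabilities of joint symbols, smoothed by pseudocounts."""
    profile = []
    for pos in range(width):
        counts = {sym: pseudocount for sym in alphabet}
        for window in windows:
            counts[window[pos]] += 1
        total = sum(counts.values())
        profile.append({sym: c / total for sym, c in counts.items()})
    return profile


def window_score(window, profile):
    """Probability of one window under the position-specific profile."""
    score = 1.0
    for pos, sym in enumerate(window):
        score *= profile[pos][sym]
    return score


def gibbs_motif_search(records, width, iterations=200, seed=0):
    """Sample one motif start per record (simplified site sampler, no HMM)."""
    rng = random.Random(seed)
    data = [joint_symbols(seq, struct) for seq, struct in records]
    alphabet = sorted({sym for rec in data for sym in rec})
    starts = [rng.randrange(len(rec) - width + 1) for rec in data]
    for _ in range(iterations):
        for i, rec in enumerate(data):
            # Build the profile from every record except the held-out one.
            others = [data[j][starts[j]:starts[j] + width]
                      for j in range(len(data)) if j != i]
            profile = build_profile(others, width, alphabet)
            # Resample the held-out start in proportion to its window score.
            weights = [window_score(rec[k:k + width], profile)
                       for k in range(len(rec) - width + 1)]
            starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts


# Hypothetical toy input: (sequence, structural context per nucleotide).
example_records = [
    ("ACGUUGCAUG", "SSSHHHHSSS"),
    ("GGUUGCACCA", "SSHHHHSSSS"),
    ("CAUUGCAGGU", "SSHHHHSSSS"),
]
example_starts = gibbs_motif_search(example_records, width=4)
```

Because each column of the profile is a distribution over joint symbols, a single model expresses which nucleotide is preferred in which structural context, rather than two separate logos. The sampler is stochastic, so the returned start positions depend on the seed and are not guaranteed to converge on a shared motif.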