RNA Biology and Plasticity Group, Garvan Institute of Medical Research, 384 Victoria Street, Sydney, NSW 2010, Australia.
St Vincent's Clinical School, Faculty of Medicine, UNSW Australia, Sydney, NSW 2010, Australia.
Genome Biol. 2017 Dec 28;18(1):244. doi: 10.1186/s13059-017-1371-3.
The diversity of processed transcripts in eukaryotic genomes poses a challenge for the classification of their biological functions. Sparse sequence conservation in non-coding sequences and the unreliable nature of RNA structure predictions further exacerbate this conundrum. Here, we describe a computational method, DotAligner, for the unsupervised discovery and classification of homologous RNA structure motifs from a set of sequences of interest. Our approach outperforms comparable algorithms at clustering known RNA structure families, in both speed and accuracy. It identifies clusters of known and novel structure motifs from ENCODE immunoprecipitation data for 44 RNA-binding proteins.