Division of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, Arkansas, United States of America.
PLoS One. 2012;7(9):e44878. doi: 10.1371/journal.pone.0044878. Epub 2012 Sep 13.
Among thousands of long non-coding RNAs (lncRNAs), only a small subset has been functionally characterized, and functional annotation of lncRNAs on the genomic scale remains inadequate. In this study, we computationally characterized two functionally different parts of the human lncRNA transcriptome based on their ability to bind the polycomb repressive complex 2 (PRC2). This classification is possible because, although lncRNAs as a whole constitute a diverse set of sequences, the PRC2-binding and PRC2 non-binding classes each possess characteristic combinations of sequence-structure patterns and can therefore be separated within the feature space. Based on a specific combination of features, we built several machine-learning classifiers and identified the support vector machine (SVM)-based classifier as the best performing. We further showed that the SVM-based classifier generalizes to independent data sets. This classifier, trained on human lncRNAs, can predict up to 59.4% of PRC2-binding lncRNAs in mice. This suggests that, despite the low degree of sequence conservation, many lncRNAs play functionally conserved biological roles.
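The workflow summarized in the abstract (encode each lncRNA as a feature vector, train an SVM, then report the fraction of known PRC2-binding transcripts that are recovered) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' method: the abstract does not list the features, so dinucleotide frequencies stand in for the sequence-structure patterns; the classifier is a plain linear SVM trained by subgradient descent on the hinge loss; the sequences are synthetic toy strings, not real lncRNAs; and the function names are hypothetical.

```python
from itertools import product


def kmer_frequencies(seq, k=2, alphabet="ACGU"):
    """Normalized k-mer frequencies in a fixed order (assumed stand-in features)."""
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows with ambiguous bases such as N
            counts[window] += 1
    total = sum(counts.values())
    return [counts[m] / total if total else 0.0 for m in kmers]


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Linear SVM: subgradient descent on the L2-regularized hinge loss.

    Labels must be +1 (binding) or -1 (non-binding).
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:
                # Sample violates the margin: regularizer plus hinge subgradient.
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Margin satisfied: only the regularizer shrinks the weights.
                w = [wj - lr * lam * wj for wj in w]
    return w, b


def predict(w, b, x):
    """Return +1 or -1 from the sign of the decision function."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


def recall(y_true, y_pred):
    """Fraction of true positives (+1) that are predicted as positive."""
    hits = [p == 1 for t, p in zip(y_true, y_pred) if t == 1]
    return sum(hits) / len(hits) if hits else 0.0


# Synthetic toy data (NOT real lncRNAs): GC-rich strings play the role of
# the positive class, AU-rich strings the negative class.
train_seqs = ["GCGCGGCCGCGCCGGC", "CCGGCGCGGCGCCG",
              "AUAUAAUUAUAUUAAU", "UUAAUAUAAUAUUA"]
train_labels = [1, 1, -1, -1]

held_out_seqs = ["GGCGCGCCGCGG", "CGCGGCCGCG", "AUUAUAAUAUUA", "UAUAAUUAUA"]
held_out_labels = [1, 1, -1, -1]

w, b = train_linear_svm([kmer_frequencies(s) for s in train_seqs], train_labels)
held_out_pred = [predict(w, b, kmer_frequencies(s)) for s in held_out_seqs]
held_out_recall = recall(held_out_labels, held_out_pred)
print("held-out predictions:", held_out_pred)
print("recall on held-out positives:", held_out_recall)
```

In this reading, the 59.4% reported for mouse corresponds to a recall-type measure: the share of known PRC2-binding mouse lncRNAs that the human-trained classifier labels as binders. In practice an established SVM implementation with a tuned kernel and a richer feature set would replace the toy trainer shown here.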