School of Mathematics and Physics, University of Science and Technology Beijing, Beijing, PR China.
PLoS One. 2012;7(3):e32797. doi: 10.1371/journal.pone.0032797. Epub 2012 Mar 15.
More than 1200 miRNA loci have been annotated in the human genome to date. The specific features that define a miRNA precursor and determine its recognition and subsequent processing have not yet been exhaustively described, so miRNA loci cannot be computationally identified with sufficient confidence.
We rendered pre-miRNA and non-pre-miRNA hairpins as strings of integrated sequence-structure information, and used the software Teiresias to identify sequence-structure motifs (ss-motifs) of variable length in these data sets. A Support Vector Machine (SVM) algorithm for pre-miRNA identification that used only ss-motifs as features achieved 99.2% specificity and 97.6% sensitivity on a human test data set, which is comparable to previously published algorithms employing combinations of sequence-structure and additional features. Further analysis of the information content of the ss-motifs revealed strongly significant deviations from that of the respective training sets, providing important potential clues as to how the sequence and structural information of RNA hairpins is utilized by the miRNA processing apparatus.
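As a rough illustration of the idea described above, the following Python sketch merges a sequence and its dot-bracket structure into a single string, turns a list of motifs into a binary feature vector, and computes sensitivity and specificity. The encoding used here (uppercase for paired bases, lowercase for unpaired ones) is an assumption made for illustration and is not necessarily the encoding used in the study. The function names, the example hairpin, and the motif list are likewise hypothetical; in the study the motifs were discovered with Teiresias, which is not reimplemented here, and the resulting vectors would be passed to an SVM.

```python
def encode_hairpin(sequence: str, structure: str) -> str:
    """Merge a nucleotide sequence and its dot-bracket structure into one string.

    Assumed encoding (illustrative only): paired bases become uppercase,
    unpaired bases become lowercase.
    """
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    encoded = []
    for base, state in zip(sequence.upper(), structure):
        if state in "()":
            encoded.append(base)          # base is part of a stem
        elif state == ".":
            encoded.append(base.lower())  # base is in a loop or bulge
        else:
            raise ValueError(f"unexpected structure symbol: {state!r}")
    return "".join(encoded)


def motif_features(encoded: str, motifs: list[str]) -> list[int]:
    """Return 1 for each motif that occurs in the encoded hairpin, else 0."""
    return [1 if motif in encoded else 0 for motif in motifs]


def sensitivity_specificity(y_true: list[int], y_pred: list[int]) -> tuple[float, float]:
    """Compute (sensitivity, specificity) for binary labels (1 = pre-miRNA).

    Assumes at least one positive and one negative example in y_true.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy hairpin and motif list, for demonstration only.
encoded = encode_hairpin("GGAUACUUCGGUAUCC", "((((((....))))))")
features = motif_features(encoded, ["ACuu", "cgGU", "aaaa"])
sens, spec = sensitivity_specificity([1, 1, 1, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0, 1])
```

Because each character carries both the base identity and its pairing state, a plain substring match is enough to test for a motif that spans a stem-loop boundary.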
Integrated sequence-structure motifs of variable length apparently capture nearly all information required to distinguish miRNA precursors from other stem-loop structures.