Xu Jian-Hua, Li Fei, Sun Qiu-Feng
Department of Computer Science, Nanjing Normal University, Nanjing 210097, China.
Genomics Proteomics Bioinformatics. 2008 Jun;6(2):121-8. doi: 10.1016/S1672-0229(08)60027-3.
MicroRNAs (miRNAs) are a family of short (21-23 nt) regulatory non-coding RNAs processed from longer (70-110 nt) miRNA precursors (pre-miRNAs). Distinguishing true precursors from false ones plays an important role in the computational identification of miRNAs. Previous approaches extract numerical features from precursor sequences and their secondary structures to suit particular classification methods; however, such features may lose useful discriminative information hidden in the sequences and structures. In this study, pre-miRNA sequences and their secondary structures are used directly to construct an exponential kernel based on the weighted Levenshtein distance between two sequences. This string kernel is then combined with a support vector machine (SVM) to separate true pre-miRNAs from false ones. Using 331 training samples of true and false human pre-miRNAs, 2 key SVM parameters are selected by 5-fold cross-validation and grid search, and the procedure is repeated 5 times with different 5-fold partitions. Among 16 independent test sets (3 human, 8 animal, 2 plant, 1 virus, and 2 artificially false human pre-miRNA sets), our method statistically outperforms the previous SVM-based technique on 11 sets: 3 human, 7 animal, and 1 false human pre-miRNA set. In particular, pre-miRNAs with multiple loops, which were usually excluded in the previous work, are correctly identified in this study with an accuracy of 92.66%.
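The kernel construction described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the edit weights, the kernel width, or how sequence and secondary structure are combined, so the names `sub_cost`, `indel_cost`, and `gamma`, their default values, and the form exp(-gamma * d) are assumptions made for illustration. Each input is treated as a plain string, for example a nucleotide sequence or a dot-bracket structure string.

```python
import math


def weighted_levenshtein(a, b, sub_cost=1.0, indel_cost=1.0):
    """Edit distance with separate weights for substitutions and indels.

    The weights are hypothetical defaults, not values from the paper.
    """
    # prev[j] holds the distance between the first i-1 symbols of a
    # and the first j symbols of b.
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, cb in enumerate(b, start=1):
            match_cost = 0.0 if ca == cb else sub_cost
            curr.append(min(
                prev[j] + indel_cost,      # delete ca
                curr[j - 1] + indel_cost,  # insert cb
                prev[j - 1] + match_cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def exponential_kernel(a, b, gamma=0.1, sub_cost=1.0, indel_cost=1.0):
    """Assumed kernel form: exp(-gamma * weighted edit distance)."""
    d = weighted_levenshtein(a, b, sub_cost=sub_cost, indel_cost=indel_cost)
    return math.exp(-gamma * d)


def gram_matrix(strings, gamma=0.1):
    """Pairwise kernel values, as a list of lists."""
    return [[exponential_kernel(s, t, gamma=gamma) for t in strings]
            for s in strings]


if __name__ == "__main__":
    # Toy strings for illustration only; not real pre-miRNA data.
    toy = ["ACGUACGU", "ACGUUCGU", "GGGGCCCC"]
    for row in gram_matrix(toy):
        print(["%.3f" % v for v in row])
```

A Gram matrix of this kind could then be passed to an SVM implementation that accepts precomputed kernels, with `gamma` and the SVM penalty tuned by grid search under 5-fold cross-validation; whether those are the 2 parameters tuned in the study is not stated in the abstract.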