Biological Engineering Program, King Mongkut's University of Technology Thonburi, Bang Mod, Thung Khru, Bangkok 10140, Thailand.
Nucleic Acids Res. 2013 Jan 7;41(1):e21. doi: 10.1093/nar/gks878. Epub 2012 Sep 24.
An ensemble classifier approach for microRNA precursor (pre-miRNA) classification was proposed that combines a set of heterogeneous algorithms, including support vector machine (SVM), k-nearest neighbors (kNN) and random forest (RF), and aggregates their predictions through a voting system. In addition, the classification performance of the proposed algorithm was improved by using discriminative features, self-containment and its derivatives, which capture the unique structural robustness characteristics of pre-miRNAs and are applicable across different species. By applying two preprocessing methods, a correlation-based feature selection (CFS) with a genetic algorithm (GA) search method and a modified Synthetic Minority Oversampling Technique (SMOTE) bagging rebalancing method, an improvement in the performance of this ensemble was observed. The overall prediction accuracy obtained via 10 runs of 5-fold cross validation (CV) was 96.54%, with a sensitivity of 94.8% and a specificity of 98.3%; this is a better trade-off between sensitivity and specificity than that of other state-of-the-art methods. The ensemble model was applied to animal, plant and virus pre-miRNAs and achieved high accuracy (>93%). The discriminative set of selected features also suggests that pre-miRNAs possess high intrinsic structural robustness compared with other stem loops. Our heterogeneous ensemble method gave more reliable predictions than methods using single classifiers. Our program is available at http://ncrna-pred.com/premiRNA.html.
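The voting step and the reported metrics can be illustrated with a minimal sketch. It assumes simple majority voting over binary labels (1 = pre-miRNA, 0 = other stem loop); the abstract does not specify the exact voting rule, so this is an assumption. The function names and the toy predictions are hypothetical and are not taken from the authors' program.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def majority_vote(votes: Sequence[int]) -> int:
    """Return the label chosen by most classifiers.

    With three classifiers and two classes a tie cannot occur.
    """
    return Counter(votes).most_common(1)[0][0]


def ensemble_predict(per_classifier_preds: Sequence[Sequence[int]]) -> List[int]:
    """Aggregate predictions sample by sample.

    per_classifier_preds holds one prediction list per classifier,
    all of the same length.
    """
    return [majority_vote(votes) for votes in zip(*per_classifier_preds)]


def evaluate(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Return (sensitivity, specificity, accuracy) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    sensitivity = tp / (tp + fn)  # fraction of true pre-miRNAs recovered
    specificity = tn / (tn + fp)  # fraction of other stem loops rejected
    accuracy = (tp + tn) / len(pairs)
    return sensitivity, specificity, accuracy


# Hypothetical toy predictions standing in for the SVM, kNN and RF outputs.
svm_preds = [1, 1, 0, 0, 1, 0]
knn_preds = [1, 0, 0, 1, 1, 0]
rf_preds = [1, 1, 0, 0, 0, 0]
y_true = [1, 1, 0, 0, 1, 1]

ensemble = ensemble_predict([svm_preds, knn_preds, rf_preds])
sens, spec, acc = evaluate(y_true, ensemble)
```

In this toy data each single classifier makes at least one error that the other two outvote, which is the intuition behind combining heterogeneous models; the numbers themselves carry no relation to the results reported in the abstract.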