sinc(i), Research Institute for Signals, Systems and Computational Intelligence (CONICET-UNL), Ciudad Universitaria, Santa Fe, Argentina.
Brief Bioinform. 2019 Sep 27;20(5):1607-1620. doi: 10.1093/bib/bby037.
The importance of microRNAs (miRNAs) is now widely recognized because these short RNA segments play roles in almost all biological processes. The computational prediction of novel miRNAs involves training a classifier to identify the sequences most likely to be precursors of miRNAs (pre-miRNAs). The main difficulty is that well-known pre-miRNAs are few compared with the hundreds of thousands of candidate sequences in a genome, which results in high class imbalance. This imbalance strongly affects most standard classifiers; if it is not properly addressed in the model and the experiments, the reported performance can be completely unrealistic and the classifier will not work properly for pre-miRNA prediction. Another important issue is that most of the machine learning (ML) approaches used so far are supervised methods, which require both positive and negative examples. Selecting positive examples is straightforward (well-known pre-miRNAs). However, it is difficult to build a representative set of negative examples because they should be sequences with hairpin structure that do not contain a pre-miRNA.
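To make the effect of class imbalance concrete, the following minimal sketch computes the expected precision of a classifier whose sensitivity and specificity stay fixed while the number of negatives per positive grows. The function name and the value 0.95 are illustrative assumptions, not figures taken from the review.

```python
def precision_at_imbalance(sensitivity, specificity, negatives_per_positive):
    """Expected precision when every positive comes with N negatives.

    Hypothetical helper for illustration only; it assumes sensitivity and
    specificity do not change with the class ratio.
    """
    true_pos = sensitivity                                   # per positive example
    false_pos = (1.0 - specificity) * negatives_per_positive  # per positive example
    return true_pos / (true_pos + false_pos)


if __name__ == "__main__":
    # Illustrative operating point (an assumption, not a reported result).
    for ratio in (1, 10, 100, 1000, 10000):
        p = precision_at_imbalance(0.95, 0.95, ratio)
        print(f"imbalance 1:{ratio:<6} expected precision = {p:.4f}")
```

Under these assumptions, a classifier that looks excellent on a balanced test set (precision 0.95 at 1:1) would return mostly false positives at 1:1000, which is one way a reported performance can become unrealistic when the test imbalance does not match the genome-wide one.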
This review provides a comprehensive study and comparative assessment of methods from two ML approaches to the prediction of novel pre-miRNAs: supervised and unsupervised training. We present and analyze the ML proposals that have appeared in the literature during the past 10 years. They are compared in several prediction tasks involving two model genomes and increasing imbalance levels. Rather than just reviewing published software tools, this work reviews existing ML approaches for pre-miRNA prediction and provides fair comparisons of the classifiers using the same features and data sets. The results and the discussion can help the community select the most adequate bioinformatics approach for the prediction task at hand. The comparative results suggest that at low to mid imbalance levels between classes, supervised methods can be the best. However, at very high imbalance levels, closer to real-case scenarios, models including unsupervised and deep learning can provide better performance.