Department of Systems and Computer Engineering, Carleton University, Ottawa, Ontario, Canada.
Sci Rep. 2019 Jul 29;9(1):10931. doi: 10.1038/s41598-019-47399-8.
MicroRNAs (miRNAs) are short, non-coding RNAs involved in cell regulation at the post-transcriptional and translational levels. Numerous computational predictors of miRNA have been developed that generally classify miRNA based on either sequence- or expression-based features. While these methods are highly effective, they require large labelled training data sets, which are unavailable for many species. At the same time, emerging high-throughput wet-lab experimental procedures are producing large unlabelled data sets of genomic sequence and RNA expression profiles. Existing methods use supervised machine learning and are therefore unable to leverage these unlabelled data. In this paper, we design and develop a multi-view co-training approach for the classification of miRNA that maximizes the utility of unlabelled training data by taking advantage of multiple views of the problem. Starting with only 10 labelled training examples, co-training is shown to significantly (p < 0.01) increase the classification accuracy of both sequence- and expression-based classifiers, without requiring any new labelled training data. After 11 iterations of co-training, the expression-based view of miRNA classification experiences an average increase in AUPRC of 15.81% over six species, compared to 11.90% for self-training and 4.84% for passive learning. Similar results are observed for sequence-based classifiers, with increases of 46.47%, 39.53%, and 29.43% for co-training, self-training, and passive learning, respectively. The co-trained sequence- and expression-based classifiers are integrated into a final confidence-based classifier that shows improved performance compared to both the expression (1.5%, p = 0.021) and sequence (3.7%, p = 0.006) views. This study represents the first application of multi-view co-training to miRNA prediction and shows great promise, particularly for understudied species with little available training data.
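The co-training loop described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the toy nearest-centroid classifier (`CentroidClassifier`), the synthetic two-view data (`make_toy_data`), the choice to add one pseudo-labelled sample per view per iteration, and the "trust the more confident view" rule in `predict_combined`. The abstract does not specify the classifier type, the number of samples added per iteration, or the exact integration rule.

```python
import random


class CentroidClassifier:
    """Toy nearest-centroid model for one view (a stand-in for any real classifier)."""

    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            rows = [x for x, lab in zip(X, y) if lab == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict_with_confidence(self, x):
        # Squared distance from x to each class centroid.
        dists = {lab: sum((a - b) ** 2 for a, b in zip(x, c))
                 for lab, c in self.centroids.items()}
        label = min(dists, key=dists.get)
        total = sum(dists.values())
        # Relative closeness to the winning centroid, used as a crude confidence.
        conf = 1.0 - dists[label] / total if total > 0 else 1.0
        return label, conf


def co_train(view_a, view_b, seed_labels, n_iter=11):
    """Grow a shared labelled set: each view pseudo-labels its most confident sample."""
    labels = dict(seed_labels)  # sample index -> label (1 = miRNA, 0 = not, by assumption)
    views = (view_a, view_b)
    clfs = (CentroidClassifier(), CentroidClassifier())

    def refit():
        idx = sorted(labels)
        for clf, view in zip(clfs, views):
            clf.fit([view[i] for i in idx], [labels[i] for i in idx])

    refit()
    for _ in range(n_iter):
        for clf, view in zip(clfs, views):
            pool = [i for i in range(len(view)) if i not in labels]
            if not pool:
                break
            # The sample this view is most sure about becomes training data for both views.
            best = max(pool, key=lambda i: clf.predict_with_confidence(view[i])[1])
            labels[best] = clf.predict_with_confidence(view[best])[0]
        refit()
    return clfs[0], clfs[1], labels


def predict_combined(clf_a, clf_b, x_a, x_b):
    """Assumed integration rule: return the label from the more confident view."""
    label_a, conf_a = clf_a.predict_with_confidence(x_a)
    label_b, conf_b = clf_b.predict_with_confidence(x_b)
    return label_a if conf_a >= conf_b else label_b


def make_toy_data(n, seed=0):
    """Synthetic, well-separated two-view data; not real sequence or expression features."""
    rng = random.Random(seed)
    truth = [i % 2 for i in range(n)]
    view_a = [[rng.gauss(5.0 * t, 0.5), rng.gauss(5.0 * t, 0.5)] for t in truth]
    view_b = [[rng.gauss(-5.0 * t, 0.5), rng.gauss(5.0 * t, 0.5)] for t in truth]
    return view_a, view_b, truth


if __name__ == "__main__":
    seq_view, expr_view, truth = make_toy_data(40)
    seed_labels = {i: truth[i] for i in range(10)}  # start from 10 labelled samples
    seq_clf, expr_clf, labels = co_train(seq_view, expr_view, seed_labels)
    print(len(labels), "samples labelled after co-training")
    print("combined prediction:", predict_combined(seq_clf, expr_clf, seq_view[-1], expr_view[-1]))
```

In this sketch the two views share one growing labelled set, so a confident prediction from the sequence view becomes a training example for the expression view and vice versa. Self-training would correspond to running the same loop with a single view.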