Vitsios Dimitrios M, Kentepozidou Elissavet, Quintais Leonor, Benito-Gutiérrez Elia, van Dongen Stijn, Davis Matthew P, Enright Anton J
European Molecular Biology Laboratory-European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK.
Department of Zoology, University of Cambridge, Downing Street, Cambridge CB2 3EJ, UK.
Nucleic Acids Res. 2017 Dec 1;45(21):e177. doi: 10.1093/nar/gkx836.
The discovery of microRNAs (miRNAs) remains an important problem, particularly given the growth of high-throughput sequencing, cell sorting and single-cell biology. While a large number of miRNAs have already been annotated, many more may be expressed only in very particular cell types and remain elusive. Sequencing allows us to quickly and accurately identify the expression of known miRNAs from small RNA-Seq data. The biogenesis of miRNAs leads to very specific characteristics in their sequences. In brief, miRNAs usually have a well-defined 5' end and a more flexible 3' end, with the possibility of 3' tailing events such as uridylation. Previous approaches to the prediction of novel miRNAs usually involve the analysis of structural features of miRNA precursor hairpin sequences obtained from the genome sequence. We surmised that it may be possible to identify miRNAs by using these biogenesis features observed directly from sequenced reads, either alone or in addition to structural analysis from genome data. To this end, we have developed mirnovo, a machine-learning-based algorithm that is able to identify known and novel miRNAs in animals and plants directly from small RNA-Seq data, with or without a reference genome. The method performs comparably to existing tools but is simpler to use and has a shorter run time. Its performance and accuracy have been tested on multiple datasets, including species with poorly assembled genomes, RNase III (Drosha and/or Dicer) deficient samples and single cells (at both embryonic and adult stages).
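The read-level signature described in the abstract (a well-defined 5' end and a more variable 3' end) can be illustrated with a minimal sketch. This is not mirnovo's actual feature set or code. The function name end_consistency, the (start, end) read representation and the toy cluster are all assumptions made for illustration.

```python
from collections import Counter


def end_consistency(reads):
    """Summarise how consistent the read ends are within one read cluster.

    `reads` is a list of (start, end) positions of reads within a cluster
    (a hypothetical representation, not mirnovo's internal format).
    Returns (five_prime_fraction, three_prime_fraction): the share of reads
    that use the most common 5' start and the most common 3' end.
    """
    if not reads:
        raise ValueError("need at least one read")
    n = len(reads)
    start_counts = Counter(start for start, _ in reads)
    end_counts = Counter(end for _, end in reads)
    five_prime = start_counts.most_common(1)[0][1] / n
    three_prime = end_counts.most_common(1)[0][1] / n
    return five_prime, three_prime


# Toy cluster: most reads share the same 5' start, while 3' ends vary.
toy_reads = [(10, 31), (10, 32), (10, 33), (10, 32), (11, 32)]
five_prime, three_prime = end_consistency(toy_reads)
print(f"5' consistency: {five_prime:.2f}, 3' consistency: {three_prime:.2f}")
```

In a cluster that behaves like a miRNA, the 5' fraction would be expected to be high and the 3' fraction lower. A real classifier would combine many such features, and the abstract does not specify which ones mirnovo uses.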