Gupta Sagar, Shankar Ravi
Studio of Computational Biology & Bioinformatics, The Himalayan Centre for High-throughput Computational Biology, (HiCHiCoB, A BIC supported by DBT, India), CSIR-Institute of Himalayan Bioresource Technology (CSIR-IHBT), Palampur, Himachal Pradesh 176061, India.
Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, Uttar Pradesh 201002, India.
Brief Bioinform. 2023 Mar 19;24(2). doi: 10.1093/bib/bbad088.
Discovering pre-microRNAs (pre-miRNAs) is the core of miRNA discovery. Many tools based on traditional sequence and structural features have been published for this purpose. However, in practical applications such as genome annotation, their actual performance has been very low. The problem is more severe in plants, where, unlike in animals, pre-miRNAs are much more complex and difficult to identify. A large gap exists between animals and plants in both the software available for miRNA discovery and the species-specific miRNA information. Here, we present miWords, a composite deep learning system of transformers and convolutional neural networks that treats the genome as a pool of sentences made of words with specific occurrence preferences and contexts, in order to accurately identify pre-miRNA regions across plant genomes. Comprehensive benchmarking was carried out against >10 software tools representing different genres, using many experimentally validated datasets. miWords emerged as the best performer, exceeding 98% accuracy with a performance lead of ~10%. miWords was also evaluated across the Arabidopsis genome, where it again outperformed the compared tools. As a demonstration, miWords was run across the tea genome and reported 803 pre-miRNA regions, all of which were validated by small RNA-seq reads from multiple samples, and most of which were functionally supported by degradome sequencing data. miWords is freely available as stand-alone source code at https://scbb.ihbt.res.in/miWords/index.php.
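The idea of reading a genome as sentences made of words can be illustrated with a minimal sketch that splits a sequence into overlapping k-mer "words" and maps them to integer ids, the usual input form for a transformer or convolutional model. This is not the miWords implementation: the word length (k = 3), the stride, the function names kmer_words, build_vocab and encode, and the example sequence are all assumptions made for illustration, since the abstract does not specify how words are defined.

```python
from typing import Dict, Iterable, List


def kmer_words(seq: str, k: int = 3, stride: int = 1) -> List[str]:
    """Split a nucleotide sequence into overlapping k-mer 'words'.

    k and stride are illustrative defaults, not values taken from miWords.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    seq = seq.upper()
    # Slide a window of width k along the sequence; a sequence shorter
    # than k yields an empty sentence.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def build_vocab(sentences: Iterable[List[str]]) -> Dict[str, int]:
    """Assign an integer id to every distinct word seen in the sentences.

    Ids 0 and 1 are reserved for padding and unknown words (an assumed,
    commonly used convention).
    """
    vocab = {"<pad>": 0, "<unk>": 1}
    distinct = sorted({word for sentence in sentences for word in sentence})
    for word in distinct:
        vocab[word] = len(vocab)
    return vocab


def encode(words: List[str], vocab: Dict[str, int]) -> List[int]:
    """Convert a sentence of words into ids, mapping unseen words to <unk>."""
    return [vocab.get(word, vocab["<unk>"]) for word in words]


if __name__ == "__main__":
    # Hypothetical candidate region; not a real pre-miRNA sequence.
    region = "ACGUACGGUA"
    sentence = kmer_words(region, k=3)
    vocab = build_vocab([sentence])
    print(sentence)
    print(encode(sentence, vocab))
```

In a sketch like this, the encoded id sequence would then be embedded and passed to the downstream network; overlapping words keep local context at the cost of longer sentences, while a larger stride shortens them.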