miWords: transformer-based composite deep learning for highly accurate discovery of pre-miRNA regions across plant genomes.

Authors

Gupta Sagar, Shankar Ravi

Affiliations

Studio of Computational Biology & Bioinformatics, The Himalayan Centre for High-throughput Computational Biology, (HiCHiCoB, A BIC supported by DBT, India), CSIR-Institute of Himalayan Bioresource Technology (CSIR-IHBT), Palampur, Himachal Pradesh 176061, India.

Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, Uttar Pradesh 201002, India.

Publication information

Brief Bioinform. 2023 Mar 19;24(2). doi: 10.1093/bib/bbad088.

Abstract

Discovering precursor microRNAs (pre-miRNAs) is the core of miRNA discovery. Many tools that rely on traditional sequence and structural features have been published for this purpose. However, in practical applications such as genome annotation, their actual performance has been very low. The problem is more serious in plants, where pre-miRNAs are much more complex and harder to identify than in animals. A large gap also exists between animals and plants in the software available for miRNA discovery and in species-specific miRNA information. Here, we present miWords, a composite deep learning system of transformers and convolutional neural networks that treats the genome as a pool of sentences made of words with specific occurrence preferences and contexts, in order to accurately identify pre-miRNA regions across plant genomes. A comprehensive benchmarking study was carried out involving >10 software tools representing different genres and many experimentally validated datasets. miWords emerged as the best performer, exceeding 98% accuracy with a performance lead of ~10%. miWords was also evaluated across the Arabidopsis genome, where it again outperformed the compared tools. As a demonstration, miWords was run across the tea genome and reported 803 pre-miRNA regions, all validated by small RNA-seq reads from multiple samples, with most of them functionally supported by degradome sequencing data. miWords is freely available as stand-alone source code at https://scbb.ihbt.res.in/miWords/index.php.
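The abstract describes the genome as a pool of sentences made of words, but does not say how those words are formed. One common way to obtain such words from a nucleotide sequence is to slide a window over it and emit overlapping k-mers, which can then be mapped to integer IDs for a transformer-style model. The sketch below illustrates that general idea only. The function names, the choice of `k=3`, the stride, and the `<pad>`/`<unk>` tokens are all assumptions made for illustration and are not taken from miWords.

```python
from typing import Dict, Iterable, List


def kmer_words(sequence: str, k: int = 3, stride: int = 1) -> List[str]:
    """Split a nucleotide sequence into overlapping k-mer "words".

    Hypothetical helper: k and stride are illustrative defaults,
    not values reported for miWords.
    """
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def build_vocab(sentences: Iterable[List[str]]) -> Dict[str, int]:
    """Assign an integer ID to every distinct word, in order of first appearance."""
    vocab = {"<pad>": 0, "<unk>": 1}  # assumed special tokens
    for words in sentences:
        for word in words:
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(words: List[str], vocab: Dict[str, int]) -> List[int]:
    """Map words to IDs, sending unseen words to the <unk> ID."""
    return [vocab.get(word, vocab["<unk>"]) for word in words]


# Usage: treat one candidate region as a "sentence" of 3-mer words.
region = "ACGTAC"  # made-up sequence, for illustration only
sentence = kmer_words(region, k=3)
vocab = build_vocab([sentence])
token_ids = encode(sentence, vocab)
print(sentence, token_ids)
```

Overlapping windows keep the local context of each position at the cost of a longer token sequence. A larger stride shortens the sequence but discards some of that context.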

