Central European Institute of Technology, Brno, Czech Republic.
Department of Electrical and Computer Engineering, School of Engineering, University of Thessaly, Volos, Greece.
Sci Rep. 2020 Jun 11;10(1):9486. doi: 10.1038/s41598-020-66454-3.
Genomic regions that encode small RNA genes exhibit characteristic patterns in their sequence, secondary structure, and evolutionary conservation. Convolutional Neural Networks are a family of algorithms that can classify data based on learned patterns. Here we present MuStARD, an application of Convolutional Neural Networks that can learn patterns associated with user-defined sets of genomic regions and scan large genomic areas for novel regions exhibiting similar characteristics. We demonstrate that MuStARD is a generic method that can be trained on different classes of human small RNA genomic loci without the need for domain-specific knowledge, owing to the automated feature and background selection processes built into the model. We also demonstrate that MuStARD can identify functional elements across species by predicting mouse small RNAs (pre-miRNAs and snoRNAs) using models trained on the human genome. MuStARD can be used to filter small RNA-Seq datasets to identify novel small RNA loci, both intra- and inter-species, as demonstrated in three use cases of human, mouse, and fly pre-miRNA prediction. MuStARD is easy to deploy and to extend to a variety of genomic classification questions. Code and trained models are freely available at gitlab.com/RBP_Bioinformatics/mustard.
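The scanning idea described above can be illustrated with a minimal sketch. It is not MuStARD's code: it uses only the sequence channel (the abstract also mentions secondary structure and conservation), and the function names, window size, step, threshold, and the GC-fraction scorer are all hypothetical stand-ins. In practice the scorer would be a trained convolutional model.

```python
from typing import Callable, Iterator, List, Tuple

# One-hot rows in A, C, G, T order; any other symbol (e.g. N) maps to all zeros.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}

Encoded = List[Tuple[int, int, int, int]]


def one_hot(seq: str) -> Encoded:
    """Encode a DNA string as a list of 4-element one-hot rows."""
    return [ONE_HOT.get(base, (0, 0, 0, 0)) for base in seq.upper()]


def sliding_windows(seq: str, window: int, step: int) -> Iterator[Tuple[int, str]]:
    """Yield (start, subsequence) for each full-length window."""
    for start in range(0, len(seq) - window + 1, step):
        yield start, seq[start:start + window]


def scan(
    seq: str,
    window: int,
    step: int,
    score_fn: Callable[[Encoded], float],
    threshold: float,
) -> List[Tuple[int, int, float]]:
    """Score every window and keep (start, end, score) for those at or above threshold."""
    hits = []
    for start, sub in sliding_windows(seq, window, step):
        score = score_fn(one_hot(sub))
        if score >= threshold:
            hits.append((start, start + window, score))
    return hits


def gc_fraction(encoded: Encoded) -> float:
    """Hypothetical placeholder scorer (fraction of C/G); stands in for a trained CNN."""
    return sum(row[1] + row[2] for row in encoded) / len(encoded)


if __name__ == "__main__":
    for start, end, score in scan("ATATGCGCGCATAT", 4, 2, gc_fraction, 1.0):
        print(start, end, score)
```

Passing the scorer as a function argument keeps the windowing logic independent of the model, so a real classifier's prediction call could be substituted without changing `scan`.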