Institute of Bioinformatics, University Medicine Greifswald, Walther-Rathenau Str. 48, 17489 Greifswald, Germany.
Department of Biosciences, Molecular Cell Biology of Plants, Goethe University, 60438 Frankfurt am Main, Germany.
Int J Mol Sci. 2023 May 17;24(10):8884. doi: 10.3390/ijms24108884.
Non-coding RNA (ncRNA) classes perform important housekeeping and regulatory functions and are highly heterogeneous in length, sequence conservation and secondary structure. High-throughput sequencing reveals novel expressed ncRNAs, and classifying them is important for understanding cell regulation and for identifying potential diagnostic and therapeutic biomarkers. To improve the classification of ncRNAs, we investigated different approaches that use primary sequences and secondary structures, as well as the late integration of both, using machine learning models, including different neural network architectures. As input, we used the newest version of RNAcentral, focusing on six ncRNA classes: lncRNA, rRNA, tRNA, miRNA, snRNA and snoRNA. The late integration of graph-encoded structural features and primary sequences in our classifier achieved an overall accuracy of >97%, which could not be increased by more fine-grained subclassification. Compared with the currently best-performing tool, ncRDense, we obtained a minimal increase of 0.5% in all four overlapping ncRNA classes on a similar test set of sequences. In summary, our classifier is not only more accurate than current ncRNA prediction tools but also allows the prediction of long ncRNA classes (lncRNAs, certain rRNAs) of up to 12,000 nt and is trained on a more diverse ncRNA dataset retrieved from RNAcentral.
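To make the idea of late integration concrete, the following is a minimal sketch of one simple form of it: two separately trained models (one on primary sequence, one on secondary structure) each output a probability per ncRNA class, and the two outputs are combined by a weighted average. This is an illustrative assumption, not the architecture used in the study; the function names `late_fusion` and `predict`, the `seq_weight` parameter and the example probabilities are all hypothetical.

```python
from typing import Dict, Sequence

# The six ncRNA classes named in the abstract.
CLASSES = ("lncRNA", "rRNA", "tRNA", "miRNA", "snRNA", "snoRNA")


def late_fusion(
    seq_probs: Sequence[float],
    struct_probs: Sequence[float],
    seq_weight: float = 0.5,
) -> Dict[str, float]:
    """Combine two per-class probability vectors by weighted averaging.

    seq_probs and struct_probs are assumed to come from two independently
    trained models (hypothetical here). seq_weight is the weight given to
    the sequence model; the structure model gets 1 - seq_weight.
    """
    if len(seq_probs) != len(CLASSES) or len(struct_probs) != len(CLASSES):
        raise ValueError("each model must output one probability per class")
    if not 0.0 <= seq_weight <= 1.0:
        raise ValueError("seq_weight must lie in [0, 1]")

    fused = [
        seq_weight * p + (1.0 - seq_weight) * q
        for p, q in zip(seq_probs, struct_probs)
    ]
    total = sum(fused)
    if total <= 0.0:
        raise ValueError("fused probabilities must have a positive sum")
    # Renormalise so the fused scores sum to 1.
    return {name: value / total for name, value in zip(CLASSES, fused)}


def predict(
    seq_probs: Sequence[float],
    struct_probs: Sequence[float],
    seq_weight: float = 0.5,
) -> str:
    """Return the class with the highest fused probability."""
    fused = late_fusion(seq_probs, struct_probs, seq_weight)
    return max(fused, key=fused.get)


if __name__ == "__main__":
    # Made-up model outputs for a single transcript, for illustration only.
    seq_out = [0.60, 0.10, 0.10, 0.10, 0.05, 0.05]
    struct_out = [0.05, 0.05, 0.05, 0.80, 0.03, 0.02]
    print(predict(seq_out, struct_out))
    print(predict(seq_out, struct_out, seq_weight=0.9))
```

When the two models disagree, as in the made-up example, the weight decides which evidence dominates. In practice, late integration in a neural network is often done by concatenating learned feature vectors before a shared classification layer rather than by averaging probabilities; the averaging shown here is only the simplest variant.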