Wu Pengpeng, Nie Zhenjun, Huang Zhiqiang, Zhang Xiaodan
Anhui Province Key Laboratory of Smart Agricultural Technology and Equipment, Anhui Agricultural University, Hefei 230036, China.
School of Life Science, Anhui Agricultural University, Hefei 230036, China.
Plants (Basel). 2023 Apr 14;12(8):1652. doi: 10.3390/plants12081652.
Circular RNAs (circRNAs), which are produced by back-splicing of pre-mRNAs, are strongly linked to the emergence of several tumor types. Identifying circRNAs is the first step in any follow-up study. Most established circRNA identification tools were developed for animals. However, the sequence features of plant circRNAs differ from those of animal circRNAs, so these tools detect plant circRNAs poorly. For example, plant circRNAs often carry non-GT/AG splicing signals at their junction sites, and their flanking intron sequences contain few reverse complementary sequences and repetitive elements. In addition, circRNAs in plants have received little study, so a plant-specific identification method is urgently needed. In this study, we propose CircPCBL, a deep-learning approach that uses only raw sequences to distinguish plant circRNAs from other lncRNAs. CircPCBL comprises two separate detectors: a CNN-BiGRU detector and a GLT detector. The CNN-BiGRU detector takes the one-hot encoding of the RNA sequence as input, while the GLT detector uses k-mer features (k = 1 to 4). The output matrices of the two submodels are concatenated and passed through a fully connected layer to produce the final output. To verify the generalization performance of the model, we evaluated CircPCBL on several datasets. It achieved an F1 of 85.40% on the validation dataset, which comprises six plant species, and F1 scores of 85.88%, 75.87%, and 86.83% on the three cross-species independent test sets, each composed of a single species. On the real set, CircPCBL correctly predicted ten of the eleven experimentally reported circRNAs and nine of the ten rice lncRNAs, for accuracies of 90.9% and 90%, respectively. CircPCBL could therefore contribute to the identification of circRNAs in plants.
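The two input representations named above can be illustrated with a minimal Python sketch. It is not the CircPCBL source code: the function names `one_hot` and `kmer_frequencies`, the A/C/G/U column order, the mapping of T to U, and the normalization of k-mer counts to frequencies are all assumptions made for illustration.

```python
from itertools import product

ALPHABET = "ACGU"  # assumed column order; the abstract does not specify one


def one_hot(seq):
    """Return a len(seq) x 4 one-hot matrix as nested lists.

    Bases outside A/C/G/U (for example N) become all-zero rows (assumption).
    """
    seq = seq.upper().replace("T", "U")
    return [[1 if base == letter else 0 for letter in ALPHABET] for base in seq]


def kmer_frequencies(seq, k_values=(1, 2, 3, 4)):
    """Return normalized k-mer frequencies for each k, concatenated.

    With k = 1..4 the vector has 4 + 16 + 64 + 256 = 340 entries.
    Normalizing counts by the number of windows is an assumption.
    """
    seq = seq.upper().replace("T", "U")
    features = []
    for k in k_values:
        # Slide a window of width k across the sequence.
        windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        counts = {}
        for window in windows:
            counts[window] = counts.get(window, 0) + 1
        total = len(windows)
        # Emit one feature per possible k-mer, in a fixed lexicographic order.
        for kmer in product(ALPHABET, repeat=k):
            key = "".join(kmer)
            features.append(counts.get(key, 0) / total if total else 0.0)
    return features


if __name__ == "__main__":
    example = "ACGUACGU"  # toy sequence, not taken from any dataset
    print(len(one_hot(example)), "x", len(one_hot(example)[0]))
    print(len(kmer_frequencies(example)))
```

In a two-branch model of this kind, the one-hot matrix would feed the sequence detector and the fixed-length k-mer vector would feed the second detector, so only the first representation depends on sequence length.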
In addition, CircPCBL achieved an average accuracy of 94.08% on human datasets, suggesting that it may also be applicable to animal data. CircPCBL is available as a web server, from which the data and source code can be downloaded free of charge.
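The real-set accuracies quoted in the abstract follow directly from the reported counts (ten of eleven, nine of ten), and F1 is the harmonic mean of precision and recall. The short Python sketch below reproduces that arithmetic. The helper names are illustrative, and the confusion counts passed to `f1_score` are hypothetical values chosen only to show the formula, not results from the paper.

```python
def accuracy(correct, total):
    """Fraction of predictions that were correct."""
    return correct / total


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Counts reported in the abstract for the real set.
    print(f"circRNAs: {accuracy(10, 11):.1%}")
    print(f"lncRNAs:  {accuracy(9, 10):.1%}")
    # Hypothetical confusion counts, for illustration only.
    print(f"F1 (toy): {f1_score(tp=8, fp=2, fn=2):.2f}")
```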