Salekin Sirajul, Mostavi Milad, Chiu Yu-Chiao, Chen Yidong, Zhang Jianqiu Michelle, Huang Yufei
Department of Electrical and Computer Engineering, the University of Texas at San Antonio, San Antonio, TX, 78207, USA.
Greehey Children's Cancer Research Institute, University of Texas Health San Antonio, San Antonio, TX, 78229, USA.
Front Phys. 2020 Jun;8. doi: 10.3389/fphy.2020.00196. Epub 2020 Jun 19.
Epitranscriptomics is an exciting area that studies different types of modifications in transcripts, and predicting such modification sites from the transcript sequence is of significant interest. However, the scarcity of positive sites for most modifications poses critical challenges for training robust algorithms. To circumvent this problem, we propose MR-GAN, a generative adversarial network (GAN) based model, which is trained in an unsupervised fashion on entire pre-mRNA sequences to learn a low-dimensional embedding of transcriptomic sequences. MR-GAN was then applied to extract embeddings of the sequences in a training dataset we created for eight epitranscriptome modifications, including m6A, m1A, m1G, m2G, m5C, m5U, 2'-O-Me, Pseudouridine (Ψ) and Dihydrouridine (D), for which positive samples are very limited. Prediction models were trained on the embeddings extracted by MR-GAN. We compared the prediction performance with that of one-hot encoding of the training sequences and of SRAMP, a state-of-the-art m6A site prediction algorithm, and demonstrated that the learned embeddings outperform one-hot encoding by a significant margin, with up to 15% improvement. Using MR-GAN, we also investigated the sequence motifs for each modification type and uncovered known motifs as well as new motifs that could not be found from the sequences directly. The results demonstrated that transcriptome features extracted using unsupervised learning could lead to high precision in predicting multiple types of epitranscriptome modifications, even when the data size is small and extremely imbalanced.