School of Electronic and Computer Engineering, Peking University, Shenzhen 518055, China.
AI for Science (AI4S)-Preferred Program, Peking University Shenzhen Graduate School, Shenzhen 518055, China.
Nucleic Acids Res. 2024 Jan 11;52(1):e3. doi: 10.1093/nar/gkad1031.
Compared with proteins, DNA and RNA are more difficult languages to interpret because four-letter coded DNA/RNA sequences have lower information content than 20-letter coded protein sequences. While BERT (Bidirectional Encoder Representations from Transformers)-like language models have been developed for RNA, they are ineffective at capturing the evolutionary information from homologous sequences because, unlike proteins, RNA sequences are less conserved. Here, we have developed an unsupervised multiple sequence alignment-based RNA language model (RNA-MSM) by utilizing homologous sequences from an automatic pipeline, RNAcmap, which can provide significantly more homologous sequences than the manually annotated Rfam database. We demonstrate that the resulting unsupervised, two-dimensional attention maps and one-dimensional embeddings from RNA-MSM contain structural information. In fact, they can be directly mapped with high accuracy to 2D base pairing probabilities and 1D solvent accessibilities, respectively. Further fine-tuning led to significantly improved performance on these two downstream tasks compared with existing state-of-the-art techniques, including SPOT-RNA2 and RNAsnap2. By comparison, the embedding from RNA-FM, a BERT-based RNA language model, performs worse than one-hot encoding in base pair and solvent-accessible surface area prediction. We anticipate that the pre-trained RNA-MSM model can be fine-tuned on many other tasks related to RNA structure and function.
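The mapping from 2D attention maps to base pairing probabilities described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the trained RNA-MSM model: the attention tensor is random stand-in data, the layer/head counts are hypothetical, and the logistic readout uses untrained weights (in practice these would be fit on known secondary structures).

```python
import numpy as np

rng = np.random.default_rng(0)

L = 40                     # illustrative RNA sequence length (assumption)
n_layers, n_heads = 4, 8   # hypothetical transformer dimensions

# Hypothetical stand-in for RNA-MSM attention maps: (layers, heads, L, L).
attn = rng.random((n_layers, n_heads, L, L))

# Flatten layer/head axes into per-residue-pair features and symmetrize,
# since base pairing is a symmetric relation (i pairs j <=> j pairs i).
feats = attn.reshape(-1, L, L)
feats = 0.5 * (feats + feats.transpose(0, 2, 1))

# Map each (i, j) feature vector to a base-pair probability with a
# logistic regression; weights here are random for illustration only.
w = rng.normal(size=feats.shape[0])
b = -2.0
logits = np.tensordot(w, feats, axes=1) + b   # shape (L, L)
bp_prob = 1.0 / (1.0 + np.exp(-logits))      # values in (0, 1)
```

The symmetrization step guarantees the predicted probability matrix is symmetric, mirroring the symmetry of the base-pairing relation; the analogous 1D readout for solvent accessibility would apply a per-residue regression to the embeddings instead.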