School of Information Science and Technology, Northeast Normal University, Changchun, Jilin, 130117, China.
Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, ON, M5S 3E1, Canada.
Adv Sci (Weinh). 2024 Oct;11(39):e2407013. doi: 10.1002/advs.202407013. Epub 2024 Aug 19.
The 3' untranslated regions (3'UTRs) of messenger RNAs contain many important cis-regulatory elements that are under functional and evolutionary constraints. It is hypothesized that these constraints are analogous to the grammar and syntax of human languages and can be modeled by advanced natural language techniques such as Transformers, which have been very effective in modeling complex protein sequences and structures. Here 3UTRBERT is described, which implements an attention-based language model, i.e., Bidirectional Encoder Representations from Transformers (BERT). 3UTRBERT is pre-trained on aggregated 3'UTR sequences of human mRNAs in a task-agnostic manner; the pre-trained model is then fine-tuned for specific downstream tasks such as identifying RBP binding sites and m6A RNA modification sites, and predicting RNA sub-cellular localization. Benchmark results show that 3UTRBERT generally outperformed other contemporary methods on each of these tasks. More importantly, the self-attention mechanism within 3UTRBERT allows direct visualization of the semantic relationships between sequence elements and effectively identifies regions with important regulatory potential. It is expected that the 3UTRBERT model can serve as a foundational tool for various sequence-labeling tasks in the 3'UTR field, thus enhancing the ability to decipher post-transcriptional regulatory mechanisms.
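To make the pre-train/fine-tune workflow described above concrete, the following is a minimal sketch of that two-stage pattern. It is not the authors' released code: it assumes the Hugging Face `transformers` library and DNABERT-style overlapping k-mer tokenization; the `kmerize` helper, the checkpoint path, the vocabulary size, and the four-class localization label set are all illustrative assumptions.

```python
# Minimal sketch of the pre-train / fine-tune pattern the abstract describes.
# Assumptions (not from the paper): Hugging Face `transformers`, a DNABERT-style
# 3-mer vocabulary, and hypothetical names (`kmerize`, the checkpoint path).
from transformers import BertConfig, BertForMaskedLM, BertForSequenceClassification

def kmerize(seq: str, k: int = 3) -> str:
    """Split an RNA sequence into overlapping k-mers, e.g. 'AUGCA' -> 'AUG UGC GCA'."""
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

# Stage 1 -- task-agnostic pre-training: masked-language modeling (MLM)
# over aggregated, k-merized human 3'UTR sequences.
config = BertConfig(
    vocab_size=69,           # 4^3 = 64 possible 3-mers + special tokens (assumed)
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
)
pretrain_model = BertForMaskedLM(config)
# ... train pretrain_model with an MLM objective on k-merized 3'UTRs ...

# Stage 2 -- fine-tuning: reuse the pre-trained encoder for a downstream task,
# here sequence-level classification of RNA sub-cellular localization.
clf = BertForSequenceClassification.from_pretrained(
    "path/to/pretrained-3utr-checkpoint",  # hypothetical checkpoint path
    num_labels=4,                          # e.g. four localization classes (assumed)
)
```

Token-level tasks such as calling RBP binding sites or m6A sites would follow the same pattern but swap in a per-token head (e.g., `BertForTokenClassification`) in stage 2.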