Chu Yanyi, Yu Dan, Li Yupeng, Huang Kaixuan, Shen Yue, Cong Le, Zhang Jason, Wang Mengdi
Center for Statistics and Machine Learning and Department of Electrical and Computer Engineering, Princeton University, Princeton, NJ 08544, USA.
Department of Pathology, Stanford University School of Medicine, Stanford, CA 94305, USA.
Nat Mach Intell. 2024 Apr;6(4):449-460. doi: 10.1038/s42256-024-00823-9. Epub 2024 Apr 5.
The 5' UTR, a regulatory region at the beginning of an mRNA molecule, plays a crucial role in regulating translation and affects protein expression levels. Language models have proven effective at decoding the functions of protein and genome sequences. Here, we introduce a language model for the 5' UTR, which we refer to as UTR-LM. UTR-LM is pre-trained on endogenous 5' UTRs from multiple species and is further augmented with supervised information, including secondary structure and minimum free energy. We fine-tuned UTR-LM on a variety of downstream tasks. The model outperformed the best known benchmark by up to 5% for predicting mean ribosome loading, and by up to 8% for predicting translation efficiency and mRNA expression level. The model also applies to identifying unannotated internal ribosome entry sites within the untranslated region, improving the AUPR from 0.37 to 0.52 compared with the best baseline. Further, we designed a library of 211 novel 5' UTRs with high predicted translation efficiency and evaluated them in a wet-lab assay. Experimental results confirmed that our top designs achieved a 32.5% increase in protein production relative to well-established 5' UTRs optimized for therapeutics.