Shulgina Yekaterina, Trinidad Marena I, Langeberg Conner J, Nisonoff Hunter, Chithrananda Seyone, Skopintsev Petr, Nissley Amos J, Patel Jaymin, Boger Ron S, Shi Honglue, Yoon Peter H, Doherty Erin E, Pande Tara, Iyer Aditya M, Doudna Jennifer A, Cate Jamie H D
Innovative Genomics Institute, University of California, Berkeley, CA, USA.
Department of Molecular and Cell Biology, University of California, Berkeley, CA, USA.
Nat Commun. 2024 Dec 5;15(1):10627. doi: 10.1038/s41467-024-54812-y.
Structured RNA lies at the heart of many central biological processes, from gene expression to catalysis. RNA structure prediction is not yet possible due to a lack of high-quality reference data associated with organismal phenotypes that could inform RNA function. We present GARNET (Gtdb Acquired RNa with Environmental Temperatures), a new database for RNA structural and functional analysis anchored to the Genome Taxonomy Database (GTDB). GARNET links RNA sequences to experimental and predicted optimal growth temperatures of GTDB reference organisms. Using GARNET, we develop sequence- and structure-aware RNA generative models, with overlapping triplet tokenization providing optimal encoding for a GPT-like model. Leveraging hyperthermophilic RNAs in GARNET and these RNA generative models, we identify mutations in ribosomal RNA that confer increased thermostability to the Escherichia coli ribosome. The GTDB-derived data and deep learning models presented here provide a foundation for understanding the connections between RNA sequence, structure, and function.
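The overlapping triplet tokenization mentioned above can be illustrated with a minimal sketch. It assumes a window of three nucleotides advanced one position at a time, which is the usual reading of "overlapping triplets"; the abstract does not give the exact scheme. The function name `overlapping_triplets` and the T-to-U normalization are illustrative assumptions, not the authors' implementation.

```python
from typing import List


def overlapping_triplets(seq: str) -> List[str]:
    """Split a sequence into overlapping 3-mers (window 3, stride 1).

    Assumption: stride of 1 and DNA-style 'T' mapped to RNA 'U'.
    Sequences shorter than three nucleotides yield no tokens.
    """
    rna = seq.upper().replace("T", "U")
    return [rna[i:i + 3] for i in range(len(rna) - 2)]


# A 6-nucleotide sequence yields 4 overlapping tokens.
tokens = overlapping_triplets("GGCUAA")
print(tokens)
```

Under this scheme each nucleotide appears in up to three tokens, so a sequence of length n produces n - 2 tokens drawn from a vocabulary of at most 64 triplets (before any special tokens).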