Revealing determinants of translation efficiency via whole-gene codon randomization and machine learning.

Affiliations

Laboratory of Microbiology, Wageningen University, Wageningen, Stippeneng 4, 6708 WE, The Netherlands.

Bioinformatics Group, Wageningen University, Wageningen, Droevendaalsesteeg 1, 6708 PB, The Netherlands.

Publication information

Nucleic Acids Res. 2023 Mar 21;51(5):2363-2376. doi: 10.1093/nar/gkad035.

Abstract

It has been known for decades that codon usage contributes to translation efficiency and hence to protein production levels. However, its role in protein synthesis is still only partly understood. This lack of understanding hampers the design of synthetic genes for efficient protein production. In this study, we generated a synonymous codon-randomized library of the complete coding sequence of red fluorescent protein. Protein production levels and the full coding sequences were determined for 1459 gene variants in Escherichia coli. Using different machine learning approaches, these data were used to reveal correlations between codon usage and protein production. Interestingly, protein production levels can be relatively accurately predicted (Pearson correlation of 0.762) by a Random Forest model that only relies on the sequence information of the first eight codons. In this region, close to the translation initiation site, mRNA secondary structure rather than Codon Adaptation Index (CAI) is the key determinant of protein production. This study clearly demonstrates the key role of codons at the start of the coding sequence. Furthermore, these results imply that commonly used CAI-based codon optimization of the full coding sequence is not a very effective strategy. One should rather focus on optimizing protein production via reducing mRNA secondary structure formation with the first few codons.
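As a rough illustration of what "sequence information of the first eight codons" could look like as model input, the sketch below one-hot encodes the leading codons of a coding sequence and computes a Pearson correlation between predicted and measured values. The abstract does not specify the feature encoding, so the one-hot scheme, the function names, and the choice to count the start codon among the eight are all assumptions made here, and the example sequence is invented rather than taken from the study.

```python
import math
from itertools import product

BASES = "ACGT"
# All 64 codons in a fixed order, so each codon maps to one feature column.
ALL_CODONS = ["".join(p) for p in product(BASES, repeat=3)]
CODON_INDEX = {codon: i for i, codon in enumerate(ALL_CODONS)}


def split_codons(cds):
    """Split an in-frame coding sequence (DNA or RNA) into codons."""
    cds = cds.upper().replace("U", "T")
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def one_hot_first_codons(cds, n_codons=8):
    """One-hot encode the first n_codons codons as a flat 0/1 vector.

    Assumption: the start codon is counted as codon 1. The source does
    not say whether it is included in the eight.
    """
    codons = split_codons(cds)[:n_codons]
    if len(codons) < n_codons:
        raise ValueError("sequence is shorter than n_codons codons")
    vec = [0] * (len(ALL_CODONS) * n_codons)
    for pos, codon in enumerate(codons):
        vec[pos * len(ALL_CODONS) + CODON_INDEX[codon]] = 1
    return vec


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented 8-codon example sequence, for illustration only.
example_cds = "ATGGCTTCTTCTGAAGACGTTATC"
features = one_hot_first_codons(example_cds)
```

Vectors like `features`, one per variant, could then be passed to a random forest regressor (for example scikit-learn's `RandomForestRegressor`), and `pearson` applied to held-out predictions against measured production levels. This shows the general shape of such a pipeline, not the authors' implementation.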

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b8d1/10018363/422ac96872d9/gkad035fig1.jpg
