Naghipourfar Mohsen, Chen Siyu, Howard Mathew K, Macdonald Christian B, Saberi Ali, Hagen Timo, Mofrad Mohammad R K, Coyote-Maestas Willow, Goodarzi Hani
Molecular Cell Biomechanics Laboratory, Departments of Bioengineering and Mechanical Engineering, University of California, Berkeley, Berkeley, CA, USA.
Arc Institute, Palo Alto, CA, USA.
bioRxiv. 2024 Oct 13:2024.10.10.617568. doi: 10.1101/2024.10.10.617568.
In the canonical genetic code, many amino acids are assigned more than one codon. Work by us and others has shown that the choice among these synonymous codons is not random and carries regulatory and functional consequences. Existing protein foundation models ignore this context-dependent role of coding sequence in shaping the protein landscape of the cell. To address this gap, we introduce cdsFM, a suite of codon-resolution large language models, including both EnCodon and DeCodon models, with up to 1B parameters. Pre-trained on 60 million protein-coding sequences from more than 5,000 species, our models effectively learn the relationship between codons and amino acids, recapitulating the overall structure of the genetic code. In addition to outperforming state-of-the-art genomic foundation models in a variety of zero-shot and few-shot learning tasks, the larger pre-trained models were superior in predicting the choice of synonymous codons. To systematically assess the impact of synonymous codon choices on protein expression and our models' ability to capture these effects, we generated a large dataset measuring overall and surface expression levels of three proteins as a function of changes in their synonymous codons. We showed that our EnCodon models could be readily fine-tuned to predict the contextual consequences of synonymous codon choices. Armed with this knowledge, we applied EnCodon to existing clinical datasets of synonymous variants and identified a large number of synonymous codons that are likely pathogenic, several of which we experimentally confirmed in a cell-based model. Together, our findings establish the cdsFM suite as a powerful tool for decoding the complex functional grammar underlying the choice of synonymous codons.