
cdsBERT - Extending Protein Language Models with Codon Awareness.

Authors

Hallee Logan, Rafailidis Nikolaos, Gleghorn Jason P

Affiliations

Center for Bioinformatics and Computational Biology, University of Delaware.

Department of Biological Sciences, University of Delaware.

Publication

bioRxiv. 2023 Sep 17:2023.09.15.558027. doi: 10.1101/2023.09.15.558027.

Abstract

Recent advancements in Protein Language Models (pLMs) have enabled high-throughput analysis of proteins through primary sequence alone. At the same time, newfound evidence illustrates that codon usage bias is remarkably predictive and can even change the final structure of a protein. Here, we explore these findings by extending the traditional vocabulary of pLMs from amino acids to codons to encapsulate more information inside CoDing Sequences (CDS). We build upon traditional transfer learning techniques with a novel pipeline of token embedding matrix seeding, masked language modeling, and student-teacher knowledge distillation, called MELD. This transformed the pretrained ProtBERT into cdsBERT, a pLM with a codon vocabulary trained on a massive corpus of CDS. Interestingly, cdsBERT variants produced a highly biochemically relevant latent space, outperforming their amino acid-based counterparts on enzyme commission number prediction. Further analysis revealed that synonymous codon token embeddings moved distinctly in the embedding space, showcasing unique additions of information across broad phylogeny inside these traditionally "silent" mutations. This embedding movement correlated significantly with average usage bias across phylogeny. Future fine-tuned organism-specific codon pLMs may show even greater gains in codon usage fidelity. This work demonstrates the exciting potential of the codon vocabulary for improving current state-of-the-art structure and function prediction, and motivates the creation of a codon pLM foundation model alongside the addition of high-quality CDS to large-scale protein databases.
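The token embedding matrix seeding step of the MELD pipeline can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each of the 64 codon tokens is initialized with the embedding of the amino acid (or stop signal) it encodes, so that synonymous codons begin at the same point in the embedding space and only diverge during subsequent training. The `seed_codon_embeddings` helper and the toy 4-dimensional embeddings are assumptions for this sketch.

```python
import random

# Standard genetic code: codon -> amino acid (one-letter; '*' = stop).
BASES = "TCAG"
AA = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
      "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
GENETIC_CODE = {a + b + c: AA[16 * i + 4 * j + k]
                for i, a in enumerate(BASES)
                for j, b in enumerate(BASES)
                for k, c in enumerate(BASES)}

def seed_codon_embeddings(aa_embeddings):
    """Initialize each codon vector by copying the embedding of the residue it encodes."""
    return {codon: list(aa_embeddings[aa]) for codon, aa in GENETIC_CODE.items()}

# Toy pretrained amino-acid embeddings (4-dim, random for illustration;
# in MELD these would come from the pretrained ProtBERT embedding matrix).
random.seed(0)
aa_emb = {aa: [random.gauss(0, 1) for _ in range(4)]
          for aa in set(GENETIC_CODE.values())}
codon_emb = seed_codon_embeddings(aa_emb)

# Synonymous codons (e.g. the leucine codons CTG and TTA) start identical.
assert codon_emb["CTG"] == codon_emb["TTA"]
```

After seeding, masked language modeling over CDS lets the synonymous embeddings drift apart; the paper's finding is that this drift correlates with average codon usage bias across phylogeny.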


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5297/10516008/c5b2976b7c9d/nihpp-2023.09.15.558027v1-f0001.jpg
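For the student-teacher knowledge distillation stage of MELD, a common formulation (an assumption here; the abstract does not specify the exact loss) is to minimize the KL divergence between temperature-softened teacher and student token distributions, with the usual Hinton-style T² scaling:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2
    so gradient magnitudes stay comparable across temperatures."""
    p = softmax(teacher_logits, temperature)   # teacher: soft targets
    q = softmax(student_logits, temperature)   # student: predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

# A student matching the teacher exactly incurs zero loss.
assert distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]) == 0.0
```

In the MELD setting the teacher would be an amino-acid pLM such as ProtBERT and the student the codon-vocabulary cdsBERT; the distillation term keeps the student's latent space anchored to the teacher's biochemical knowledge while the codon vocabulary adds information.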
