Effective Gene Expression Prediction and Optimization from Protein Sequences

Authors

Liu Tuoyu, Zhang Yiyang, Li Yanjun, Xu Guoshun, Gao Han, Wang Pengtao, Tu Tao, Luo Huiying, Wu Ningfeng, Yao Bin, Liu Bo, Guan Feifei, Huang Huoqing, Tian Jian

Affiliations

State Key Laboratory of Animal Nutrition and Feeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, 100193, China.

National Key Laboratory of Agricultural Microbiology, Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing, 100081, China.

Publication information

Adv Sci (Weinh). 2025 Feb;12(8):e2407664. doi: 10.1002/advs.202407664. Epub 2025 Jan 9.

DOI:10.1002/advs.202407664

PMID:39783932

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11848636/

Abstract

High soluble protein expression in heterologous hosts is crucial for a wide range of research and applications. Although the impact of codon usage on expression levels has been studied extensively, the relationship between protein sequence and expression is often overlooked. In this study, a novel connection between protein expression and sequence was uncovered, leading to the development of SRAB (Strength of Relative Amino Acid Bias) based on AEI (Amino Acid Expression Index). The AEI served as an objective measure of this correlation, with higher AEI values enhancing soluble expression. Subsequently, a pre-trained protein model, MP-TRANS (MindSpore Protein Transformer), was developed and fine-tuned by transfer learning to create 88 prediction models (MPB-EXP) that predict heterologous expression levels across 88 species. This approach achieved an average accuracy of 0.78, surpassing conventional machine learning methods. Additionally, a mutant generation model, MPB-MUT, was devised and used to enhance expression levels in specific hosts. Experimental validation demonstrated that the top 3 mutants of a xylanase (previously not expressed in Escherichia coli) achieved high-level soluble expression in E. coli. These findings highlight the efficacy of the developed models in predicting and optimizing gene expression based on protein sequences.
