ESPRESSO: a system for estimating protein expression and solubility in protein expression systems.

Author information

Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan.

Publication information

Proteomics. 2013 May;13(9):1444-56. doi: 10.1002/pmic.201200175.

DOI:10.1002/pmic.201200175

PMID:23436767

Abstract

Recombinant protein technology is essential for conducting protein science and using proteins as materials in pharmaceutical or industrial applications. Although obtaining soluble proteins is still a major experimental obstacle, knowledge about protein expression/solubility under standard conditions may increase the efficiency and reduce the cost of proteomics studies. In this study, we present a computational approach to estimate the probability of protein expression and solubility for two different protein expression systems: in vivo Escherichia coli and wheat germ cell-free, from only the sequence information. It implements two kinds of methods: a sequence/predicted structural property-based method that uses both the sequence and predicted structural features, and a sequence pattern-based method that utilizes the occurrence frequencies of sequence patterns. In the benchmark test, the proposed methods obtained F-scores of around 70%, and outperformed publicly available servers. Applying the proposed methods to genomic data revealed that proteins associated with translation or transcription have a strong tendency to be expressed as soluble proteins by the in vivo E. coli expression system. The sequence pattern-based method also has the potential to indicate a candidate region for modification, to increase protein solubility. All methods are available for free at the ESPRESSO server (http://mbs.cbrc.jp/ESPRESSO).
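The abstract does not say how ESPRESSO defines its sequence patterns or which exact F-score variant was reported, so the following is only a minimal sketch under stated assumptions. It treats overlapping k-mers as a stand-in for "sequence patterns" and computes the balanced F-score (F1) for binary soluble/insoluble predictions. The function names `kmer_frequencies` and `f_score`, the default `k = 3`, and the toy inputs are all hypothetical and are not taken from the paper or the ESPRESSO server.

```python
from collections import Counter


def kmer_frequencies(sequence: str, k: int = 3) -> dict[str, float]:
    """Relative occurrence frequency of each overlapping k-mer in a sequence.

    Assumption: k-mers stand in for the paper's "sequence patterns",
    whose real definition is not given in the abstract.
    """
    sequence = sequence.upper()
    total = len(sequence) - k + 1  # number of overlapping windows
    if total <= 0:
        return {}
    counts = Counter(sequence[i:i + k] for i in range(total))
    return {kmer: n / total for kmer, n in counts.items()}


def f_score(true_labels: list[bool], predicted: list[bool]) -> float:
    """Balanced F-score: harmonic mean of precision and recall."""
    pairs = list(zip(true_labels, predicted))
    tp = sum(1 for t, p in pairs if t and p)        # true positives
    fp = sum(1 for t, p in pairs if not t and p)    # false positives
    fn = sum(1 for t, p in pairs if t and not p)    # false negatives
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy peptide, not a real benchmark sequence.
    print(kmer_frequencies("MKKLMKK", k=2))
    # Toy labels: True means "expressed as soluble".
    print(f_score([True, True, False, False], [True, False, True, False]))
```

A frequency vector of this kind could serve as input features for a classifier, and the F-score would then summarize how well its soluble/insoluble calls match experimental labels. Whether ESPRESSO works this way in detail is not confirmed by the abstract.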
