Institute of Biomedical Informatics, National Yang-Ming University, Taipei, Taiwan.
BMC Bioinformatics. 2010 Jan 18;11 Suppl 1(Suppl 1):S21. doi: 10.1186/1471-2105-11-S1-S21.
Recombinant protein production is a useful biotechnology for producing large quantities of highly soluble proteins. Currently, the most widely used production system fuses a target protein into one of several vectors in Escherichia coli (E. coli). However, the production efficacy of a given vector varies from one target protein to another, and trial and error is still the common practice for finding out how well a vector works for a given target protein. Previous studies are limited in that they assumed that proteins would be over-expressed and focused only on the solubility of the expressed proteins. In fact, many pairings of vectors and proteins result in no expression.
In this study, we applied machine learning to train models that predict whether a vector-protein pairing will or will not express in E. coli. For expressed cases, the models further predict whether the expressed protein will be soluble. We collected a set of real cases from the clients of our recombinant protein production core facility, where six different vectors were designed and studied. This set of cases is used in both the training and the evaluation of our models. We evaluate three different models based on support vector machines (SVMs) and their ensembles. Unlike many previous works, these models use the sequence of the target protein as well as the sequence of the whole fusion vector as features. We show that a model that classifies a case directly into one of the three classes (no expression, inclusion body and soluble) outperforms a model that considers the nested structure of the three classes, while a model that can take advantage of the hierarchical structure of the three classes performs slightly worse than, but comparably to, the best model. We also show that our best method achieves higher prediction accuracy than previous works. Lastly, we briefly present two methods for using the trained model in the design of recombinant protein production systems to improve the chance of producing highly soluble protein.
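The nested decision described above (expression first, then solubility only for expressed cases) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the abstract does not specify the feature encoding, so amino acid composition is used as a stand-in, and `composition`, `featurize` and `nested_predict` are hypothetical helpers rather than the authors' program. The two classifiers are passed in as plain callables where trained SVMs would go.

```python
from collections import Counter
from typing import Callable, List, Sequence

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
LABELS = ("no expression", "inclusion body", "soluble")

# A binary classifier is modelled as any callable from features to bool.
Classifier = Callable[[Sequence[float]], bool]


def composition(seq: str) -> List[float]:
    """Fraction of each standard amino acid in seq (assumed encoding)."""
    counts = Counter(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def featurize(fusion_seq: str, target_seq: str) -> List[float]:
    """Concatenate features of the whole fusion construct and of the target."""
    return composition(fusion_seq) + composition(target_seq)


def nested_predict(is_expressed: Classifier,
                   is_soluble: Classifier,
                   features: Sequence[float]) -> str:
    """Two-stage decision: solubility is only asked for expressed cases."""
    if not is_expressed(features):
        return LABELS[0]
    return LABELS[2] if is_soluble(features) else LABELS[1]


# Toy stand-ins for trained classifiers (arbitrary rules, not real models).
always_yes: Classifier = lambda f: True
always_no: Classifier = lambda f: False

x = featurize("MKTAYIAKQR", "AYIAKQR")  # made-up sequences
print(nested_predict(always_yes, always_no, x))
```

A flat three-class model would instead map the same feature vector directly to one of the three labels in a single step; the sketch only shows how the nested alternative chains two binary decisions.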
In this paper, we show that a machine learning approach to predicting the efficacy of a vector for a target protein in a recombinant protein production system is promising and may complement traditional knowledge-driven studies of vector efficacy. We will release our program in the public domain to share with other labs when this paper is published.