Research Group for Genomic Epidemiology, National Food Institute, Technical University of Denmark, 2800 Kgs. Lyngby, Denmark.
Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, 2200 Copenhagen, Denmark.
Comput Biol Chem. 2021 Dec;95:107596. doi: 10.1016/j.compbiolchem.2021.107596. Epub 2021 Oct 27.
A crucial process in the production of industrial enzymes is recombinant gene expression, which aims to overexpress enzyme-encoding genes in a host microbe. Current approaches for securing overexpression rely on molecular tools such as adjusting the recombinant expression vector, adjusting cultivation conditions, or performing codon optimization. However, such strategies are time-consuming, and an alternative strategy is to select genes that are more compatible with the recombinant host. Several methods for predicting soluble expression are available; however, they are all optimized for the expression host Escherichia coli and do not consider the possibility of an expressed protein not being soluble. We show that these tools are not suited for predicting expression potential in the industrially important host Bacillus subtilis. Instead, we build a B. subtilis-specific machine learning model for expressibility prediction. Given millions of unlabeled proteins and a small labeled dataset, we can successfully train such a predictive model. The unlabeled proteins provide a performance boost relative to using amino acid frequencies of the labeled proteins as input. On average, the model achieves a modest performance: an area under the curve (AUC) of 0.64 and a Matthews correlation coefficient (MCC) of 0.2. However, we find that this is sufficient for prioritizing expression candidates for high-throughput studies. Moreover, the predicted class probabilities are correlated with expression levels. A number of features related to protein expression, including base frequencies and solubility, are captured by the model.
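As an illustration of the quantities named in the abstract, the following is a minimal Python sketch of an amino acid frequency feature vector (the baseline input the abstract compares against) and of the two reported metrics, AUC and MCC. It is not the authors' implementation: the function names, the 20-letter alphabet ordering, and the toy labels and scores in the usage note are assumptions made purely for illustration.

```python
from collections import Counter
from math import sqrt

# Assumed ordering of the 20 standard amino acids (illustrative choice).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_frequencies(sequence):
    """Return the relative frequency of each standard amino acid in a sequence.

    Non-standard characters are ignored when computing the total.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present to compute AUC")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For example, with hypothetical labels `[1, 1, 0, 0]` and predicted probabilities `[0.9, 0.4, 0.5, 0.1]`, `roc_auc` ranks three of the four positive/negative pairs correctly. An AUC of 0.5 and an MCC of 0 correspond to chance-level prediction, which is the context in which the reported 0.64 and 0.2 are described as modest.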