School of Software Engineering, Chengdu University of Information Technology, Chengdu, China.
Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China.
BMC Biol. 2023 Jan 24;21(1):12. doi: 10.1186/s12915-023-01510-8.
Protein solubility is a prerequisite for efficient heterologous protein expression, which underlies most industrial applications, and for functional interpretation in basic research. However, the recurrent formation of inclusion bodies remains a roadblock in protein science and industry: only about a quarter of proteins can be expressed successfully in soluble form. Although numerous solubility prediction models have been developed over time, their performance remains unsatisfactory given the current rapid growth in available protein sequences. It is therefore imperative to develop novel, highly accurate predictors that prioritize highly soluble proteins and so reduce the cost of experimental work.
In this study, we developed a novel tool, DeepSoluE, which predicts protein solubility using a long short-term memory (LSTM) network with hybrid features composed of physicochemical patterns and a distributed representation of amino acids. Comparative results showed that the proposed model achieved more accurate and more balanced performance than existing tools. Furthermore, we explored the specific features that have a dominant impact on model performance, as well as their interaction effects.
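To make the idea of hybrid features concrete, the following is a minimal sketch of how a protein sequence could be turned into two inputs: integer tokens of the kind an embedding layer (a distributed representation) would consume, and a small physicochemical vector. It is not DeepSoluE's actual feature set. The function names, the choice of amino acid composition plus mean Kyte-Doolittle hydropathy as the physicochemical part, and the padding length are all assumptions made for illustration; the embedding and LSTM themselves are omitted.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
PAD_ID = 0                            # assumed padding token
UNKNOWN_ID = len(AMINO_ACIDS) + 1     # assumed token for non-standard residues
TOKEN_ID = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}

# Kyte-Doolittle hydropathy scale (higher = more hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def composition(seq):
    """Fraction of each standard residue, in AMINO_ACIDS order."""
    counts = Counter(aa for aa in seq.upper() if aa in TOKEN_ID)
    total = sum(counts.values())
    return [counts[aa] / total if total else 0.0 for aa in AMINO_ACIDS]


def mean_hydropathy(seq):
    """Average Kyte-Doolittle value over the standard residues."""
    values = [KYTE_DOOLITTLE[aa] for aa in seq.upper() if aa in KYTE_DOOLITTLE]
    return sum(values) / len(values) if values else 0.0


def encode(seq, max_len):
    """Integer-encode a sequence, truncating or right-padding to max_len."""
    ids = [TOKEN_ID.get(aa, UNKNOWN_ID) for aa in seq.upper()[:max_len]]
    return ids + [PAD_ID] * (max_len - len(ids))


def hybrid_input(seq, max_len=50):
    """Bundle both views of a sequence (hypothetical helper)."""
    return {
        "tokens": encode(seq, max_len),  # would feed an embedding + LSTM
        "physchem": composition(seq) + [mean_hydropathy(seq)],
    }
```

In a full model, the token sequence would typically pass through an embedding layer and the LSTM, and the physicochemical vector would be combined with the recurrent output before the final classification layer; how DeepSoluE combines the two is not specified here.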
DeepSoluE is suitable for predicting protein solubility in E. coli; it serves as a bioinformatics tool for prescreening potentially soluble targets to reduce the cost of wet-lab experiments. The web server is freely accessible at http://lab.malab.cn/~wangchao/softs/DeepSoluE/.