Protein solubility: sequence based prediction and experimental verification.

Author information

Smialowski Pawel, Martin-Galiano Antonio J, Mikolajka Aleksandra, Girschick Tobias, Holak Tad A, Frishman Dmitrij

Affiliation

Department of Genome Oriented Bioinformatics, Technische Universität München, Wissenschaftszentrum Weihenstephan, 85350 Freising, Germany.

Publication information

Bioinformatics. 2007 Oct 1;23(19):2536-42. doi: 10.1093/bioinformatics/btl623. Epub 2006 Dec 6.

DOI: 10.1093/bioinformatics/btl623

PMID: 17150993

Abstract

MOTIVATION

Obtaining soluble proteins in sufficient concentrations is a recurring limiting factor in various experimental studies. Solubility is an individual trait of proteins which, under a given set of experimental conditions, is determined by their amino acid sequence. Accurate theoretical prediction of solubility from sequence is instrumental for setting priorities on targets in large-scale proteomics projects.

RESULTS

We present PROSO, a machine-learning approach that estimates the likelihood that a protein will be soluble upon heterologous expression in Escherichia coli, based on its amino acid composition. The classifier has a two-layered structure in which the outputs of primary support vector machine (SVM) classifiers serve as input to a secondary Naive Bayes classifier. Training data were drawn from experimental progress information in the TargetDB database and from previously published datasets. Compared with previously published methods, our classifier discriminates better, with a Matthews correlation coefficient (MCC) of 0.434 between predicted and known solubility states and an overall prediction accuracy of 72% (75% and 68% for the positive and negative classes, respectively). We also provide experimental verification of our predictions using solubility measurements for 31 mutational variants of two different proteins.
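As a rough illustration of two quantities named in the abstract, the sketch below computes an amino acid composition vector from a sequence and the Matthews correlation coefficient from a confusion matrix. It is not the PROSO implementation: the function names, the example sequence, and the confusion-matrix counts are all hypothetical. The counts assume 100 proteins per class with 75% and 68% per-class accuracy; the abstract does not give the real class sizes, so the resulting MCC is close to, but not the same as, the reported 0.434.

```python
import math
from collections import Counter

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(sequence):
    """Return the fraction of each standard residue in `sequence`.

    Non-standard characters (e.g. X, gaps) are ignored.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal total is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Hypothetical example sequence (not taken from the paper).
features = aa_composition("MKTAYIAKQRQISFVKSHFSRQ")

# Hypothetical counts: 100 soluble and 100 insoluble proteins,
# classified with 75% and 68% per-class accuracy.
score = mcc(tp=75, tn=68, fp=32, fn=25)
print(round(score, 3))
```

A composition vector like `features` is the kind of input a sequence-based classifier could consume; the MCC is preferred over raw accuracy when the two classes differ in size, because it uses all four cells of the confusion matrix.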

Similar articles

Protein solubility: sequence based prediction and experimental verification.

Bioinformatics. 2007 Oct 1;23(19):2536-42. doi: 10.1093/bioinformatics/btl623. Epub 2006 Dec 6.

Improved method for predicting beta-turn using support vector machine.

Bioinformatics. 2005 May 15;21(10):2370-4. doi: 10.1093/bioinformatics/bti358. Epub 2005 Mar 29.

PROSO II--a new method for protein solubility prediction.

FEBS J. 2012 Jun;279(12):2192-200. doi: 10.1111/j.1742-4658.2012.08603.x. Epub 2012 May 21.

SOLpro: accurate sequence-based prediction of protein solubility.

Bioinformatics. 2009 Sep 1;25(17):2200-7. doi: 10.1093/bioinformatics/btp386. Epub 2009 Jun 23.

POODLE-L: a two-level SVM prediction system for reliably predicting long disordered regions.

Bioinformatics. 2007 Aug 15;23(16):2046-53. doi: 10.1093/bioinformatics/btm302. Epub 2007 Jun 1.

Probabilistic multi-class multi-kernel learning: on protein fold recognition and remote homology detection.

Bioinformatics. 2008 May 15;24(10):1264-70. doi: 10.1093/bioinformatics/btn112. Epub 2008 Mar 31.

Support vector machines for prediction of dihedral angle regions.

Bioinformatics. 2006 Dec 15;22(24):3009-15. doi: 10.1093/bioinformatics/btl489. Epub 2006 Sep 27.

Protein backbone angle prediction with machine learning approaches.

Bioinformatics. 2004 Jul 10;20(10):1612-21. doi: 10.1093/bioinformatics/bth136. Epub 2004 Feb 26.

Prediction of protein structural class for the twilight zone sequences.

Biochem Biophys Res Commun. 2007 Jun 1;357(2):453-60. doi: 10.1016/j.bbrc.2007.03.164. Epub 2007 Apr 5.

Support Vector Machine-based classification of protein folds using the structural properties of amino acid residues and amino acid residue pairs.

Bioinformatics. 2007 Dec 15;23(24):3320-7. doi: 10.1093/bioinformatics/btm527. Epub 2007 Nov 7.

Cited by

In silico construction of a multi-epitope vaccine (RGME-VAC/ATS-1) against the Rickettsia genus using immunoinformatics.

Mem Inst Oswaldo Cruz. 2025 Mar 21;120:e240201. doi: 10.1590/0074-02760240201. eCollection 2025.

Genome-wide identification, characterization and expression analysis of tubulin gene family in Populus deltoides.

BMC Plant Biol. 2025 Feb 20;25(1):234. doi: 10.1186/s12870-025-06228-z.

Enhancing protein aggregation prediction: a unified analysis leveraging graph convolutional networks and active learning.

RSC Adv. 2024 Oct 3;14(43):31439-31450. doi: 10.1039/d4ra06285j. eCollection 2024 Oct 1.

ProSol-multi: Protein solubility prediction via amino acids multi-level correlation and discriminative distribution.

Heliyon. 2024 Aug 22;10(17):e36041. doi: 10.1016/j.heliyon.2024.e36041. eCollection 2024 Sep 15.

PLM_Sol: predicting protein solubility by benchmarking multiple protein language models with the updated Escherichia coli protein solubility dataset.

Brief Bioinform. 2024 Jul 25;25(5). doi: 10.1093/bib/bbae404.

Structure-Based De Novo Design for the Discovery of Miniprotein Inhibitors Targeting Oncogenic Mutant BRAF.

Int J Mol Sci. 2024 May 19;25(10):5535. doi: 10.3390/ijms25105535.

Prediction of matrilineal specific patatin-like protein governing in-vivo maternal haploid induction in maize using support vector machine and di-peptide composition.

Amino Acids. 2024 Mar 9;56(1):20. doi: 10.1007/s00726-023-03368-0.

HybridGCN for protein solubility prediction with adaptive weighting of multiple features.

J Cheminform. 2023 Dec 8;15(1):118. doi: 10.1186/s13321-023-00788-8.

Computational approaches for molecular characterization and structure-based functional elucidation of a hypothetical protein from Mycobacterium tuberculosis.

Genomics Inform. 2023 Jun;21(2):e25. doi: 10.5808/gi.23001. Epub 2023 Jun 30.

In silico and experimental methods for designing a potent anticancer arazyme-herceptin fusion protein in HER2-positive breast cancer.

J Mol Model. 2023 Apr 27;29(5):160. doi: 10.1007/s00894-023-05562-z.
