Prediction and analysis of protein solubility using a novel scoring card method with dipeptide composition.

Affiliation

Department of Biological Science and Technology, National Chiao Tung University, Hsinchu, Taiwan.

Publication information

BMC Bioinformatics. 2012;13 Suppl 17(Suppl 17):S3. doi: 10.1186/1471-2105-13-S17-S3. Epub 2012 Dec 13.

Abstract

BACKGROUND

Existing methods for predicting protein solubility on overexpression in Escherichia coli improve performance by using ensemble classifiers, such as two-stage support vector machine (SVM) based classifiers, together with many feature types, such as physicochemical properties and amino acid and dipeptide composition, combined with feature selection. A simple and easily interpretable method for predicting protein solubility is desirable as an alternative to these complex SVM-based methods.

RESULTS

This study proposes a novel scoring card method (SCM) that uses dipeptide composition alone to estimate the solubility scores of sequences for predicting protein solubility. SCM first calculates the propensity of each of the 400 dipeptides to be soluble by statistical discrimination between the soluble and insoluble proteins of a training data set. The propensity scores of all dipeptides are then further optimized using an intelligent genetic algorithm. The solubility score of a sequence is the weighted sum of all propensity scores, with the sequence's dipeptide composition as the weights. To evaluate SCM by performance comparison, four data sets of different sizes and with different degrees of variation in experimental conditions were used. The results show that SCM, a simple method with interpretable dipeptide propensities, performs promisingly compared with existing SVM-based ensemble methods that use many feature types. Furthermore, the dipeptide propensities and the solubility scores of sequences can provide insight into protein solubility. For example, analysis of the dipeptide scores indicates that α-helical structure and thermophilic proteins have a high propensity to be soluble.
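The scoring steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The initial propensity is assumed here to be the difference in mean dipeptide composition between the soluble and insoluble training sequences, rescaled to a 0 to 1000 range. The abstract states neither this exact statistic nor the score range. The genetic-algorithm optimization step is omitted, and all function names are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs of the 20 standard amino acids.
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(seq):
    """Return the fraction of each dipeptide among the overlapping pairs of seq."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs with non-standard residues
            counts[pair] += 1
            total += 1
    if total == 0:
        return dict.fromkeys(DIPEPTIDES, 0.0)
    return {dp: c / total for dp, c in counts.items()}


def mean_composition(seqs):
    """Average the dipeptide composition over a list of sequences."""
    comps = [dipeptide_composition(s) for s in seqs]
    return {dp: sum(c[dp] for c in comps) / len(comps) for dp in DIPEPTIDES}


def initial_scores(soluble, insoluble, scale=1000.0):
    """Assumed initial scoring card: composition difference, min-max rescaled.

    The statistic and the 0..scale range are assumptions of this sketch.
    """
    pos = mean_composition(soluble)
    neg = mean_composition(insoluble)
    diff = {dp: pos[dp] - neg[dp] for dp in DIPEPTIDES}
    lo, hi = min(diff.values()), max(diff.values())
    if hi == lo:  # no discrimination at all: give every dipeptide the midpoint
        return dict.fromkeys(DIPEPTIDES, scale / 2)
    return {dp: scale * (d - lo) / (hi - lo) for dp, d in diff.items()}


def solubility_score(seq, scores):
    """Sum of propensity scores weighted by the sequence's dipeptide composition."""
    comp = dipeptide_composition(seq)
    return sum(comp[dp] * scores[dp] for dp in DIPEPTIDES)


# Toy sequences for illustration only; they are not real training data.
toy_soluble = ["AKAKAK", "KAKAKA"]
toy_insoluble = ["WWWWWW", "WFWFWF"]
card = initial_scores(toy_soluble, toy_insoluble)
```

With such a card, a sequence would be predicted soluble when its score exceeds a threshold chosen on the training set. The threshold selection and the subsequent optimization of the card are outside this sketch.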

CONCLUSIONS

The propensities of individual dipeptides to be soluble vary for proteins under different experimental conditions. To predict protein solubility accurately using SCM, it is better to customize the scoring card of dipeptide propensities using a training data set obtained under the same experimental conditions. SCM, with its solubility scores and dipeptide propensities, can be easily applied to protein function prediction problems in which dipeptide composition features play an important role.

AVAILABILITY

The datasets used, the source code of SCM, and supplementary files are available at http://iclab.life.nctu.edu.tw/SCM/.
