Supervised learning method for the prediction of subcellular localization of proteins using amino acid and amino acid pair composition.

Author information

Habib Tanwir, Zhang Chaoyang, Yang Jack Y, Yang Mary Qu, Deng Youping

Affiliation

Department of Biological Sciences, University of Southern Mississippi, Hattiesburg, MS 39406, USA.

Publication information

BMC Genomics. 2008;9 Suppl 1(Suppl 1):S16. doi: 10.1186/1471-2164-9-S1-S16.

Abstract

BACKGROUND

Knowing where a protein is located in the cell is an important step in understanding its function. It is highly desirable to predict a protein's subcellular location automatically from its sequence. The most widely studied approaches to predicting subcellular localization rely on signal peptides, on sequence homology, and on the correlation between a protein's total amino acid composition and its location. Taking both amino acid composition and amino acid pair composition into consideration helps improve prediction accuracy.

RESULTS

We constructed a dataset of protein sequences from the SWISS-PROT database and divided them into 12 classes based on their subcellular locations. SVM modules were trained to predict subcellular location from amino acid composition and amino acid pair composition. Results were calculated using 10-fold cross-validation. The radial basis function (RBF) kernel outperformed the polynomial and linear kernels. Total prediction accuracy reached 71.8% for amino acid composition and 77.0% for amino acid pair composition. To observe the impact of the number of subcellular locations, we constructed two more datasets with nine and five subcellular locations. Total accuracy improved further to 79.9% and 85.66%, respectively.
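The two feature types named above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' code: it assumes that amino acid composition means the fraction of each of the 20 standard residues, and that amino acid pair composition means the fraction of each ordered pair of adjacent residues (400 values). The abstract does not spell out either definition, and the function names here are hypothetical.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
# All 400 ordered residue pairs, in a fixed order (assumed feature layout).
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(seq):
    """Fraction of each standard residue in seq (20 values)."""
    # Non-standard symbols (e.g. X, B, Z) are skipped; this is an assumption.
    residues = [c for c in seq.upper() if c in AMINO_ACIDS]
    counts = Counter(residues)
    total = len(residues)
    return [counts[a] / total if total else 0.0 for a in AMINO_ACIDS]


def pair_composition(seq):
    """Fraction of each ordered adjacent residue pair in seq (400 values)."""
    seq = seq.upper()
    # Overlapping adjacent pairs: positions (0,1), (1,2), ...
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    # Keep only pairs made of two standard residues.
    pairs = [p for p in pairs if p[0] in AMINO_ACIDS and p[1] in AMINO_ACIDS]
    counts = Counter(pairs)
    total = len(pairs)
    return [counts[p] / total if total else 0.0 for p in PAIRS]
```

Under these assumptions, each sequence maps to a 20-value or 400-value vector that could be passed to an SVM. The larger pair vector keeps some local order information that the single-residue fractions discard.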

CONCLUSIONS

A new SVM-based approach using amino acid and amino acid pair composition is presented. The results show that data simulation and taking more protein features into consideration improve accuracy to a great extent. It was also noted that the dataset needs to be constructed to take account of the distribution of data across all classes.
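One way to respect the class distribution during 10-fold cross-validation is to build the folds class by class, so that each fold holds a near-equal share of every subcellular location. The sketch below is an assumption for illustration only: the abstract does not say whether the folds were stratified, and the function name `stratified_folds` is hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Assign each sample index to one of k folds, class by class."""
    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed so the split is repeatable
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal indices round-robin so each fold gets a near-equal share of this class.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds
```

Each fold can then serve once as the test set while the remaining folds are used for training. A class with fewer than k samples will be absent from some folds, which is one reason small classes need attention when the dataset is built.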
