Amino acid classification based spectrum kernel fusion for protein subnuclear localization.

Author information

Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai, PR China.

Publication information

BMC Bioinformatics. 2010 Jan 18;11 Suppl 1(Suppl 1):S17. doi: 10.1186/1471-2105-11-S1-S17.

Abstract

BACKGROUND

Predicting protein localization in subnuclear organelles is more challenging than general protein subcellular localization. To the best of our knowledge, only three computational models for protein subnuclear localization exist thus far. Two of them are based on protein primary sequence only. The first model assumed a homogeneous amino acid substitution pattern across all residue sites of a protein sequence and used BLOSUM62 to encode the k-mers of the sequence. An ensemble of SVMs based on different k-mers made the final prediction, achieving 50% overall accuracy. This simplifying assumption did not exploit the protein sequence profile and ignored the fact that amino acid substitution patterns are heterogeneous across sites. The second model derived the PsePSSM feature representation from the protein sequence by simply averaging the PSSM profile and combined it with the PseAA feature representation to construct a kNN ensemble classifier, Nuc-PLoc, achieving 67.4% overall accuracy. Both sequence-only models achieved relatively poor predictive performance. The third model required GO annotations to be available, which restricts its applicability.

METHODS

In this paper, we use only the amino acid information of the protein sequence, without any other information, to design a widely applicable model for protein subnuclear localization. We use the k-spectrum kernel to exploit the contextual information around an amino acid and the conserved motif information. Besides expanding the window size, we adopt various amino acid classification approaches to capture diverse aspects of amino acid physicochemical properties. Each amino acid classification generates a series of spectrum kernels based on different window sizes. Thus, (I) window expansion can capture more contextual information and cover size-varying motifs; (II) various amino acid classifications can exploit multi-aspect biological information from the protein sequence. Finally, we combine all the spectrum kernels by simple addition into one single kernel, called SpectrumKernel+, for protein subnuclear localization.
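The construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the abstract does not list the amino acid classifications or window sizes actually used, so the three-class grouping `HYDROPATHY`, the identity mapping `IDENTITY`, the function names, and the choice of k values are all assumptions made for the example.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumption: two illustrative "classifications" of the 20 amino acids.
# IDENTITY keeps every residue distinct; HYDROPATHY is a hypothetical
# three-class grouping (h = hydrophobic, p = polar, c = charged).
IDENTITY = {aa: aa for aa in AMINO_ACIDS}
HYDROPATHY = {}
for group, label in (("AVLIMFWC", "h"), ("GPSTYNQ", "p"), ("DEKRH", "c")):
    for aa in group:
        HYDROPATHY[aa] = label


def kmer_counts(seq, k, mapping):
    """Count the k-mers of seq after rewriting it in the reduced alphabet."""
    # Residues outside the 20 standard letters are mapped to "?" (assumption).
    reduced = "".join(mapping.get(aa, "?") for aa in seq)
    return Counter(reduced[i:i + k] for i in range(len(reduced) - k + 1))


def spectrum_kernel(x, y, k, mapping):
    """k-spectrum kernel: inner product of the two k-mer count vectors."""
    cx = kmer_counts(x, k, mapping)
    cy = kmer_counts(y, k, mapping)
    return sum(n * cy[kmer] for kmer, n in cx.items())


def fused_kernel(x, y, mappings, ks):
    """Sum the spectrum kernels over every classification and window size."""
    return sum(spectrum_kernel(x, y, k, m) for m in mappings for k in ks)


if __name__ == "__main__":
    a, b = "AKL", "VRI"
    # No residue is shared, so the plain spectrum kernel sees no similarity ...
    print(spectrum_kernel(a, b, 1, IDENTITY))
    # ... while the reduced alphabet exposes the shared h-c-h pattern.
    print(fused_kernel(a, b, [IDENTITY, HYDROPATHY], [1, 2]))
```

Because a sum of valid kernels is itself a valid kernel, the fused value can be used wherever a single precomputed kernel is expected, for example as the Gram matrix entry for a kernel classifier.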

RESULTS

We conduct performance evaluation experiments on two benchmark datasets: Lei and Nuc-PLoc. Experimental results show that SpectrumKernel+ achieves a substantial improvement over the previous models. On the Nuc-PLoc dataset, its overall accuracy is 83.47%, against 67.4% for Nuc-PLoc. On the Lei dataset, its overall accuracy is 71.23%, against 50% for the Lei SVM Ensemble and 66.50% for the Lei GO SVM Ensemble.
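The margins implied by these accuracies, in percentage points, work out as follows, reading 83.47% as the comparison against Nuc-PLoc and 71.23% as the comparison against the two Lei ensembles:

```latex
\begin{align*}
83.47 - 67.4  &= 16.07 && \text{(vs. Nuc-PLoc)} \\
71.23 - 50    &= 21.23 && \text{(vs. Lei SVM Ensemble)} \\
71.23 - 66.50 &= 4.73  && \text{(vs. Lei GO SVM Ensemble)}
\end{align*}
```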

CONCLUSION

SpectrumKernel+ exploits the rich amino acid information of a protein sequence by embedding the multi-aspect amino acid physicochemical properties, captured by amino acid classification approaches, into implicit size-varying motifs. The kernels derived from diverse amino acid classification approaches and different k-mer sizes are summed for data integration. Experiments show that SpectrumKernel+ significantly outperforms the existing models for protein subnuclear localization.
