Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai, PR China.
BMC Bioinformatics. 2010 Jan 18;11 Suppl 1(Suppl 1):S17. doi: 10.1186/1471-2105-11-S1-S17.
Predicting protein localization in subnuclear organelles is more challenging than general protein subcellular localization. To the best of our knowledge, only three computational models for protein subnuclear localization exist thus far. Two of them are based on protein primary sequence only. The first model assumed a homogeneous amino acid substitution pattern across all residue sites of a protein sequence and used BLOSUM62 to encode the k-mers of the sequence. An ensemble of SVMs based on different k-mers made the final prediction, achieving 50% overall accuracy. This simplifying assumption did not exploit the protein sequence profile and ignored the fact that amino acid substitution patterns are heterogeneous across sites. The second model derived the PsePSSM feature representation from the protein sequence by simply averaging the PSSM profile, and combined it with the PseAA feature representation to construct a kNN ensemble classifier, Nuc-PLoc, achieving 67.4% overall accuracy. Both sequence-only models achieved relatively poor predictive performance. The third model required GO annotations to be available, which restricts its applicability.
In this paper, we use only the amino acid information of the protein sequence, without any other information, to design a widely applicable model for protein subnuclear localization. We use the K-spectrum kernel to exploit the contextual information around an amino acid and the conserved motif information. Besides expanding the window size, we adopt various amino acid classification approaches to capture diverse aspects of amino acid physicochemical properties. Each amino acid classification generates a series of spectrum kernels, one per window size. Thus, (I) window expansion captures more contextual information and covers size-varying motifs; (II) the various amino acid classifications exploit multi-aspect biological information from the protein sequence. Finally, we combine all the spectrum kernels by simple addition into one single kernel, called SpectrumKernel+, for protein subnuclear localization.
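The construction described above can be illustrated with a minimal Python sketch. It assumes the standard definition of the spectrum kernel (the inner product of k-mer count vectors) and is not the authors' implementation. The `HYDROPATHY` grouping, the function names, and the default window sizes are hypothetical choices made only for illustration; the amino acid classifications and window sizes actually used in the paper are not stated here.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Identity classification: every amino acid is its own class.
IDENTITY = {aa: aa for aa in AMINO_ACIDS}

# Hypothetical three-class grouping, used only for illustration.
HYDROPATHY = {}
for group, label in [("AVLIMFWC", "h"), ("GSTPYH", "n"), ("DENQKR", "p")]:
    for aa in group:
        HYDROPATHY[aa] = label


def kmer_counts(seq, k, alphabet):
    """Count the k-mers of seq after mapping each residue to its class."""
    # Residues outside the 20 standard amino acids map to a catch-all "x".
    reduced = "".join(alphabet.get(aa, "x") for aa in seq)
    return Counter(reduced[i:i + k] for i in range(len(reduced) - k + 1))


def spectrum_kernel(s, t, k, alphabet):
    """k-spectrum kernel: inner product of the two k-mer count vectors."""
    cs = kmer_counts(s, k, alphabet)
    ct = kmer_counts(t, k, alphabet)
    return sum(n * ct[kmer] for kmer, n in cs.items())


def summed_kernel(s, t, ks=(1, 2, 3), alphabets=(IDENTITY, HYDROPATHY)):
    """Add the spectrum kernels over all classifications and window sizes."""
    return sum(spectrum_kernel(s, t, k, a) for a in alphabets for k in ks)
```

Because a sum of valid kernels is itself a valid kernel, the combined value can be passed to any kernel-based classifier. In this sketch, `summed_kernel("AV", "LI", ks=(1, 2))` is nonzero even though the two sequences share no residues, since all four residues fall into the same class of the illustrative grouping.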
We conduct performance evaluation experiments on two benchmark datasets: Lei and Nuc-PLoc. Experimental results show that SpectrumKernel+ achieves a substantial performance improvement over the previous models: 83.47% overall accuracy against 67.4% for Nuc-PLoc, and 71.23% against 50% for Lei SVM Ensemble and 66.50% for Lei GO SVM Ensemble.
SpectrumKernel+ exploits the rich amino acid information in a protein sequence by embedding the multi-aspect amino acid physicochemical properties, captured by the amino acid classification approaches, into implicit size-varying motifs. The kernels derived from the diverse amino acid classification approaches and the different k-mer sizes are summed for data integration. Experiments show that SpectrumKernel+ significantly outperforms the existing models for protein subnuclear localization.