College of Information Science and Engineering, Hunan University, Changsha 410082, China.
School of Mathematics and Statistics, Hainan Normal University, Haikou 570100, China.
Molecules. 2019 Mar 6;24(5):919. doi: 10.3390/molecules24050919.
Predicting protein subcellular localization is critical for inferring protein function, gene regulation and protein-protein interactions. With advances in high-throughput sequencing technologies and proteomic methods, the protein sequences of numerous yeasts have become publicly available, which makes it possible to predict yeast protein subcellular localization computationally. However, widely used protein sequence representations, such as amino acid composition and Chou's pseudo amino acid composition (PseAAC), struggle to capture adequate information about the interactions between residues and the positional distribution of each residue. Novel sequence representations are therefore still needed. In this study, we present two novel protein sequence representation techniques: Generalized Chaos Game Representation (GCGR), based on the frequency and distribution of residues in the protein primary sequence, and novel statistics and information theory (NSI), which reflects local position information in the sequence. In the GCGR + NSI representation, a protein primary sequence is represented by a 5-dimensional feature vector, whereas other popular methods such as PseAAC and dipeptide composition use features with hundreds of dimensions. In practice, this feature representation is highly efficient for predicting protein subcellular localization. Even without a machine learning-based classifier, a simple model based on the feature vector achieves prediction accuracies of 0.8825 and 0.7736 on the CL317 and ZW225 datasets, respectively. To further evaluate the effectiveness of the proposed encoding schemes, we introduce a multi-view features-based method that combines the two above-mentioned features with other well-known features, including PseAAC and dipeptide composition, and uses a support vector machine as the classifier to predict protein subcellular localization. This model achieves prediction accuracies of 0.927 and 0.871 on the CL317 and ZW225 datasets, respectively, outperforming other existing methods in jackknife tests. The results suggest that the GCGR and NSI features are useful complements to popular protein sequence representations for predicting yeast protein subcellular localization. Finally, we validate a few newly predicted protein subcellular localizations using evidence from published articles in authoritative journals and books.
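To make the dimensionality contrast concrete, the following minimal sketch computes the two conventional baseline representations named above: amino acid composition (20 dimensions) and dipeptide composition (400 dimensions). It is an illustration only, not the authors' code. The function names, the 20-letter alphabet ordering, and the choice to skip non-standard residues are assumptions made here. GCGR and NSI are not reproduced because the abstract does not give their definitions.

```python
from itertools import product

# Assumption: the 20 standard amino acids in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered residue pairs: AA, AC, AD, ..., YY.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(sequence):
    """Return the 20-dimensional vector of residue frequencies."""
    # Non-standard symbols (e.g. X) are skipped; this is a simplification.
    residues = [r for r in sequence.upper() if r in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard residues")
    return [residues.count(a) / len(residues) for a in AMINO_ACIDS]


def dipeptide_composition(sequence):
    """Return the 400-dimensional vector of adjacent-pair frequencies."""
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        # Pairs containing a non-standard symbol are ignored.
        if pair in counts:
            counts[pair] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard dipeptides")
    return [counts[d] / total for d in DIPEPTIDES]


if __name__ == "__main__":
    toy_sequence = "MKTAYIAKQR"  # hypothetical sequence, not from CL317 or ZW225
    print(len(amino_acid_composition(toy_sequence)),
          len(dipeptide_composition(toy_sequence)))  # prints: 20 400
```

Both vectors depend only on counts of residues or adjacent pairs, so they discard where in the sequence each residue occurs. That is the limitation the GCGR and NSI features are intended to address.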