Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong SAR, China.
Department of Electrical Engineering, Princeton University, Princeton, New Jersey, United States of America.
PLoS One. 2014 Mar 19;9(3):e89545. doi: 10.1371/journal.pone.0089545. eCollection 2014.
Protein subcellular localization prediction, an essential step in elucidating the in vivo functions of proteins and identifying drug targets, has been studied extensively in previous decades. Rather than determining the subcellular localization of single-label proteins only, recent studies have focused on predicting both single- and multi-location proteins. Computational methods based on Gene Ontology (GO) have been demonstrated to be superior to methods based on other features. However, existing GO-based methods focus on the occurrences of GO terms and disregard the relationships between them. This paper proposes a multi-label subcellular-localization predictor, namely HybridGO-Loc, that leverages not only the occurrences of GO terms but also the inter-term relationships. This is achieved by hybridizing the frequencies of occurrence of GO terms with the semantic similarity between GO terms. Given a protein, a set of GO terms is retrieved by searching the Gene Ontology database, using the accession numbers of homologous proteins obtained via BLAST search as the keys. The frequencies of GO occurrences and the semantic similarity (SS) between GO terms are used to formulate frequency vectors and semantic similarity vectors, respectively, which are subsequently hybridized to construct fusion vectors. An adaptive-decision based multi-label support vector machine (SVM) classifier is proposed to classify the fusion vectors. Experimental results based on recent benchmark datasets and a new dataset containing novel proteins show that the proposed hybrid-feature predictor significantly outperforms predictors based on individual GO features as well as other state-of-the-art predictors. For readers' convenience, the HybridGO-Loc server, which predicts the localization of virus or plant proteins, is available online at http://bioinfo.eie.polyu.edu.hk/HybridGoServer/.
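The feature construction described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the two vectors are hybridized or how the semantic similarity vector is defined, so the sketch assumes simple concatenation for the fusion and takes, for each vocabulary term, the maximum similarity to any GO term retrieved for the protein. The function names, the three-term vocabulary, and the similarity values in `toy_sim` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def frequency_vector(go_terms, vocabulary):
    """Count how often each vocabulary term occurs among the retrieved GO terms."""
    counts = Counter(go_terms)
    return [float(counts[term]) for term in vocabulary]


def similarity_vector(go_terms, vocabulary, sim):
    """For each vocabulary term, take the best similarity to any retrieved term.

    Assumption: max-pooling over the protein's GO terms; identical terms score 1.0
    and pairs missing from the lookup table score 0.0.
    """
    retrieved = set(go_terms)
    vector = []
    for term in vocabulary:
        best = max(
            (1.0 if term == other else sim.get(frozenset((term, other)), 0.0)
             for other in retrieved),
            default=0.0,
        )
        vector.append(best)
    return vector


def fusion_vector(go_terms, vocabulary, sim):
    """Assumption: hybridize by concatenating the two feature vectors."""
    return frequency_vector(go_terms, vocabulary) + similarity_vector(
        go_terms, vocabulary, sim
    )


# Toy vocabulary: nucleus, cytoplasm, plasma membrane.
vocabulary = ["GO:0005634", "GO:0005737", "GO:0005886"]

# Made-up pairwise similarities, not values from the paper.
toy_sim = {
    frozenset(("GO:0005634", "GO:0005737")): 0.4,
    frozenset(("GO:0005737", "GO:0005886")): 0.2,
}

# GO terms gathered for one protein via its homologs (repeats are kept).
retrieved_terms = ["GO:0005634", "GO:0005634", "GO:0005737"]

fused = fusion_vector(retrieved_terms, vocabulary, toy_sim)
print(fused)
```

In this sketch the first half of `fused` records how often each term was seen, while the second half gives non-zero weight to a term that was never retrieved but is related to one that was, which is the kind of inter-term information that occurrence counts alone discard. The resulting vector would then be passed to a multi-label classifier.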