Wan Shibiao, Mak Man-Wai, Kung Sun-Yuan
Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong SAR, China.
Anal Biochem. 2015 Mar 15;473:14-27. doi: 10.1016/j.ab.2014.10.014. Epub 2014 Oct 31.
A protein must reside in the appropriate cellular compartment to carry out its biological function, so computational prediction of protein subcellular localization has become essential in the post-genomic era. Recent studies address not only single-location proteins but also multi-location proteins, yet most existing predictors remain far from effective on multi-label proteins. This article proposes mPLR-Loc, an efficient multi-label predictor based on penalized logistic regression and adaptive decisions, for predicting both single- and multi-location proteins. For each query protein, mPLR-Loc retrieves information from the Gene Ontology (GO) database using the protein's accession number (AC) or the ACs of its homologs obtained via BLAST. The frequencies of GO term occurrences are used to construct feature vectors, which are then classified by a multi-label penalized logistic regression classifier with an adaptive decision scheme. Experimental results on two recent stringent benchmark datasets (virus and plant) show that mPLR-Loc substantially outperforms existing state-of-the-art multi-label predictors. Besides predicting the subcellular localization of single- and multi-label proteins rapidly and accurately, mPLR-Loc also provides probabilistic confidence scores for its prediction decisions. The mPLR-Loc server is available online (http://bioinfo.eie.polyu.edu.hk/mPLRLocServer).
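The pipeline described in the abstract (GO term frequencies as features, one penalized logistic regression per location, then an adaptive decision over the resulting probabilities) can be illustrated with a minimal sketch. This is not the authors' implementation. The toy training set, the function names, the gradient-descent fitting, the hyperparameters, and the specific decision rule (keep every location whose probability is at least `theta` times the highest probability) are all assumptions made for illustration, and the GO retrieval and BLAST homolog search are omitted entirely.

```python
import math
from collections import Counter

# Real GO cellular-component IDs, used only as a tiny illustrative vocabulary.
VOCAB = ["GO:0005634",  # nucleus
         "GO:0005737",  # cytoplasm
         "GO:0016020"]  # membrane
LOCATIONS = ["nucleus", "cytoplasm"]

def go_frequency_vector(go_terms, vocab=VOCAB):
    """Count how often each vocabulary GO term occurs for one protein."""
    counts = Counter(go_terms)
    return [float(counts[t]) for t in vocab]

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def score(model, x):
    """Probability that a feature vector belongs to the model's location."""
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def train_plr(X, y, penalty=0.01, lr=0.5, epochs=2000):
    """Fit one L2-penalized logistic regression by batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [penalty * wj for wj in w]  # gradient of (penalty/2)*||w||^2
        grad_b = 0.0                         # the bias is left unpenalized
        for xi, yi in zip(X, y):
            err = score((w, b), xi) - yi
            for j in range(d):
                grad_w[j] += err * xi[j] / n
            grad_b += err / n
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def adaptive_decision(probs, theta=0.5):
    """Assumed rule: keep locations scoring at least theta * the top score."""
    top = max(probs.values())
    return {loc for loc, p in probs.items() if p >= theta * top}

# Hypothetical toy training set: (GO terms of a protein, its true locations).
TRAIN = [
    (["GO:0005634", "GO:0005634"], {"nucleus"}),
    (["GO:0005737"], {"cytoplasm"}),
    (["GO:0005634", "GO:0005737"], {"nucleus", "cytoplasm"}),
    (["GO:0005737", "GO:0005737"], {"cytoplasm"}),
    (["GO:0005634"], {"nucleus"}),
]
X_TRAIN = [go_frequency_vector(terms) for terms, _ in TRAIN]
# One binary classifier per location (one-vs-rest multi-label setup).
MODELS = {
    loc: train_plr(X_TRAIN, [1.0 if loc in labels else 0.0 for _, labels in TRAIN])
    for loc in LOCATIONS
}

def predict(go_terms, theta=0.5):
    """Return (predicted location set, per-location probabilities)."""
    x = go_frequency_vector(go_terms)
    probs = {loc: score(model, x) for loc, model in MODELS.items()}
    return adaptive_decision(probs, theta), probs

if __name__ == "__main__":
    labels, probs = predict(["GO:0005634", "GO:0005737"])
    print(sorted(labels), probs)
```

Because the threshold scales with the top score, the highest-scoring location is always returned, so every query receives at least one label, while a second location is added only when its probability is comparable. The per-location probabilities returned alongside the labels play the role of the confidence scores mentioned in the abstract.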