Wan Shibiao, Mak Man-Wai, Kung Sun-Yuan
Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong, SAR, China.
Department of Electrical Engineering, Princeton University, New Jersey, USA.
BMC Bioinformatics. 2016 Feb 24;17:97. doi: 10.1186/s12859-016-0940-x.
Predicting protein subcellular localization is indispensable for inferring protein functions. Recent studies have focused on predicting not only single-location proteins but also multi-location proteins. Almost all of the high-performing predictors proposed recently use gene ontology (GO) terms to construct feature vectors for classification. Despite their high performance, their prediction decisions are difficult to interpret because of the large number of GO terms involved.
This paper proposes using sparse regressions to exploit GO information for both predicting and interpreting the subcellular localization of single- and multi-location proteins. Specifically, we compared two multi-label sparse regression algorithms, namely multi-label LASSO (mLASSO) and multi-label elastic net (mEN), for large-scale predictions of protein subcellular localization. Both algorithms yield sparse and interpretable solutions. Using the one-vs-rest strategy, mLASSO and mEN identified 87 and 429 GO terms, respectively, out of more than 8,000; these terms play essential roles in determining subcellular localization. More interestingly, many of the GO terms selected by mEN are from the biological process and molecular function categories, suggesting that GO terms of these categories also play vital roles in the prediction. With these essential GO terms, the predictors can not only decide where a protein is located but also reveal why it resides there.
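To make the one-vs-rest sparse regression idea concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It fits one elastic-net regression per location by coordinate descent (setting `l1_ratio=1.0` gives LASSO) and reports which feature columns receive non-zero weights. The toy GO-term matrix, the ±1 target coding, the omission of an intercept, the decision rule (score above zero, falling back to the top-scoring location), and all function names are assumptions made for illustration.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the L1 proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_elastic_net(X, y, alpha, l1_ratio, n_sweeps=200):
    """Coordinate descent for
    (1/2n)||y - Xw||^2 + alpha*l1_ratio*||w||_1 + 0.5*alpha*(1-l1_ratio)*||w||^2.
    No intercept is fitted (a simplification for this sketch)."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    l1 = alpha * l1_ratio
    l2 = alpha * (1.0 - l1_ratio)
    for _ in range(n_sweeps):
        for j in range(p):
            col = [X[i][j] for i in range(n)]
            z = sum(c * c for c in col) / n
            # Correlation of feature j with the residual that excludes w[j].
            rho = sum(c * (r + c * w[j]) for c, r in zip(col, residual)) / n
            denom = z + l2
            new_w = soft_threshold(rho, l1) / denom if denom > 0 else 0.0
            delta = new_w - w[j]
            if delta != 0.0:
                residual = [r - c * delta for c, r in zip(col, residual)]
                w[j] = new_w
    return w


def fit_one_vs_rest(X, label_sets, n_labels, alpha, l1_ratio):
    """Fit one sparse regression per location: +1 if present, -1 otherwise."""
    models = []
    for k in range(n_labels):
        y = [1.0 if k in labels else -1.0 for labels in label_sets]
        models.append(fit_elastic_net(X, y, alpha, l1_ratio))
    return models


def selected_features(models):
    """Indices of features with a non-zero weight in at least one model."""
    return {j for w in models for j, wj in enumerate(w) if wj != 0.0}


def predict(models, x):
    """Assumed decision rule: every location with a positive score,
    or the single top-scoring location if none is positive."""
    scores = [sum(wj * xj for wj, xj in zip(w, x)) for w in models]
    chosen = {k for k, s in enumerate(scores) if s > 0.0}
    return chosen or {max(range(len(scores)), key=scores.__getitem__)}


# Toy data (hypothetical): rows are proteins, columns are GO-term indicators.
# Column 3 never occurs, so no model can assign it a non-zero weight.
X = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
]
label_sets = [{0}, {0}, {1}, {1}, {0, 1}]  # the last protein is multi-location

lasso_models = fit_one_vs_rest(X, label_sets, n_labels=2, alpha=0.02, l1_ratio=1.0)
en_models = fit_one_vs_rest(X, label_sets, n_labels=2, alpha=0.02, l1_ratio=0.5)

print("LASSO selected columns:", sorted(selected_features(lasso_models)))
print("Elastic net selected columns:", sorted(selected_features(en_models)))
print("LASSO predictions:", [predict(lasso_models, x) for x in X])
```

Because each location has its own weight vector, the non-zero entries can be read directly as the GO terms that push a protein toward or away from that location, which is the sense in which a sparse solution is interpretable. Raising `alpha` shrinks more weights to exactly zero.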
Experimental results show that the outputs of both mEN and mLASSO are interpretable and that both perform significantly better than existing state-of-the-art predictors. Moreover, mEN selects more features and performs better than mLASSO on a stringent human benchmark dataset. For readers' convenience, an online server called SpaPredictor for both mLASSO and mEN is available at http://bioinfo.eie.polyu.edu.hk/SpaPredictorServer/.