Sparse regressions for predicting and interpreting subcellular localization of multi-label proteins.

Authors

Wan Shibiao, Mak Man-Wai, Kung Sun-Yuan

Affiliations

Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong, SAR, China.

Department of Electrical Engineering, Princeton University, New Jersey, USA.

Publication

BMC Bioinformatics. 2016 Feb 24;17:97. doi: 10.1186/s12859-016-0940-x.

Abstract

BACKGROUND

Predicting protein subcellular localization is indispensable for inferring protein functions. Recent studies have focused on predicting not only single-location proteins but also multi-location proteins. Almost all of the high-performing predictors proposed recently use gene ontology (GO) terms to construct feature vectors for classification. Despite their high performance, their prediction decisions are difficult to interpret because of the large number of GO terms involved.

RESULTS

This paper proposes using sparse regressions to exploit GO information for both predicting and interpreting the subcellular localization of single- and multi-location proteins. Specifically, we compared two multi-label sparse regression algorithms, multi-label LASSO (mLASSO) and multi-label elastic net (mEN), for large-scale prediction of protein subcellular localization. Both algorithms yield sparse and interpretable solutions. Using the one-vs-rest strategy, mLASSO and mEN identified 87 and 429 GO terms, respectively, out of more than 8,000, that play essential roles in determining subcellular localization. More interestingly, many of the GO terms selected by mEN are from the biological process and molecular function categories, suggesting that GO terms in these categories also play vital roles in the prediction. These essential GO terms make it possible not only to decide where a protein is located, but also to reveal why it resides there.
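The one-vs-rest sparse regression idea described above can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the coordinate-descent solver, the toy data, the placeholder GO-term names (`term_A`, `term_A_dup`, `term_B`, `term_noise`), the hyperparameters `alpha` and `l1_ratio`, and the 0.5 decision threshold are all assumptions made for illustration. Each location gets its own regression on binary GO-term occurrence features; setting `l1_ratio=1.0` gives a LASSO-style fit, and smaller values give an elastic-net-style fit.

```python
# Toy sketch (hypothetical data and names): one-vs-rest sparse regression
# over binary GO-term features, using only the Python standard library.

def soft_threshold(z, t):
    """Shrink z towards zero by t; this is what produces exact zeros."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def elastic_net_cd(X, y, alpha, l1_ratio, n_iter=200):
    """Coordinate descent (no intercept) for the objective
    (1/2n)||y - Xw||^2 + alpha*l1_ratio*||w||_1
                       + 0.5*alpha*(1 - l1_ratio)*||w||_2^2."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    resid = list(y)  # equals y - Xw while w is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(d)]
    for _ in range(n_iter):
        for j in range(d):
            if col_sq[j] == 0.0:
                continue  # feature never occurs, weight stays zero
            # Correlation of feature j with the residual that excludes j.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_w = soft_threshold(rho, alpha * l1_ratio) / (
                col_sq[j] + alpha * (1.0 - l1_ratio))
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                w[j] = new_w
    return w

def fit_one_vs_rest(X, Y, alpha, l1_ratio):
    """Fit one sparse regression per location (column of Y)."""
    return [elastic_net_cd(X, [row[k] for row in Y], alpha, l1_ratio)
            for k in range(len(Y[0]))]

def selected_features(W, tol=1e-8):
    """Indices of features with a non-negligible weight for any location.
    The tolerance guards against floating-point round-off."""
    return sorted({j for w in W for j, v in enumerate(w) if abs(v) > tol})

def predict(W, x, threshold=0.5):
    """Return every location whose score exceeds the (assumed) threshold,
    falling back to the single best-scoring location."""
    scores = [sum(wj * xj for wj, xj in zip(w, x)) for w in W]
    labels = [k for k, s in enumerate(scores) if s > threshold]
    return labels or [max(range(len(scores)), key=scores.__getitem__)]

def explain(W, x, k, names, tol=1e-8):
    """GO terms present in x that contribute to the score for location k."""
    return [(names[j], W[k][j]) for j, xj in enumerate(x)
            if xj and abs(W[k][j]) > tol]

# Hypothetical GO terms; term_A_dup always co-occurs with term_A.
GO_TERMS = ["term_A", "term_A_dup", "term_B", "term_noise"]
X = [[1, 1, 0, 0], [1, 1, 0, 1], [0, 0, 1, 1],
     [0, 0, 1, 0], [1, 1, 1, 0], [1, 1, 1, 1]]
# Two locations; the last two proteins are multi-location.
Y = [[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [1, 1]]

W_lasso = fit_one_vs_rest(X, Y, alpha=0.05, l1_ratio=1.0)
W_en = fit_one_vs_rest(X, Y, alpha=0.05, l1_ratio=0.8)

print("LASSO-style selection:", [GO_TERMS[j] for j in selected_features(W_lasso)])
print("Elastic-net-style selection:", [GO_TERMS[j] for j in selected_features(W_en)])
print("Predicted locations for protein 4:", predict(W_en, X[4]))
print("Terms behind location 0:", explain(W_en, X[4], 0, GO_TERMS))
```

In this toy setting the elastic-net-style fit retains both of the perfectly correlated terms while the LASSO-style fit keeps only one. That is a general tendency of the elastic net penalty and is consistent with, though not confirmed by the abstract as the reason for, mEN selecting more GO terms than mLASSO. The nonzero weights returned by `explain` are what make a prediction inspectable: they list which GO terms pushed a protein towards a given location.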

CONCLUSIONS

Experimental results show that the outputs of both mEN and mLASSO are interpretable and that both perform significantly better than existing state-of-the-art predictors. Moreover, mEN selects more features and performs better than mLASSO on a stringent human benchmark dataset. An online server called SpaPredictor, covering both mLASSO and mEN, is available at http://bioinfo.eie.polyu.edu.hk/SpaPredictorServer/.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/339f/4765148/4e8aebacdf8e/12859_2016_940_Fig1_HTML.jpg
