Department of Computational Biology, Graduate School of Frontier Sciences, University of Tokyo, Kashiwa, Japan.
BMC Genomics. 2014 Jan 20;15:46. doi: 10.1186/1471-2164-15-46.
Protein subcellular localization is a central problem in understanding cell biology and has been the focus of intense research. To predict localization from amino acid sequence, a myriad of features have been tried, including amino acid composition, sequence similarity, the presence of certain motifs or domains, and many others. Surprisingly, sequence conservation of sorting motifs has not yet been employed, despite its extensive use for tasks such as the prediction of transcription factor binding sites.
Here, we flip the problem around and present a proof of concept for the idea that the lack of sequence conservation can be a novel feature for localization prediction. We show that for yeast, mammal and plant datasets, evolutionary sequence divergence alone has significant power to identify sequences with N-terminal sorting sequences. Moreover, sequence divergence is nearly as effective when computed on automatically defined ortholog sets as on hand-curated ones. Unfortunately, sequence divergence did not necessarily increase classification performance when combined with some traditional sequence features such as amino acid composition. However, a post-hoc analysis of the proteins for which sequence divergence changes the prediction yielded some proteins with atypical (i.e. not MPP-cleaved) matrix targeting signals, as well as a few misannotations.
We report the results of the first quantitative study of the effectiveness of evolutionary sequence divergence as a feature for protein subcellular localization prediction. We show that divergence is indeed useful for prediction, but that it is not trivial to improve overall accuracy simply by adding this feature to classical sequence features. Nevertheless, we argue that sequence divergence is a promising feature and show anecdotal examples in which it succeeds where other features fail.
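To make the idea of N-terminal sequence divergence as a feature concrete, the following is a minimal sketch and not the method used in the study. It assumes that divergence is scored as the mean per-column Shannon entropy over the first columns of a multiple alignment of orthologs; the function names, the gap character `-`, and the default window of 40 columns are all illustrative assumptions.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of one alignment column, ignoring gaps ('-')."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    counts = Counter(residues)
    total = len(residues)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def n_terminal_divergence(alignment, window=40):
    """Mean column entropy over the first `window` columns of an alignment.

    `alignment` is a list of equal-length aligned ortholog sequences.
    Both the entropy score and the window size are assumptions made
    for illustration; higher values mean a less conserved N-terminus.
    """
    if not alignment:
        raise ValueError("alignment must contain at least one sequence")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    n_cols = min(window, length)
    if n_cols == 0:
        return 0.0
    # zip(*rows) transposes the truncated rows into columns.
    columns = zip(*(seq[:n_cols] for seq in alignment))
    return sum(column_entropy(col) for col in columns) / n_cols
```

Under these assumptions, a score computed this way could be fed to a classifier on its own or alongside classical features such as amino acid composition, which is the kind of combination the abstract reports as not necessarily improving accuracy.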