Sharma Alok, Dehzangi Abdollah, Lyons James, Imoto Seiya, Miyano Satoru, Nakai Kenta, Patil Ashwini
School of Engineering and Physics, The University of the South Pacific, Suva, Fiji ; Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, Australia.
Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, Australia ; National Information and Communication Technology Australia (NICTA), Brisbane, Australia.
PLoS One. 2014 Feb 24;9(2):e89890. doi: 10.1371/journal.pone.0089890. eCollection 2014.
With the exponential increase in the number of sequenced organisms, automated annotation of proteins is becoming increasingly important. Intrinsically disordered regions are known to play a significant role in protein function. Despite their abundance, especially in eukaryotes, they are rarely used to inform function prediction systems. In this study, we extracted seven sequence features from intrinsically disordered regions and developed a scheme that uses them to predict the Gene Ontology (GO) Slim terms associated with proteins. We evaluated the function prediction performance of each feature. Our results indicate that the residue-composition-based features have the highest precision, while bigram probabilities, based on sequence profiles of intrinsically disordered regions obtained from PSI-BLAST, have the highest recall. Amino acid bigrams and features based on secondary structure show an intermediate level of precision and recall. Almost all features showed high prediction performance for GO Slim terms related to extracellular matrix, nucleus, RNA binding and DNA binding. However, feature performance varied significantly across GO Slim terms, emphasizing the need for a unique classifier optimized for the prediction of each functional term. These findings provide a first comprehensive and quantitative evaluation of sequence features in intrinsically disordered regions and will help in the development of a more informative protein function predictor.
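As an illustration of two of the feature types named in the abstract, the following minimal sketch computes residue composition and amino acid bigram frequencies over the disordered residues of one protein. It is not the authors' implementation. The function names, the per-residue boolean disorder mask (assumed to come from some external disorder predictor), and the choice to count bigrams only within each disordered segment are all assumptions made for illustration. The profile-based bigram probabilities derived from PSI-BLAST and the secondary structure features are not covered.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids, in a fixed order that defines the feature layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def disordered_segments(sequence, disorder_mask):
    """Return the maximal runs of residues flagged as disordered.

    disorder_mask is a hypothetical per-residue flag (truthy = disordered),
    assumed to be produced by an external disorder predictor.
    """
    segments, current = [], []
    for residue, is_disordered in zip(sequence, disorder_mask):
        if is_disordered:
            current.append(residue)
        elif current:
            segments.append("".join(current))
            current = []
    if current:
        segments.append("".join(current))
    return segments


def composition(segments):
    """20-dimensional residue composition over all disordered residues."""
    counts = Counter(r for seg in segments for r in seg if r in AMINO_ACIDS)
    total = sum(counts.values())
    return [counts[a] / total if total else 0.0 for a in AMINO_ACIDS]


def bigram_frequencies(segments):
    """400-dimensional frequencies of adjacent residue pairs.

    Pairs are counted inside each segment only, so two residues separated
    by an ordered stretch never form a bigram (an assumption of this sketch).
    """
    counts = Counter()
    for seg in segments:
        for a, b in zip(seg, seg[1:]):
            if a in AMINO_ACIDS and b in AMINO_ACIDS:
                counts[a + b] += 1
    total = sum(counts.values())
    return [
        counts[a + b] / total if total else 0.0
        for a, b in product(AMINO_ACIDS, repeat=2)
    ]


# Toy example: residues 3-7 and 11-12 are flagged as disordered.
sequence = "MKSSPEEDAALG"
mask = [0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1]
segments = disordered_segments(sequence, mask)
features = composition(segments) + bigram_frequencies(segments)
```

The resulting 420-dimensional vector could be passed to any per-term binary classifier, which matches the abstract's point that each GO Slim term may need its own optimized classifier.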