Predicting protein subnuclear localization using GO-amino-acid composition features.

Authors

Huang Wen-Lin, Tung Chun-Wei, Huang Hui-Ling, Ho Shinn-Ying

Affiliation

Department of Management Information System, Chin Min Institute of Technology, Miaoli, Taiwan.

Publication

Biosystems. 2009 Nov;98(2):73-9. doi: 10.1016/j.biosystems.2009.06.007. Epub 2009 Jul 5.

Abstract

The nucleus directs the life processes of the cell, and many of the nuclear proteins involved in these processes concentrate in specific subnuclear compartments. Knowing the subnuclear localization of nuclear proteins is therefore important for understanding the organization and functions of the nucleus. Recently, Gene Ontology (GO) annotation has been used to predict subnuclear localization. However, using GO terms effectively in sequence-based prediction remains challenging, especially when a query protein sequence has no accession number or annotated GO term. This study uses BLAST to find homologs of query proteins that have known accession numbers, and retrieves GO terms from those homologs for sequence-based subnuclear localization prediction. A prediction method, PGAC, is proposed; it mines informative GO terms in combination with amino acid composition features to design a support vector machine-based classifier. On the data set SNL_35 (561 proteins in 9 localizations) with 35% sequence identity, PGAC selects 55 informative GO terms and achieves training and test accuracies of 85.7% and 76.3%, respectively. On the data set SNL_80, PGAC achieves a leave-one-out cross-validation accuracy of 81.1%, compared with 67.4% for Nuc-PLoc, which combines the amphiphilic pseudo amino acid composition of a protein with its position-specific scoring matrix. The experimental results show that the set of informative GO terms is an effective feature set for protein subnuclear localization. A prediction server based on PGAC has been implemented at http://iclab.life.nctu.edu.tw/prolocgac.
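
The abstract describes combining GO-term features with amino acid composition before training a classifier, but it does not specify the encoding. As a rough illustration only, the sketch below assumes a binary presence/absence encoding for each informative GO term, concatenated with the 20 relative amino acid frequencies. The function names and the three GO IDs are hypothetical placeholders, not the 55 terms selected by PGAC, and the BLAST lookup, GO-term selection, and SVM training steps are not shown.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so feature positions are stable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the relative frequency of each standard amino acid (20 values)."""
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def go_indicator_vector(protein_go_terms, informative_go_terms):
    """Assumed encoding: 1.0 if the protein (or its homolog) carries the term, else 0.0."""
    annotated = set(protein_go_terms)
    return [1.0 if term in annotated else 0.0 for term in informative_go_terms]


def build_feature_vector(sequence, protein_go_terms, informative_go_terms):
    """Concatenate GO indicators with amino acid composition into one feature vector."""
    return (go_indicator_vector(protein_go_terms, informative_go_terms)
            + amino_acid_composition(sequence))


# Hypothetical example: placeholder GO IDs and a toy sequence.
informative_terms = ["GO:0005730", "GO:0016607", "GO:0005654"]
features = build_feature_vector("ACDA", ["GO:0005730"], informative_terms)
print(len(features))   # 3 GO indicators + 20 composition values
print(features[:4])    # GO indicators followed by the frequency of 'A'
```

With the paper's setup, the GO part of the vector would have 55 entries, and the resulting vectors would be passed to a support vector machine; both of those steps are outside this sketch.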
