Hum-PLoc: a novel ensemble classifier for predicting human protein subcellular localization.

Author information

Chou Kuo-Chen, Shen Hong-Bin

Affiliation

Gordon Life Science Institute, San Diego, CA 92130, USA.

Publication information

Biochem Biophys Res Commun. 2006 Aug 18;347(1):150-7. doi: 10.1016/j.bbrc.2006.06.059. Epub 2006 Jun 21.

Abstract

Predicting the subcellular localization of human proteins is a challenging problem, especially when query proteins have no significant homology to proteins of known subcellular location and when more locations need to be covered. To tackle this challenge, protein samples are represented by hybridizing the gene ontology (GO) database with amphiphilic pseudo amino acid composition (PseAA). Based on this representation, a novel ensemble classifier called "Hum-PLoc" was developed by fusing many basic individual classifiers through a voting system. The "engine" of these basic classifiers is the K-nearest neighbor (KNN) rule. As a demonstration, the ensemble classifier was tested on human proteins among the following 12 locations: (1) centriole; (2) cytoplasm; (3) cytoskeleton; (4) endoplasmic reticulum; (5) extracell; (6) Golgi apparatus; (7) lysosome; (8) microsome; (9) mitochondrion; (10) nucleus; (11) peroxisome; (12) plasma membrane. To remove redundancy and homology bias, none of the proteins investigated had ≥ 25% sequence identity to any other protein in the same subcellular location. The overall success rates obtained via the jackknife cross-validation test and the independent dataset test were 81.1% and 85.0%, respectively, which are more than 50% higher than those obtained by other existing methods on the same stringent datasets. Furthermore, an incisive analysis is given to show that the high success rate of the new predictor is by no means due to a trivial use of GO annotations: for proteins annotated "subcellular location unknown" in the Swiss-Prot database, most (more than 99%) of their corresponding GO numbers in the GO database are also annotated "cellular component unknown". The information and clues for predicting subcellular locations are actually buried in a series of tedious GO numbers, just as they are buried in a pile of complicated amino acid sequences, although in a different manner and at a different "depth". Digging out this location knowledge requires a sophisticated operation engine, and the current predictor is one such engine that has proved to be very powerful. The Hum-PLoc classifier is available as a web server at http://202.120.37.186/bioinf/hum.
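
The two mechanisms named in the abstract, fusing KNN classifiers by voting and scoring them with a jackknife (leave-one-out) test, can be illustrated with a minimal sketch. This is not the Hum-PLoc implementation. The abstract does not say how the basic classifiers differ from one another, so the sketch assumes they differ only in K. The two-dimensional feature vectors, the Euclidean distance, the function names, and the toy samples are all hypothetical stand-ins for the GO/PseAA representation.

```python
from collections import Counter
import math

# Hypothetical toy data: (feature_vector, location_label).
# Real inputs would be GO/PseAA vectors; these 2-D points are placeholders.
TOY_SAMPLES = [
    ((0.0, 0.1), "nucleus"), ((0.1, 0.0), "nucleus"),
    ((0.2, 0.1), "nucleus"), ((0.1, 0.2), "nucleus"),
    ((1.0, 0.9), "cytoplasm"), ((0.9, 1.0), "cytoplasm"),
    ((1.1, 1.0), "cytoplasm"), ((1.0, 1.1), "cytoplasm"),
]


def knn_predict(train, query, k):
    """One basic classifier: plurality label among the k nearest samples."""
    # Euclidean distance is an assumption of this sketch.
    neighbors = sorted(train, key=lambda s: math.dist(s[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


def ensemble_predict(train, query, ks=(1, 3, 5)):
    """Fuse the basic classifiers (one per assumed k) by plurality vote."""
    votes = Counter(knn_predict(train, query, k) for k in ks)
    return votes.most_common(1)[0][0]


def jackknife_success_rate(samples, ks=(1, 3, 5)):
    """Leave-one-out test: predict each sample from all the others."""
    correct = 0
    for i, (vec, label) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]  # hold out sample i
        if ensemble_predict(rest, vec, ks) == label:
            correct += 1
    return correct / len(samples)


if __name__ == "__main__":
    print(ensemble_predict(TOY_SAMPLES, (0.05, 0.05)))
    print(jackknife_success_rate(TOY_SAMPLES))
```

In the jackknife loop each protein is held out once and predicted from the remaining ones, so no sample ever votes for itself. An independent dataset test would instead call `ensemble_predict` with a fixed training set on proteins that were never part of it.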
