Ranking Gene Ontology terms for predicting non-classical secretory proteins in eukaryotes and prokaryotes.

Affiliation

Department of Management Information System, Asia Pacific Institute of Creativity, No. 110 XueFu Rd., Tou Fen, Miaoli, Taiwan, ROC.

Publication information

J Theor Biol. 2012 Nov 7;312:105-13. doi: 10.1016/j.jtbi.2012.07.027. Epub 2012 Aug 8.

Abstract

Protein secretion is an important biological process in both eukaryotes and prokaryotes. Existing sequence-based methods rely mainly on combining several types of complementary features to build accurate classifiers for predicting non-classical secretory proteins. Gene Ontology (GO) terms are increasingly informative for predicting protein functions. However, the number of GO terms used is often very large. For example, the subcellular localization predictor Euk-mPLoc 2.0 uses 60,020 GO terms. This study proposes a novel approach that identifies a small set of m top-ranked GO terms to serve as the only type of input feature for Sec-GO, a support vector machine (SVM) based method for predicting non-classical secretory proteins in both eukaryotes and prokaryotes. To evaluate Sec-GO, two existing methods and the datasets they used are adopted for performance comparison. Using m=436 GO terms, Sec-GO yields an independent test accuracy of 96.7% on mammalian proteins, much better than the existing method SPRED (82.2%), which uses frequencies of tri-peptides and short peptides, secondary structure, and physicochemical properties as input features of a random forest classifier. Furthermore, when applied to Gram-positive bacterial proteins, Sec-GO with m=158 GO terms has a test accuracy of 94.5%, superior to NClassG+ (90.0%), which uses an SVM with several feature types: amino acid composition, di-peptides, physicochemical properties, and the position-specific weighting matrix. Analysis of the distribution of secretory proteins in a GO database indicates that the percentage of non-classical secretory proteins annotated by GO is larger than that of classical secretory proteins in both eukaryotes and prokaryotes. Of the m top-ranked GO features, the top four GO terms are all annotated with subcellular locations such as GO:0005576 (extracellular region). Additionally, Sec-GO is easy to implement, and its web prediction tool is available at iclab.life.nctu.edu.tw/secgo.
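
The idea of using a small set of top-ranked GO terms as the only input features can be illustrated with a minimal sketch. The abstract does not state how GO terms are ranked, so the scoring rule below (absolute difference in annotation frequency between the two classes) is purely an assumption for illustration, as are the function names and the made-up toy annotations. The resulting binary vectors are the kind of input one could pass to any SVM implementation; the classifier itself is left out here.

```python
from collections import Counter


def rank_go_terms(annotations, labels):
    """Rank GO terms from most to least discriminative.

    Assumed scoring rule (NOT taken from the paper): absolute difference
    between the fraction of secretory proteins (label 1) and the fraction
    of non-secretory proteins (label 0) annotated with the term.
    """
    pos, neg = Counter(), Counter()
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    for terms, y in zip(annotations, labels):
        # Count each term at most once per protein.
        (pos if y == 1 else neg).update(set(terms))

    def score(term):
        return abs(pos[term] / max(n_pos, 1) - neg[term] / max(n_neg, 1))

    # Sort by descending score; break ties by GO id so the order is stable.
    return sorted(set(pos) | set(neg), key=lambda t: (-score(t), t))


def to_feature_vector(terms, top_terms):
    """Encode one protein as a 0/1 vector over the m selected GO terms."""
    present = set(terms)
    return [1 if t in present else 0 for t in top_terms]


# Hypothetical toy data: GO annotations per protein and class labels.
annotations = [
    ["GO:0005576", "GO:0005615"],  # secretory
    ["GO:0005576"],                # secretory
    ["GO:0005634", "GO:0005737"],  # non-secretory
    ["GO:0005737"],                # non-secretory
]
labels = [1, 1, 0, 0]

m = 2  # number of top-ranked GO terms to keep
top_terms = rank_go_terms(annotations, labels)[:m]
features = [to_feature_vector(a, top_terms) for a in annotations]
print(top_terms)
print(features)
```

In this sketch, m plays the same role as in the abstract (m=436 for mammalian proteins, m=158 for Gram-positive bacterial proteins): it controls how many ranked GO terms become feature dimensions.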
