Identification of protein functions using a machine-learning approach based on sequence-derived properties.

Authors

Lee Bum Ju, Shin Moon Sun, Oh Young Joon, Oh Hae Seok, Ryu Keun Ho

Affiliation

Industrial Research Center, Jungwon University, Chungbuk, Republic of Korea.

Publication information

Proteome Sci. 2009 Aug 9;7:27. doi: 10.1186/1477-5956-7-27.

DOI:10.1186/1477-5956-7-27

PMID:19664241

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2731080/

Abstract

BACKGROUND

Predicting the function of an unknown protein is an essential goal in bioinformatics. Sequence similarity-based approaches are widely used for function prediction; however, they are often inadequate in the absence of similar sequences or when the sequence similarity among known protein sequences is statistically weak. This study aimed to develop an accurate prediction method for identifying protein function, irrespective of sequence and structural similarities.

RESULTS

A highly accurate prediction method capable of identifying protein function, based solely on protein sequence properties, is described. This method analyses and identifies specific features of the protein sequence that are highly correlated with certain protein functions and determines the combination of protein sequence features that best characterises protein function. Thirty-three features that represent subtle differences in local regions and full regions of the protein sequences were introduced. On the basis of 484 features extracted solely from the protein sequence, models were built to predict the functions of 11 different proteins from a broad range of cellular components, molecular functions, and biological processes. The accuracy of protein function prediction using random forests with feature selection ranged from 94.23% to 100%. The local sequence information was found to have a broad range of applicability in predicting protein function.
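The general workflow described above (sequence-derived features, feature selection, random forest classification) can be illustrated with a minimal sketch using scikit-learn. Everything specific here is an assumption made for illustration: the feature set (20 amino acid fractions), the synthetic toy sequences, and the helper names `composition` and `synthetic_sequence`. It does not reproduce the 484 features, the datasets, or the reported accuracies of the study.

```python
import random

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Fraction of each of the 20 standard amino acids in seq."""
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]


def synthetic_sequence(rng, enriched, length=120):
    """Toy sequence in which residues in `enriched` are drawn 3x as often."""
    weights = [3.0 if aa in enriched else 1.0 for aa in AMINO_ACIDS]
    return "".join(rng.choices(AMINO_ACIDS, weights=weights, k=length))


rng = random.Random(0)
# Toy classes (an assumption, not real data): class 1 is enriched in K/R,
# class 0 is enriched in D/E.
sequences = [synthetic_sequence(rng, "KR") for _ in range(60)]
sequences += [synthetic_sequence(rng, "DE") for _ in range(60)]
y = np.array([1] * 60 + [0] * 60)
X = np.array([composition(s) for s in sequences])

# Step 1: keep features whose random-forest importance is at least the mean.
# Step 2: train a random forest on the selected features only.
model = make_pipeline(
    SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0)),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
scores = cross_val_score(model, X, y, cv=5)
mean_accuracy = scores.mean()

model.fit(X, y)
selected = [aa for aa, keep in zip(AMINO_ACIDS, model[0].get_support()) if keep]
print(f"mean CV accuracy: {mean_accuracy:.3f}; selected features: {selected}")
```

Placing the selector inside the pipeline means feature selection is refit on each training fold, so the cross-validated accuracy is not inflated by selecting features on held-out data.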

CONCLUSION

We present an accurate prediction method using a machine-learning approach based solely on protein sequence properties. The primary contribution of this paper is to propose new PNPRD features representing global and/or local differences in sequences, based on positively and/or negatively charged residues, to assist in predicting protein function. In addition, we identified a compact and useful feature subset for predicting the function of various proteins. Our results indicate that sequence-based classifiers can provide good results among a broad range of proteins, that the proposed features are useful in predicting several functions, and that the combination of our and traditional features may support the creation of a discriminative feature set for specific protein functions.
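The abstract does not define the PNPRD features, so the following is only a hypothetical sketch of the general idea of charge-based features computed over the full sequence and over local regions. The residue sets, the choice of N- and C-terminal windows, the window length, and the function names are all assumptions for illustration, not the authors' definition.

```python
POSITIVE = set("KR")  # assumption: histidine is not counted as positively charged
NEGATIVE = set("DE")


def charge_difference(segment):
    """(positive count - negative count) / segment length; 0.0 if empty."""
    if not segment:
        return 0.0
    pos = sum(1 for aa in segment if aa in POSITIVE)
    neg = sum(1 for aa in segment if aa in NEGATIVE)
    return (pos - neg) / len(segment)


def charge_features(seq, local_len=10):
    """Charge difference over the full sequence and two terminal windows.

    local_len must be a positive integer (hypothetical window size).
    """
    seq = seq.upper()
    return {
        "global": charge_difference(seq),
        "n_term": charge_difference(seq[:local_len]),
        "c_term": charge_difference(seq[-local_len:]),
    }


# Made-up 20-residue example: basic N-terminus, acidic middle, neutral C-terminus.
features = charge_features("MKKRLLAAGGDEDEEDSTLV", local_len=5)
print(features)
```

In this made-up example the global value is negative while the N-terminal window is strongly positive, which shows how a local feature can capture a difference that the full-sequence value hides.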

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9bb6/2731080/1e7580bb62ad/1477-5956-7-27-1.jpg

Similar articles

Protein disorder prediction by condensed PSSM considering propensity for order or disorder.

BMC Bioinformatics. 2006 Jun 23;7:319. doi: 10.1186/1471-2105-7-319.

SCPRED: accurate prediction of protein structural class for sequences of twilight-zone similarity with predicting sequences.

BMC Bioinformatics. 2008 May 1;9:226. doi: 10.1186/1471-2105-9-226.

APIS: accurate prediction of hot spots in protein interfaces by combining protrusion index with solvent accessibility.

BMC Bioinformatics. 2010 Apr 8;11:174. doi: 10.1186/1471-2105-11-174.

Capreomycin resistance prediction in two species of Mycobacterium using a stacked ensemble method.

J Appl Microbiol. 2019 Dec;127(6):1656-1664. doi: 10.1111/jam.14413. Epub 2019 Sep 8.

Computational identification of ubiquitylation sites from protein sequences.

BMC Bioinformatics. 2008 Jul 15;9:310. doi: 10.1186/1471-2105-9-310.

FrankSum: new feature selection method for protein function prediction.

Int J Neural Syst. 2005 Aug;15(4):259-75. doi: 10.1142/S0129065705000281.

Modular prediction of protein structural classes from sequences of twilight-zone identity with predicting sequences.

BMC Bioinformatics. 2009 Dec 13;10:414. doi: 10.1186/1471-2105-10-414.

Prediction of Protein Subcellular Localization Based on Fusion of Multi-view Features.

Molecules. 2019 Mar 6;24(5):919. doi: 10.3390/molecules24050919.

Cited by

Synthetic Biology Strategies and Tools to Modulate Photosynthesis in Microbes.

Int J Mol Sci. 2025 Mar 28;26(7):3116. doi: 10.3390/ijms26073116.

The Novel Role of Tyrosinase Enzymes in the Storage of Globally Significant Amounts of Carbon in Wetland Ecosystems.

Environ Sci Technol. 2022 Sep 6;56(17):11952-11968. doi: 10.1021/acs.est.2c03770. Epub 2022 Aug 9.

Mitotic chromosome binding predicts transcription factor properties in interphase.

Nat Commun. 2019 Jan 30;10(1):487. doi: 10.1038/s41467-019-08417-5.

Predicting human protein function with multi-task deep neural networks.

PLoS One. 2018 Jun 11;13(6):e0198216. doi: 10.1371/journal.pone.0198216. eCollection 2018.

Consistent prediction of GO protein localization.

Sci Rep. 2018 May 17;8(1):7757. doi: 10.1038/s41598-018-26041-z.

GRAFENE: Graphlet-based alignment-free network approach integrates 3D structural and sequence (residue order) data to improve protein structural comparison.

Sci Rep. 2017 Nov 2;7(1):14890. doi: 10.1038/s41598-017-14411-y.

A Meta-Analysis Based Method for Prioritizing Candidate Genes Involved in a Pre-specific Function.

Front Plant Sci. 2016 Dec 15;7:1914. doi: 10.3389/fpls.2016.01914. eCollection 2016.

A Factor Graph Approach to Automated GO Annotation.

PLoS One. 2016 Jan 15;11(1):e0146986. doi: 10.1371/journal.pone.0146986. eCollection 2016.

A survey of computational intelligence techniques in protein function prediction.

Int J Proteomics. 2014;2014:845479. doi: 10.1155/2014/845479. Epub 2014 Dec 11.

Prediction of detailed enzyme functions and identification of specificity determining residues by random forests.

PLoS One. 2014 Jan 8;9(1):e84623. doi: 10.1371/journal.pone.0084623. eCollection 2014.
