Lee Bum Ju, Shin Moon Sun, Oh Young Joon, Oh Hae Seok, Ryu Keun Ho
Industrial Research Center, Jungwon University, Chungbuk, Republic of Korea.
Proteome Sci. 2009 Aug 9;7:27. doi: 10.1186/1477-5956-7-27.
Predicting the function of an unknown protein is an essential goal in bioinformatics. Sequence similarity-based approaches are widely used for function prediction; however, they are often inadequate in the absence of similar sequences or when the sequence similarity among known protein sequences is statistically weak. This study aimed to develop an accurate prediction method for identifying protein function, irrespective of sequence and structural similarities.
A highly accurate prediction method capable of identifying protein function, based solely on protein sequence properties, is described. This method analyses and identifies specific features of the protein sequence that are highly correlated with certain protein functions and determines the combination of protein sequence features that best characterises protein function. Thirty-three features that represent subtle differences in local regions and full regions of the protein sequences were introduced. On the basis of 484 features extracted solely from the protein sequence, models were built to predict the functions of 11 different proteins from a broad range of cellular components, molecular functions, and biological processes. The accuracy of protein function prediction using random forests with feature selection ranged from 94.23% to 100%. The local sequence information was found to have a broad range of applicability in predicting protein function.
We present an accurate prediction method that uses a machine-learning approach based solely on protein sequence properties. The primary contribution of this paper is a new set of PNPRD features that represent global and/or local differences in sequences, based on positively and/or negatively charged residues, to assist in predicting protein function. In addition, we identified a compact and useful feature subset for predicting the function of various proteins. Our results indicate that sequence-based classifiers can provide good results across a broad range of proteins, that the proposed features are useful in predicting several functions, and that combining our features with traditional ones may support the creation of a discriminative feature set for specific protein functions.
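To illustrate what sequence-derived features over full and local regions might look like, the following is a minimal sketch only. It is not the authors' PNPRD definition, which the abstract does not specify. The function names (`charge_fractions`, `charge_features`), the choice of K/R as positive and D/E as negative residues, and the split into three equal-length local regions are all assumptions made for illustration.

```python
from typing import Dict

# Assumption: K and R count as positively charged, D and E as negatively
# charged (histidine is left out). The paper's exact residue classes and
# region definitions are not given in the abstract.
POSITIVE = set("KR")
NEGATIVE = set("DE")


def charge_fractions(segment: str) -> Dict[str, float]:
    """Return the positive fraction, negative fraction, and their difference."""
    n = len(segment)
    if n == 0:
        return {"pos": 0.0, "neg": 0.0, "net": 0.0}
    pos = sum(1 for aa in segment if aa in POSITIVE) / n
    neg = sum(1 for aa in segment if aa in NEGATIVE) / n
    return {"pos": pos, "neg": neg, "net": pos - neg}


def charge_features(sequence: str, n_windows: int = 3) -> Dict[str, float]:
    """Compute charge features for the full sequence and for local regions.

    The sequence is cut into `n_windows` contiguous regions of near-equal
    length (a hypothetical choice), and the same three statistics are
    computed for each region as for the whole sequence.
    """
    seq = sequence.upper()
    features = {f"full_{k}": v for k, v in charge_fractions(seq).items()}
    # Integer boundaries of the local regions, e.g. [0, 4, 8, 12] for length 12.
    bounds = [i * len(seq) // n_windows for i in range(n_windows + 1)]
    for i in range(n_windows):
        segment = seq[bounds[i]:bounds[i + 1]]
        for k, v in charge_fractions(segment).items():
            features[f"local{i}_{k}"] = v
    return features
```

For a toy sequence such as `"KKKKAAAADDDD"`, the full-sequence net charge fraction is zero, while the first and last local regions have net fractions of +1 and -1. This is the kind of difference that a global feature alone would miss. A feature dictionary like this could then be passed to any classifier, for example a random forest as used in the study.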