Tung Thai Quang, Lee Doheon
Department of Bio & Brain Engineering, KAIST, Daejeon City, Republic of Korea.
BMC Bioinformatics. 2009 Jan 30;10 Suppl 1(Suppl 1):S43. doi: 10.1186/1471-2105-10-S1-S43.
Protein subcellular localization is crucial information for elucidating protein function. Owing to the need for large-scale genome analysis, computational methods that efficiently predict protein subcellular localization are in high demand. Although much previous work has addressed this task, the problem remains challenging for several reasons: the number of subcellular locations in practice is large; the distribution of proteins across locations is imbalanced, that is, the number of proteins in each location differs markedly; and many proteins are located in multiple locations. It is therefore necessary to explore new features and appropriate classification methods to improve prediction performance.
In this paper we propose a new prediction method that combines two key ideas: 1) information about neighbouring proteins in a probabilistic gene network is integrated to enrich the prediction features; 2) fuzzy k-NN, a classification method based on fuzzy set theory, is applied to predict proteins located in multiple sites. Experiments were conducted on a dataset of budding yeast proteins covering 22 locations, and a significant improvement was observed.
Our results suggest that neighbourhood information from functional gene networks is predictive of subcellular localization. The proposed method can thus be integrated with, and is complementary to, other available prediction methods.
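The two ideas in the method, neighbourhood features from a weighted gene network and fuzzy k-NN for multi-location proteins, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy network, the feature definition (weighted fraction of annotated neighbours per location), the equal split of membership across a protein's locations, and the 0.3 decision threshold are all assumptions made for illustration. The membership formula follows the standard fuzzy k-NN formulation, with inverse-distance weights controlled by a fuzzifier m.

```python
import math

def neighbour_features(protein, network, known_labels, n_locations):
    """Assumed feature: weighted fraction of annotated network neighbours
    found in each location. `network` maps protein -> {neighbour: edge weight}."""
    feats = [0.0] * n_locations
    total = 0.0
    for other, weight in network.get(protein, {}).items():
        labels = known_labels.get(other)
        if labels is None:  # skip neighbours without a known location
            continue
        total += weight
        for loc in labels:
            feats[loc] += weight
    return [f / total for f in feats] if total > 0 else feats

def crisp_to_membership(labels, n_locations):
    """Assumption: a protein's membership is split equally over its locations."""
    return [1.0 / len(labels) if loc in labels else 0.0
            for loc in range(n_locations)]

def fuzzy_knn(query, train_x, train_u, k=3, m=2.0, eps=1e-9):
    """Fuzzy k-NN: class memberships of `query` as an inverse-distance-weighted
    average of the memberships of its k nearest training points (m > 1)."""
    nearest = sorted((math.dist(query, x), i) for i, x in enumerate(train_x))[:k]
    weights = [(1.0 / (d ** (2.0 / (m - 1.0)) + eps), i) for d, i in nearest]
    denom = sum(w for w, _ in weights)
    n_classes = len(train_u[0])
    return [sum(w * train_u[i][c] for w, i in weights) / denom
            for c in range(n_classes)]

def predict_locations(memberships, threshold=0.3):
    """Hypothetical decision rule: report every location whose membership
    reaches the threshold, falling back to the single best location."""
    chosen = [c for c, u in enumerate(memberships) if u >= threshold]
    return chosen or [max(range(len(memberships)), key=memberships.__getitem__)]

# Toy data (hypothetical): two locations, protein "A" is unannotated.
network = {"A": {"B": 0.75, "C": 0.25}}
known_labels = {"B": {0}, "C": {0, 1}}
query = neighbour_features("A", network, known_labels, 2)  # [1.0, 0.25]

# Hand-made training set, assumed to have been built the same way.
train_x = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
train_u = [crisp_to_membership(s, 2) for s in ({0}, {0}, {1}, {0, 1})]

memberships = fuzzy_knn(query, train_x, train_u, k=3)
print(query, memberships, predict_locations(memberships))
```

Because the classifier returns a graded membership for every location rather than a single label, a protein can be assigned to several locations at once, which is what makes the fuzzy formulation a natural fit for multi-site proteins.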