Lee KiYoung, Kim Dae-Won, Na DoKyun, Lee Kwang H, Lee Doheon
Department of BioSystems, KAIST, Daejeon City, Republic of Korea.
Nucleic Acids Res. 2006;34(17):4655-66. doi: 10.1093/nar/gkl638. Epub 2006 Sep 11.
Subcellular localization is one of the key functional characteristics of proteins. Large-scale genome analysis requires an automatic and efficient method for predicting protein subcellular localization. From a machine learning point of view, a protein localization dataset has several difficult characteristics: it has many classes (there are more than 10 localizations in a cell), it is multi-label (a protein may occur in several different subcellular locations), and it is highly imbalanced (the number of proteins differs markedly between localizations). Although many previous studies have addressed the prediction of protein subcellular localization, none of them effectively tackles all of these characteristics at the same time. A new computational method for protein localization is therefore needed for more reliable outcomes. To address this issue, we present a protein localization predictor based on D-SVDD (PLPD), which can find the likelihood of a specific localization of a protein more easily and more correctly. Moreover, we introduce three measurements for more precise evaluation of a protein localization predictor. Results on various datasets built from the experiments of Huh et al. (2003) suggest that the proposed PLPD method represents a different approach that might play a complementary role to existing methods, such as the Nearest Neighbor method and the discriminate covariant method. Finally, after finding a good boundary for each localization using the 5184 classified proteins as training data, we predicted the localizations of 138 proteins whose subcellular localizations could not be clearly observed in the experiments of Huh et al. (2003).
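The idea of fitting one boundary per localization can be illustrated with a deliberately simplified sketch. It is not D-SVDD and not the PLPD implementation: it assumes a plain centroid-and-radius hypersphere per class, and the feature vectors, labels, and function names (`fit_boundaries`, `rank_localizations`) are hypothetical. It shows only how per-class boundaries can cope with multi-label and imbalanced data: each class is described from its own members alone, and a protein with several labels contributes to every class it belongs to.

```python
import math
from collections import defaultdict

def fit_boundaries(samples):
    """Fit one (center, radius) hypersphere per label.

    samples: list of (feature_vector, set_of_labels).
    Simplifying assumption: center = mean of members, radius = farthest member.
    """
    by_label = defaultdict(list)
    for x, labels in samples:
        for lab in labels:  # a multi-label protein is added to each of its classes
            by_label[lab].append(x)
    boundaries = {}
    for lab, xs in by_label.items():
        dim = len(xs[0])
        center = [sum(x[i] for x in xs) / len(xs) for i in range(dim)]
        radius = max(math.dist(x, center) for x in xs)
        boundaries[lab] = (center, radius)
    return boundaries

def rank_localizations(boundaries, x):
    """Rank labels by distance to each center, normalized by that class's radius."""
    scores = {}
    for lab, (center, radius) in boundaries.items():
        d = math.dist(x, center)
        if radius > 0:
            scores[lab] = d / radius
        else:  # single-point class: only an exact match scores as inside
            scores[lab] = 0.0 if d == 0 else math.inf
    return sorted(scores, key=scores.get)  # smaller score = more likely

# Toy, made-up 2-D features; real features would be derived from the proteins.
train = [
    ([0.0, 0.0], {"nucleus"}),
    ([0.2, 0.1], {"nucleus"}),
    ([0.1, 0.3], {"nucleus", "cytoplasm"}),  # multi-label example
    ([1.0, 1.0], {"cytoplasm"}),
    ([5.0, 5.0], {"peroxisome"}),            # small class
    ([5.2, 4.9], {"peroxisome"}),
]
boundaries = fit_boundaries(train)
ranking = rank_localizations(boundaries, [5.1, 5.0])
```

Because each boundary is fitted independently, a small class is not outvoted by larger ones, and the ranked list naturally returns more than one candidate localization per protein.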