Institute of System Biology, Shanghai University, 99 ShangDa Road, 200244 Shanghai, China.
Mol Divers. 2010 Aug;14(3):551-8. doi: 10.1007/s11030-009-9182-4. Epub 2009 Aug 7.
A protein's subcellular location, which indicates where the protein resides in a cell, is one of its important characteristics. Correctly assigning proteins to their subcellular locations would greatly help protein function prediction, genome annotation, and drug design. Yet, despite great technical advances in the past decades, determining protein subcellular locations experimentally on a high-throughput scale remains time-consuming and laborious. Hence, four integrated-algorithm methods were developed in this article to perform such high-throughput prediction. Two data sets taken from the literature (Chou and Elrod, Protein Eng 12:107-118, 1999), consisting of 2,391 and 2,598 proteins, were used as the training set and the test set, respectively. Protein sequences were represented by their amino acid composition. Performance on the training set was assessed by jackknife cross-validation. The final best integrated-algorithm predictor was constructed by integrating 10 algorithms in Weka (a software tool for data mining tasks, http://www.cs.waikato.ac.nz/ml/weka/ ) based on the mRMR (Minimum Redundancy Maximum Relevance, http://research.janelia.org/peng/proj/mRMR/ ) method. It achieves correct prediction rates of 77.83% and 80.56% on the training set and test set, respectively, which is better than all 60 algorithms collected in Weka. The prediction software is available upon request.
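As an illustration of two steps named in the abstract, the following minimal Python sketch computes an amino acid composition vector and runs a jackknife (leave-one-out) test. It is not the authors' software: the function names are hypothetical, the 1-nearest-neighbour classifier is only a simple stand-in for the Weka algorithms, and the toy sequences and labels are invented for demonstration.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def amino_acid_composition(sequence):
    """Return a 20-dimensional vector of residue frequencies."""
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard residues")
    return [counts[aa] / total for aa in AMINO_ACIDS]

def nearest_label(query, vectors, labels):
    """Stand-in classifier (assumption): 1-nearest neighbour, squared Euclidean distance."""
    best = min(
        range(len(vectors)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(query, vectors[i])),
    )
    return labels[best]

def jackknife_accuracy(vectors, labels):
    """Hold out each sample in turn and predict it from all the others."""
    correct = 0
    for i in range(len(vectors)):
        rest_vectors = vectors[:i] + vectors[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        if nearest_label(vectors[i], rest_vectors, rest_labels) == labels[i]:
            correct += 1
    return correct / len(vectors)

# Toy data, invented for illustration only.
toy_sequences = ["AAAA", "AAAC", "WWWW", "WWWY"]
toy_labels = ["x", "x", "y", "y"]
toy_vectors = [amino_acid_composition(s) for s in toy_sequences]
toy_accuracy = jackknife_accuracy(toy_vectors, toy_labels)
```

Because each protein is held out exactly once, the jackknife result does not depend on a random split, at the cost of one training run per sample.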