pLoc_bal-mPlant: Predict Subcellular Localization of Plant Proteins by General PseAAC and Balancing Training Dataset.

Affiliations

Computer Department, Jingdezhen Ceramic Institute, Jingdezhen, China.

The Gordon Life Science Institute, Boston, MA 02478, United States.

Publication information

Curr Pharm Des. 2018;24(34):4013-4022. doi: 10.2174/1381612824666181119145030.

DOI:10.2174/1381612824666181119145030

PMID:30451108

Abstract

Knowledge of protein subcellular localization is vitally important for both basic research and drug development. With the avalanche of protein sequences emerging in the post-genomic age, it is highly desirable to develop computational tools that identify their subcellular localization in a timely and effective manner from sequence information alone. Recently, a predictor called "pLoc-mPlant" was developed for identifying the subcellular localization of plant proteins. Its performance is overwhelmingly better than that of the other predictors for the same purpose, particularly in dealing with multi-label systems in which some proteins, called "multiplex proteins", may simultaneously occur in two or more subcellular locations. Although it is a very powerful predictor, further effort is needed to improve it, because pLoc-mPlant was trained on an extremely skewed dataset in which some subsets (i.e., the protein numbers for some subcellular locations) were more than 10 times larger than the others. Accordingly, it cannot avoid the bias caused by such an uneven training dataset. To overcome this bias, we have developed a new and bias-free predictor called pLoc_bal-mPlant by balancing the training dataset. Cross-validation tests on exactly the same experiment-confirmed dataset have indicated that the proposed new predictor is remarkably superior to pLoc-mPlant, the existing state-of-the-art predictor for identifying the subcellular localization of plant proteins. To maximize the convenience for the majority of experimental scientists, a user-friendly web-server for the new predictor has been established at http://www.jci-bioinfo.cn/pLoc_bal-mPlant/, by which users can easily get their desired results without the need to go through the detailed mathematics.
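The abstract does not say how the training dataset was balanced. As a rough illustration of the general idea only, the sketch below uses plain random oversampling, which is a stand-in technique and not necessarily the procedure used for pLoc_bal-mPlant. The function name `oversample_to_balance`, the toy protein IDs, and the convention that a multiplex protein contributes one (ID, location) pair per location are all assumptions made for this example.

```python
import random
from collections import Counter


def oversample_to_balance(samples, seed=0):
    """Balance location subsets by random oversampling (illustrative only).

    samples: list of (protein_id, location) pairs. A multiplex protein is
    assumed to appear once for each of its locations.
    """
    rng = random.Random(seed)

    # Group the pairs by subcellular location.
    by_location = {}
    for item in samples:
        by_location.setdefault(item[1], []).append(item)

    # Every subset is grown to the size of the largest one.
    target = max(len(group) for group in by_location.values())

    balanced = []
    for group in by_location.values():
        balanced.extend(group)
        # Draw with replacement until this subset matches the target size.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


# Hypothetical toy data: one subset is more than 10 times larger than the other.
toy = [(f"P{i}", "chloroplast") for i in range(22)]
toy += [(f"Q{i}", "peroxisome") for i in range(2)]

balanced = oversample_to_balance(toy)
counts = Counter(location for _, location in balanced)
print(counts)
```

After balancing, both toy subsets contain the same number of entries, so a classifier trained on them no longer sees one location far more often than the other. Duplicating samples is the simplest option; methods that synthesize new samples in feature space are a common alternative.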
