Chou Kuo-Chen, Cheng Xiang, Xiao Xuan
Gordon Life Science Institute, Boston, MA 02478, United States.
Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China.
Med Chem. 2019;15(5):472-485. doi: 10.2174/1573406415666181218102517.
BACKGROUND/OBJECTIVE: Information on protein subcellular localization is crucially important for both basic research and drug development. With the explosive growth of protein sequences discovered in the post-genomic age, there is a strong demand for powerful bioinformatics tools that can identify subcellular localization quickly and effectively from sequence information alone. Recently, a predictor called "pLoc-mEuk" was developed for identifying the subcellular localization of eukaryotic proteins. Its performance is overwhelmingly better than that of other predictors for the same purpose, particularly in dealing with multi-label systems, where many proteins, called "multiplex proteins", may simultaneously occur in two or more subcellular locations. Although it is a very powerful predictor, further effort is needed to improve it, because pLoc-mEuk was trained on an extremely skewed dataset in which some subsets were about 200 times the size of others. Accordingly, it cannot avoid the bias caused by such an uneven training dataset.
To alleviate this bias, we have developed a new predictor called pLoc_bal-mEuk by quasi-balancing the training dataset. Cross-validation tests on exactly the same experiment-confirmed dataset indicate that the proposed predictor is remarkably superior to pLoc-mEuk, the existing state-of-the-art predictor for identifying the subcellular localization of eukaryotic proteins. It has not escaped our notice that the quasi-balancing treatment can also be applied to many other biological systems.
For the convenience of most experimental scientists, a user-friendly web server for the new predictor has been established at http://www.jci-bioinfo.cn/pLoc_bal-mEuk/.
It is anticipated that the pLoc_bal-mEuk predictor holds very high potential to become a useful high-throughput tool for identifying the subcellular localization of eukaryotic proteins, particularly for finding multi-target drugs, which is currently a very active trend in drug development.