Computer Department, Jingdezhen Ceramic Institute, Jingdezhen, China; The Gordon Life Science Institute, Boston, MA 02478, USA.
Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China; The Gordon Life Science Institute, Boston, MA 02478, USA.
J Theor Biol. 2018 Dec 7;458:92-102. doi: 10.1016/j.jtbi.2018.09.005. Epub 2018 Sep 8.
Determining the subcellular localization of proteins from different organisms is one of the most active topics in molecular cell biology, because it is crucially important for both basic research and drug development. Recently, a predictor called "pLoc-mGneg" was developed for identifying the subcellular localization of Gram-negative bacterial proteins. It performs markedly better than other predictors built for the same purpose, particularly on multi-label systems in which some proteins, called "multiplex proteins", occur simultaneously in two or more subcellular locations. Powerful as it is, pLoc-mGneg leaves room for improvement: it was trained on an extremely skewed dataset in which some subsets (subcellular locations) were about 5 to 70 times the size of others, so its predictions are inevitably biased by the uneven training data. To reduce this bias, we have developed a new predictor, pLoc_bal-mGneg, by quasi-balancing the training dataset. Cross-validation tests on exactly the same experiment-confirmed dataset indicate that the new predictor is remarkably superior to pLoc-mGneg, the existing state-of-the-art predictor for identifying the subcellular localization of Gram-negative bacterial proteins. For the convenience of experimental scientists, a user-friendly web server for the new predictor has been established at http://www.jci-bioinfo.cn/pLoc_bal-mGneg/, through which users can obtain their desired results without having to work through the detailed mathematics.
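The abstract does not say how the training set was quasi-balanced. As a rough illustration of the general idea only, the sketch below measures the size ratio between the largest and smallest subsets and then randomly duplicates samples from small subsets until that ratio falls under a chosen bound. Random oversampling is a stand-in technique here, not the authors' procedure. The function names, the `max_ratio` parameter, and the toy subset sizes are all assumptions made for the example, and each protein carries a single label for simplicity, whereas the real problem is multi-label.

```python
import math
import random
from collections import Counter


def imbalance_ratio(labels):
    """Size of the largest subset divided by the size of the smallest."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def quasi_balance(samples, labels, max_ratio=2.0, seed=0):
    """Oversample small subsets so that largest/smallest <= max_ratio.

    Hypothetical helper: duplicates randomly chosen samples of each small
    subset; the largest subset is left unchanged.
    """
    rng = random.Random(seed)
    by_label = {}
    for sample, label in zip(samples, labels):
        by_label.setdefault(label, []).append(sample)

    largest = max(len(group) for group in by_label.values())
    target = math.ceil(largest / max_ratio)  # minimum allowed subset size

    out_samples, out_labels = [], []
    for label, group in by_label.items():
        extra = max(0, target - len(group))
        picked = group + [rng.choice(group) for _ in range(extra)]
        out_samples.extend(picked)
        out_labels.extend([label] * len(picked))
    return out_samples, out_labels


# Toy data (made-up sizes): one subset is 70 times the size of another.
labels = ["cytoplasm"] * 70 + ["periplasm"] * 10 + ["fimbrium"] * 1
samples = [f"protein_{i}" for i in range(len(labels))]

bal_samples, bal_labels = quasi_balance(samples, labels)
print(imbalance_ratio(labels), imbalance_ratio(bal_labels))
```

With these toy sizes the ratio drops from 70 to 2. Because the small subsets are filled with duplicates, cross-validation on such data would need the duplication applied to the training folds only, which is a caveat of this sketch and not a statement about the paper.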