Xiao Xuan, Cheng Xiang, Chen Genqiang, Mao Qi, Chou Kuo-Chen
Computer Department, Jingdezhen Ceramic Institute, Jingdezhen, China.
Gordon Life Science Institute, Boston, MA 02478, United States.
Med Chem. 2019;15(5):496-509. doi: 10.2174/1573406415666181217114710.
BACKGROUND/OBJECTIVE: Knowledge of protein subcellular localization is vitally important for both basic research and drug development. Given the avalanche of protein sequences emerging in the post-genomic age, computational tools are urgently needed that can identify subcellular localization quickly and effectively from sequence information alone. Recently, a predictor called "pLoc-mVirus" was developed for identifying the subcellular localization of virus proteins. It performs far better than other predictors built for the same purpose, particularly on multi-label systems in which some proteins, known as "multiplex proteins", may simultaneously occur in, or move between, two or more subcellular location sites. Although it is a powerful predictor, further improvement is needed, because pLoc-mVirus was trained on an extremely skewed dataset in which some subsets were more than 10 times the size of others. Its predictions are therefore inevitably biased by the uneven training data.
METHODS: Using Chou's general PseAAC (Pseudo Amino Acid Composition) approach and the IHTS (Inserting Hypothetical Training Samples) treatment to balance the training dataset, we have developed a new predictor called "pLoc_bal-mVirus" for predicting the subcellular localization of multi-label virus proteins.
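To illustrate the general idea of balancing a skewed training set by inserting hypothetical samples, the following is a minimal sketch only. It is not the published IHTS procedure: it assumes each protein is already encoded as a fixed-length numeric feature vector, treats every sample as having a single location label, and swaps in a simple SMOTE-like linear interpolation between two real samples of the same class. The function name `balance_by_interpolation` and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def balance_by_interpolation(samples, labels, seed=0):
    """Oversample every minority class up to the size of the largest class.

    Each hypothetical sample is a linear interpolation between two randomly
    chosen real samples of the same class. This is an assumed, SMOTE-like
    stand-in for illustration, not the IHTS treatment used by the authors.
    """
    rng = random.Random(seed)

    # Group the real feature vectors by their class label.
    by_class = defaultdict(list)
    for vec, lab in zip(samples, labels):
        by_class[lab].append(vec)

    target = max(len(vecs) for vecs in by_class.values())

    # Start from the original data and append hypothetical samples.
    out_x, out_y = list(samples), list(labels)
    for lab, vecs in by_class.items():
        for _ in range(target - len(vecs)):
            a = rng.choice(vecs)
            b = rng.choice(vecs)
            t = rng.random()  # interpolation weight in [0, 1)
            out_x.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
            out_y.append(lab)
    return out_x, out_y


# Toy data (hypothetical): class "A" has four samples, class "B" has one.
features = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [10.0, 10.0]]
locations = ["A", "A", "A", "A", "B"]
balanced_x, balanced_y = balance_by_interpolation(features, locations)
```

After the call, every class in `balanced_y` has as many samples as the largest original class, and the original samples are kept unchanged at the front of `balanced_x`. A real multi-label setting would also need a rule for multiplex proteins that belong to several locations, which this sketch does not address.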
RESULTS: Cross-validation tests on exactly the same experiment-confirmed dataset have indicated that the proposed new predictor is remarkably superior to pLoc-mVirus, the existing state-of-the-art predictor for the same purpose.
CONCLUSION: A user-friendly web server for the predictor is available at http://www.jci-bioinfo.cn/pLoc_balmVirus/, where experimental scientists can easily obtain their desired results without working through the underlying mathematics. Accordingly, pLoc_bal-mVirus will become a very useful tool for designing multi-target drugs and for in-depth understanding of biological processes in a cell.