pLoc_bal-mVirus: Predict Subcellular Localization of Multi-Label Virus Proteins by Chou's General PseAAC and IHTS Treatment to Balance Training Dataset.

Authors

Xiao Xuan, Cheng Xiang, Chen Genqiang, Mao Qi, Chou Kuo-Chen

Affiliations

Computer Department, Jingdezhen Ceramic Institute, Jingdezhen, China.

Gordon Life Science Institute, Boston, MA 02478, United States.

Publication information

Med Chem. 2019;15(5):496-509. doi: 10.2174/1573406415666181217114710.

DOI:10.2174/1573406415666181217114710

PMID:30556503

Abstract

BACKGROUND/OBJECTIVE

Knowledge of protein subcellular localization is vitally important for both basic research and drug development. Given the avalanche of protein sequences emerging in the post-genomic age, computational tools are urgently needed to identify their subcellular localization quickly and effectively from sequence information alone. Recently, a predictor called "pLoc-mVirus" was developed for identifying the subcellular localization of virus proteins. It performs far better than the other predictors built for the same purpose, particularly on multi-label systems in which some proteins, known as "multiplex proteins", may simultaneously occur in, or move between, two or more subcellular location sites. Although it is a very powerful predictor, further effort is needed to improve it, because pLoc-mVirus was trained on an extremely skewed dataset in which some subsets were over 10 times the size of others. Accordingly, it cannot avoid the bias caused by such an uneven training dataset.

METHODS

Using Chou's general PseAAC (Pseudo Amino Acid Composition) approach and the IHTS (Inserting Hypothetical Training Samples) treatment to balance the training dataset, we have developed a new predictor called "pLoc_bal-mVirus" for predicting the subcellular localization of multi-label virus proteins.
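As a rough illustration of the two ideas named above (a sequence-derived feature vector, and inserting hypothetical samples so that subsets reach comparable sizes), the following Python sketch may help. It is not the authors' implementation: it uses plain amino acid composition in place of the full PseAAC, and linear interpolation between existing samples in place of the actual IHTS procedure, whose details the abstract does not give. The function names, subset labels, and toy sequences are all hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aa_composition(seq):
    """Return the 20-dimensional amino acid composition of a sequence.

    Assumption: a simplified stand-in for PseAAC features.
    """
    seq = seq.upper()
    counts = [seq.count(a) for a in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [c / total for c in counts]


def insert_hypothetical_samples(samples, target_size, rng):
    """Grow a subset to target_size by interpolating between random pairs.

    Assumption: generic interpolation-based oversampling, not the
    published IHTS algorithm.
    """
    grown = list(samples)
    while len(grown) < target_size:
        if len(samples) < 2:
            raise ValueError("need at least two samples to interpolate")
        a, b = rng.sample(samples, 2)
        t = rng.random()
        grown.append([x + t * (y - x) for x, y in zip(a, b)])
    return grown


def balance(subsets, seed=0):
    """Bring every subset up to the size of the largest one."""
    rng = random.Random(seed)
    largest = max(len(v) for v in subsets.values())
    return {
        label: insert_hypothetical_samples(v, largest, rng)
        for label, v in subsets.items()
    }


# Toy data: made-up sequences and labels, for illustration only.
subsets = {
    "host nucleus": [
        aa_composition(s)
        for s in ["MKVLAAGIV", "MSTNPKPQR", "MALWMRLLP", "MGSSHHHHH"]
    ],
    "host cytoplasm": [
        aa_composition(s) for s in ["MEEPQSDPS", "MKTAYIAKQ"]
    ],
}
balanced = balance(subsets)
for label, vectors in balanced.items():
    print(label, len(vectors))
```

Because each hypothetical vector is a convex combination of two composition vectors, its components still sum to 1, so the inserted samples remain valid compositions.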

RESULTS

Cross-validation tests on exactly the same experiment-confirmed dataset indicate that the proposed new predictor is remarkably superior to pLoc-mVirus, the existing state-of-the-art predictor for the same purpose.

CONCLUSION

A user-friendly web server for pLoc_bal-mVirus is available at http://www.jci-bioinfo.cn/pLoc_balmVirus/, through which most experimental scientists can easily obtain their desired results without working through the underlying mathematics. Accordingly, pLoc_bal-mVirus is expected to become a very useful tool for designing multi-target drugs and for in-depth understanding of biological processes in a cell.
