Thakur Anamika, Rajput Akanksha, Kumar Manoj
Bioinformatics Centre, Institute of Microbial Technology, Council of Scientific and Industrial Research, Sector 39-A, Chandigarh-160036, India.
Mol Biosyst. 2016 Jul 19;12(8):2572-86. doi: 10.1039/c6mb00241b.
Knowledge of the subcellular location (SCL) of viral proteins in the host cell is important for understanding their function in depth. We therefore developed "MSLVP", a two-tier prediction algorithm for predicting multiple SCLs of viral proteins. For this study, comprehensive data sets of viral proteins with experimentally validated SCL annotations were collected from UniProt. A non-redundant data set (90% sequence identity) of 3480 viral proteins belonging to single (2715), double (391) and multiple (374) sites was employed. Additionally, 1687 viral proteins (30% sequence identity) were categorised into single (1366), double (167) and multiple (154) sites. Single, double and multiple locations further comprised eight, four and six categories, respectively. Viral protein locations include the nucleus, cytoplasm, endoplasmic reticulum, extracellular, single-pass membrane, multi-pass membrane, capsid, remaining others and combinations thereof. Support vector machine based models were developed using sequence features such as amino acid composition, dipeptide composition, physicochemical properties and their hybrids. We employed "one-versus-one" as well as "one-versus-other" strategies for multiclass classification. The "one-versus-one" approach performed better than the "one-versus-other" approach during 10-fold cross-validation. For the 90% data set, we achieved an accuracy, a Matthews correlation coefficient (MCC) and a receiver operating characteristic (ROC) value of 99.99%, 1.00 and 1.00 for single locations; 100.00%, 1.00 and 1.00 for double locations; and 99.90%, 1.00 and 1.00 for multiple locations. Similar results were achieved for the 30% sequence identity data set. Predictive models for each SCL performed equally well on the independent data set. The MSLVP web server can predict single (8; including single-pass and multi-pass membrane), double (4) and multiple (6) subcellular locations. This would be helpful for elucidating the functional annotation of viral proteins and identifying potential drug targets.
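To make the sequence features and the MCC metric named in the abstract concrete, the following is a minimal Python sketch and not the MSLVP implementation. It assumes that amino acid composition and dipeptide composition are expressed as percentages over the 20 standard residues, that a "hybrid" is a simple concatenation of the two vectors, and that MCC is computed in its usual binary form. The function names (`amino_acid_composition`, `dipeptide_composition`, `hybrid_features`, `mcc`) are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import product
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(seq):
    """Return 20 values: percentage of each standard residue in seq."""
    residues = [r for r in seq.upper() if r in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard residues")
    counts = Counter(residues)
    return [100.0 * counts[a] / len(residues) for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Return 400 values: percentage of each ordered residue pair in seq."""
    seq = seq.upper()
    # Only adjacent pairs made of two standard residues are counted.
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    valid = [p for p in pairs if p[0] in AMINO_ACIDS and p[1] in AMINO_ACIDS]
    if not valid:
        raise ValueError("sequence contains no standard dipeptides")
    counts = Counter(valid)
    return [100.0 * counts[a + b] / len(valid)
            for a, b in product(AMINO_ACIDS, repeat=2)]


def hybrid_features(seq):
    """Assumed hybrid: concatenation of both vectors (20 + 400 = 420 values)."""
    return amino_acid_composition(seq) + dipeptide_composition(seq)


def mcc(tp, tn, fp, fn):
    """Binary Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when the denominator is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Vectors like these could be passed to any support vector machine library; the kernel, parameters and the encoding of physicochemical properties used for MSLVP are not given in the abstract and are left out of this sketch.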