Kumar Ravi, Panwar Bharat, Chauhan Jagat S, Raghava Gajendra Ps
Bioinformatics Centre, Institute of Microbial Technology, Sector-39A, Chandigarh, India.
BMC Res Notes. 2011 Jul 20;4:237. doi: 10.1186/1756-0500-4-237.
Predicting protein function is one of the major challenges of the post-genomic era, in which protein sequences of unknown function are accumulating rapidly. Lectins are proteins that specifically recognize and bind to carbohydrate moieties present on proteins or lipids. Cancerlectins are lectins that play important roles in tumor cell differentiation and metastasis. Although the two types of proteins are related, no computational method is yet available that can distinguish cancerlectins from the large pool of non-cancerlectins. A method that can discriminate between cancerlectins and non-cancerlectins is therefore needed.
All models developed in this study are based on a non-redundant dataset of 178 cancerlectins and 226 non-cancerlectins in which no two sequences share more than 50% sequence similarity. A similarity-search approach using BLAST achieved a maximum accuracy of only 43.25%. Amino acid composition analysis showed that certain residues (e.g. leucine, proline) were preferred in cancerlectins, whereas others (e.g. aspartic acid, asparagine) were preferred in non-cancerlectins. The PROSITE domain "Crystalline beta gamma" was abundant in cancerlectins, whereas domains such as "SUEL-type lectin domain" were found mainly in non-cancerlectins. An SVM-based model using amino acid composition to discriminate between cancerlectins and non-cancerlectins achieved a maximum Matthews correlation coefficient (MCC) of 0.32 with an accuracy of 64.84%. A model based on dipeptide composition achieved an MCC of 0.30 with an accuracy of 64.84%. Models based on split composition (2 and 4 parts) achieved MCC values of 0.31 and 0.32 with accuracies of 65.10% and 66.09%, respectively. An SVM model based on position-specific scoring matrix (PSSM) profiles generated by PSI-BLAST achieved an MCC of 0.36 with an accuracy of 68.34%. Finally, an SVM model that integrates PROSITE domain information with the PSSM achieved an MCC of 0.38 with an accuracy of 69.09%.
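The composition-based features and the MCC metric named above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the percentage normalization, the handling of non-standard residues (they are skipped), and the way a sequence is cut into equal-length parts are all assumptions made for illustration. The resulting vectors would serve as input to a classifier such as an SVM.

```python
from itertools import product
from math import sqrt

# The 20 standard amino acids; any other character is ignored (assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def _clean(seq):
    """Upper-case the sequence and keep only standard residues."""
    return "".join(r for r in seq.upper() if r in AMINO_ACIDS)


def aa_composition(seq):
    """Percentage of each standard residue: a 20-dimensional vector."""
    seq = _clean(seq)
    if not seq:
        return [0.0] * len(AMINO_ACIDS)
    return [100.0 * seq.count(a) / len(seq) for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Percentage of each adjacent residue pair: a 400-dimensional vector."""
    seq = _clean(seq)
    total = len(seq) - 1  # number of overlapping dipeptides
    if total <= 0:
        return [0.0] * len(DIPEPTIDES)
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(total):
        counts[seq[i:i + 2]] += 1
    return [100.0 * counts[d] / total for d in DIPEPTIDES]


def split_composition(seq, parts):
    """Concatenate the amino acid composition of `parts` consecutive,
    roughly equal-length segments (20 * parts values)."""
    seq = _clean(seq)
    features = []
    for k in range(parts):
        start = k * len(seq) // parts
        end = (k + 1) * len(seq) // parts
        features.extend(aa_composition(seq[start:end]))
    return features


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.
    Returns 0.0 when the denominator is zero (a common convention)."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom
```

With these definitions, `split_composition(seq, 2)` yields 40 features and `split_composition(seq, 4)` yields 80, matching the "2 and 4 parts" variants in dimensionality only; the exact segmentation used in the study is not specified in the abstract.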
BLAST was found to be ineffective at distinguishing cancerlectins from non-cancerlectins. Analysis of the protein sequences of both classes revealed compositional patterns and identified PROSITE domains preferred in each class, providing insight into the two types of proteins. The method developed in this study will be useful for researchers studying cancerlectins, lectins and cancer biology. A web server based on this study is available at http://www.imtech.res.in/raghava/cancer_pred/