Analysis and prediction of cancerlectins using evolutionary and domain information.

Author information

Kumar Ravi, Panwar Bharat, Chauhan Jagat S, Raghava Gajendra Ps

Affiliation

Bioinformatics Centre, Institute of Microbial Technology, Sector-39A, Chandigarh, India.

Publication information

BMC Res Notes. 2011 Jul 20;4:237. doi: 10.1186/1756-0500-4-237.

DOI:10.1186/1756-0500-4-237

PMID:21774797

Full text link: https://pmc.ncbi.nlm.nih.gov/articles/PMC3161874/

Abstract

BACKGROUND

Predicting the function of a protein is one of the major challenges of the post-genomic era, in which protein sequences of unknown function are accumulating rapidly. Lectins are proteins that specifically recognize and bind to carbohydrate moieties present on either proteins or lipids. Cancerlectins are lectins that play various important roles in tumor cell differentiation and metastasis. Although the two types of proteins are linked, no computational method is yet available that can distinguish cancerlectins from the large pool of non-cancerlectins. Hence, it is imperative to develop a method that can distinguish between cancerlectins and non-cancerlectins.

RESULTS

All models developed in this study are based on a non-redundant dataset of 178 cancerlectins and 226 non-cancerlectins in which no two sequences share more than 50% sequence similarity. A similarity-search technique, BLAST, achieved a maximum accuracy of 43.25%. Amino acid composition analysis showed that certain residues (e.g. leucine, proline) were preferred in cancerlectins, whereas others (e.g. aspartic acid, asparagine) were preferred in non-cancerlectins. The PROSITE domain "Crystalline beta gamma" was abundant in cancerlectins, whereas domains such as "SUEL-type lectin domain" were found mainly in non-cancerlectins. An SVM-based model using amino acid composition achieved a maximum Matthews correlation coefficient (MCC) of 0.32 with an accuracy of 64.84%. A model based on dipeptide composition achieved an MCC of 0.30 with an accuracy of 64.84%. Models based on split composition (2 and 4 parts) achieved MCC values of 0.31 and 0.32 with accuracies of 65.10% and 66.09%, respectively. An SVM model based on the Position Specific Scoring Matrix (PSSM) generated by PSI-BLAST achieved an MCC of 0.36 with an accuracy of 68.34%. Finally, integrating PROSITE domain information with the PSSM yielded an SVM model with an MCC of 0.38 and an accuracy of 69.09%.
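
The sequence-derived features and the MCC metric named above can be illustrated with a minimal Python sketch. This is not the authors' code: the function names are hypothetical, composition is expressed as a percentage, split composition is assumed to use roughly equal-length segments (the abstract does not say how the parts are defined), and non-standard residues are simply dropped. The resulting vectors could then be passed to any SVM implementation.

```python
from itertools import product
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def _clean(seq):
    # Assumption: non-standard residues (X, B, Z, ...) are dropped.
    return "".join(r for r in seq.upper() if r in AMINO_ACIDS)


def aa_composition(seq):
    """Percentage of each standard residue (20 features)."""
    seq = _clean(seq)
    n = len(seq)
    return [100.0 * seq.count(a) / n if n else 0.0 for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Percentage of each ordered residue pair (400 features)."""
    seq = _clean(seq)
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    n = len(pairs)
    return [100.0 * pairs.count(a + b) / n if n else 0.0
            for a, b in product(AMINO_ACIDS, repeat=2)]


def split_composition(seq, parts=2):
    """Amino acid composition of each segment, concatenated (20 * parts features).

    Assumption: segments are contiguous and as equal in length as possible.
    """
    seq = _clean(seq)
    size, rem = divmod(len(seq), parts)
    features, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < rem else 0)
        features.extend(aa_composition(seq[start:end]))
        start = end
    return features


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def accuracy(tp, tn, fp, fn):
    """Percentage of correctly classified sequences."""
    total = tp + tn + fp + fn
    return 100.0 * (tp + tn) / total if total else 0.0
```

Because the dataset is unbalanced (178 positives against 226 negatives), MCC is a useful complement to accuracy: it is 0 for a classifier that does no better than chance and 1 for a perfect one.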

CONCLUSION

BLAST was found to be inefficient at distinguishing cancerlectins from non-cancerlectins. We analyzed the protein sequences of cancerlectins and non-cancerlectins and identified interesting patterns. We identified PROSITE domains that are preferred in cancerlectins and in non-cancerlectins, providing insight into the two types of proteins. The method developed in this study will be useful for researchers studying cancerlectins, lectins and cancer biology. A web server based on this study is available at http://www.imtech.res.in/raghava/cancer_pred/
