A Machine Learning Approach to Identify C Type Lectin Domain (CTLD) Containing Proteins.

Author information

Department of Biotechnology, Panjab University, Sector-25, Chandigarh, 160014, India.

University Institute of Engineering & Technology, Panjab University, Sector-25, Chandigarh, 160014, India.

Publication information

Protein J. 2024 Aug;43(4):718-725. doi: 10.1007/s10930-024-10224-x. Epub 2024 Jul 28.

Abstract

Lectins are sugar-binding proteins that bind specific glycans reversibly and are present in all forms of life. They have diverse biological functions, including cell signaling and molecular recognition. C-type lectins (CTL) are a group of proteins from the lectin family that have been studied extensively in animals and are reported to be involved in immune functions, carcinogenesis, and cell signaling. The carbohydrate recognition domain (CRD) in CTL has a highly variable protein sequence, and proteins carrying this domain are also referred to as C-type lectin domain containing proteins (CTLD). Because of this low sequence homology, homology-based programs are limited in their ability to identify CTLD among hypothetical proteins in sequenced genomes. Machine learning (ML) tools use characteristic features to identify homologous sequences, and ML has been used here to develop a tool for identifying CTLD. Initially, 500 sequences of well-annotated CTLD and 500 sequences of non-CTLD were used to develop the machine learning model. The Linear SVC classifier from the scikit-learn library for Python was used, and characteristic features of CTLD sequences, such as dipeptide and tripeptide composition, were used as training attributes in various classifiers. Precision, recall, and Matthews correlation coefficient (MCC) values of 0.92, 0.91, and 0.82, respectively, were obtained on an external test set. After fine-tuning parameters such as kernel, C value, gamma, and degree, and increasing the number of non-CTLD sequences, precision, recall, and MCC improved to 0.99, 0.99, and 0.96, respectively. New CTLD have also been identified in the hypothetical segment of the human genome using the trained model. The tool is available on our local server for interested users.
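
To make the feature and metric definitions in the abstract concrete, the following is a minimal sketch, not the authors' code. It assumes that "dipeptide and tripeptide composition" means the relative frequency of each overlapping 2-mer and 3-mer over the 20 standard amino acids, and that MCC is the Matthews correlation coefficient commonly reported for binary classifiers. The function names and the toy sequence are hypothetical.

```python
from itertools import product
from math import sqrt

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_composition(sequence, k):
    """Relative frequency of every possible k-mer, in a fixed lexicographic order."""
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = sequence.upper()
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing non-standard residues
            counts[window] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


def featurize(sequence):
    """Concatenate dipeptide (400) and tripeptide (8000) composition vectors."""
    return kmer_composition(sequence, 2) + kmer_composition(sequence, 3)


def binary_metrics(y_true, y_pred):
    """Precision, recall and Matthews correlation coefficient for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return precision, recall, mcc


if __name__ == "__main__":
    toy_sequence = "ACDA"  # hypothetical toy input, not a real CTLD sequence
    features = featurize(toy_sequence)
    print(len(features))  # 400 dipeptide + 8000 tripeptide features
    print(binary_metrics([1, 1, 0, 0], [1, 0, 0, 0]))
```

Fixed-length vectors of this kind could then be passed to a classifier such as scikit-learn's `LinearSVC` or `SVC`. The abstract does not describe the exact preprocessing, train/test split, or hyperparameter grid, so none of those are reproduced here.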
