GlycoSiteMiner: an ML/AI-assisted literature mining-based pipeline for extracting glycosylation sites from PubMed abstracts.

Author information

Kahsay Robel, Bhuiyan Urnisha, Au Cyrus Chun Hong, Edwards Nathan, Johnson Luke, Kulkarni Sujeet, Martinez Karina, Ranzinger Rene, Vijay-Shanker K, Vora Jeet, Warner Kate, Tiemeyer Michael, Mazumder Raja

Affiliations

Department of Biochemistry & Molecular Medicine, The George Washington School of Medicine and Health Sciences, 2300 Eye Street NW, Washington, DC 20052, United States.

Department of Biochemistry and Molecular & Cellular Biology, Georgetown University, 37th and O Street NW, Washington, DC 20007, United States.

Publication information

Glycobiology. 2025 Jun 2;35(7). doi: 10.1093/glycob/cwaf030.

Abstract

Over 50% of human proteins are estimated to be glycosylated, making glycosylation one of the most common post-translational modifications (PTMs) of proteins. A glycoinformatics resource such as the GlyGen knowledgebase, consisting of experimentally verified sequence-specific glycosylation sites, is critical for advancing research in glycobiology. Unfortunately, most experimental studies report glycosylation sites in free text format in scientific literature, mentioning gene names and amino acid positions without providing protein sequence identifiers, making it difficult to mine reported sites that can be mapped onto specific protein sequences. We have developed GlycoSiteMiner, which is an automated literature mining-based pipeline that extracts experimentally verified protein sequence-specific glycosylation sites from PubMed abstracts. The pipeline employs ML/AI algorithms to filter out incorrectly identified sites and has been applied to 33 million PubMed abstracts, identifying 1118 new sequence-specific glycosylation sites that were not previously present in the GlyGen resource.
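As a rough illustration of the extraction problem described above, the following minimal sketch pulls candidate residue-position mentions out of free text and checks them against a protein sequence. It is not the GlycoSiteMiner implementation: the regular expression, the residue table, and the function names `extract_site_mentions` and `site_matches_sequence` are all hypothetical, and a real pipeline would also need gene-name recognition and a lookup of the corresponding protein sequence.

```python
import re

# Three-letter to one-letter codes for residues commonly reported as glycosylated
# (N-linked: Asn; O-linked: Ser/Thr). This short table is an illustrative assumption.
THREE_TO_ONE = {"Asn": "N", "Ser": "S", "Thr": "T"}

# Matches mentions such as "Asn-297", "Asn297", "Thr 45", or one-letter forms like "N297".
SITE_PATTERN = re.compile(r"\b(Asn|Ser|Thr|[NST])[- ]?(\d{1,5})\b")


def extract_site_mentions(text):
    """Return (one-letter residue, position) pairs for candidate site mentions."""
    sites = []
    for match in SITE_PATTERN.finditer(text):
        residue, position = match.group(1), int(match.group(2))
        # One-letter matches are not in the table and are kept unchanged.
        sites.append((THREE_TO_ONE.get(residue, residue), position))
    return sites


def site_matches_sequence(sequence, residue, position):
    """Check that the 1-based position exists and holds the reported residue."""
    return 1 <= position <= len(sequence) and sequence[position - 1] == residue
```

A sequence check of this kind is one simple way to discard mentions that cannot be mapped onto a given protein sequence; the abstract states only that the pipeline uses ML/AI algorithms to filter incorrectly identified sites and does not describe this particular check.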
