Giri Swagarika J, Pandey Udayan, Park Joon Hong, Kihara Daisuke
Department of Computer Science, Purdue University, West Lafayette, IN, USA.
Department of Biological Sciences, Purdue University, West Lafayette, IN, USA.
Methods Mol Biol. 2025;2941:85-99. doi: 10.1007/978-1-0716-4623-6_5.
Understanding protein function is one of the central challenges in modern biology. Protein function prediction methods typically output a list of Gene Ontology (GO) terms, sometimes 50-100 terms for a single protein. Although GO standardizes the vocabulary used to describe function, a long list of GO terms is difficult for biologists to interpret. To address this challenge, we developed Gene Ontology terms Summarizer (GO2Sum), a language model-based summarizer that takes a list of GO terms as input and converts it into a concise free-text summary describing the protein's function, subunit structure, and pathway information. GO2Sum was fine-tuned on GO term assignments and free-text function descriptions from UniProt entries. We also built a GO2Sum web server, which makes the tool easy to use for biologists.
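As a rough illustration of the input side of such a summarizer, the sketch below flattens a list of GO terms into one text string that a fine-tuned sequence-to-sequence model could consume. The `GOTerm` class, the `build_summarizer_input` function, and the `summarize <section>:` prompt layout are hypothetical, introduced only for this example; the abstract does not specify the input format GO2Sum actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GOTerm:
    go_id: str  # GO accession, e.g. "GO:0005524"
    name: str   # term name, e.g. "ATP binding"


# The three kinds of summary mentioned in the abstract.
SECTIONS = ("function", "subunit structure", "pathway")


def build_summarizer_input(terms, section="function"):
    """Flatten GO terms into one prompt string.

    Assumption: the prompt layout is illustrative only and is not
    the documented GO2Sum input format.
    """
    if section not in SECTIONS:
        raise ValueError(f"unknown section: {section!r}")
    seen = set()
    names = []
    for term in terms:
        # Drop repeated accessions while keeping the original order.
        if term.go_id not in seen:
            seen.add(term.go_id)
            names.append(term.name)
    return f"summarize {section}: " + "; ".join(names)


predicted_terms = [
    GOTerm("GO:0005524", "ATP binding"),
    GOTerm("GO:0004672", "protein kinase activity"),
    GOTerm("GO:0006468", "protein phosphorylation"),
    GOTerm("GO:0005524", "ATP binding"),  # duplicate prediction
]
prompt = build_summarizer_input(predicted_terms)
print(prompt)
```

A string like this would then be passed to the summarization model, with one call per section if separate function, subunit structure, and pathway summaries are wanted.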