Giri Swagarika Jaharlal, Ibtehaz Nabil, Kihara Daisuke
Department of Computer Science, Purdue University, West Lafayette, IN, United States.
Department of Biological Sciences, Purdue University, West Lafayette, IN, United States.
bioRxiv. 2023 Nov 15:2023.11.10.566665. doi: 10.1101/2023.11.10.566665.
Understanding the biological functions of proteins is of fundamental importance in modern biology. Protein function is frequently represented with Gene Ontology (GO), a controlled vocabulary, because GO terms are easy for computer programs to handle and avoid open-ended text interpretation. In particular, the majority of current protein function prediction methods rely on GO terms. However, the extensive list of GO terms that describes a protein's function can be difficult for biologists to interpret. To address this issue, we developed GO2Sum (Gene Ontology terms Summarizer), a model that takes a set of GO terms as input and generates a human-readable summary using the T5 large language model. GO2Sum was developed by fine-tuning T5 on GO term assignments and free-text function descriptions for UniProt entries, enabling it to recreate function descriptions from concatenated GO term descriptions. Our results demonstrate that GO2Sum significantly outperforms the original T5 model, which was trained on the entire web corpus, in generating Function, Subunit Structure, and Pathway paragraphs for UniProt entries.
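As a rough illustration of the input side described above, the sketch below joins the descriptions of a protein's GO terms into a single text string of the kind a sequence-to-sequence model such as T5 could take as input. It is only a sketch under stated assumptions: the function name `build_model_input`, the small `GO_TERM_NAMES` lookup table, the "summarize: " prefix, and the "; " separator are hypothetical choices made for illustration and are not taken from the GO2Sum implementation.

```python
# Hypothetical lookup table: GO identifier -> term name.
# A real pipeline would load these from the Gene Ontology release files.
GO_TERM_NAMES = {
    "GO:0004672": "protein kinase activity",
    "GO:0005524": "ATP binding",
    "GO:0006468": "protein phosphorylation",
}


def build_model_input(go_ids, term_names, prefix="summarize: "):
    """Concatenate GO term descriptions into one input string.

    The prefix and the "; " separator are assumptions for this sketch,
    not the format used by the authors.
    """
    missing = [go_id for go_id in go_ids if go_id not in term_names]
    if missing:
        raise KeyError(f"No description available for: {missing}")
    # Deduplicate and sort so the same term set always gives the same text.
    descriptions = [term_names[go_id] for go_id in sorted(set(go_ids))]
    return prefix + "; ".join(descriptions)


model_input = build_model_input(
    ["GO:0005524", "GO:0004672", "GO:0006468", "GO:0005524"],
    GO_TERM_NAMES,
)
print(model_input)
```

The resulting string would then be tokenized and passed to the fine-tuned model, which generates the free-text summary. That step is omitted here because it depends on the specific checkpoint and library used.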