Institute of Science and Technology for Brain-Inspired Intelligence and MOE Frontiers Center for Brain Science, Fudan University, Shanghai 200433, China.
Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Kyoto Prefecture 611-0011, Japan.
Bioinformatics. 2024 Sep 1;40(Suppl 2):ii53-ii61. doi: 10.1093/bioinformatics/btae401.
The vast majority of proteins still lack experimentally validated functional annotations, which highlights the importance of developing high-performance automated protein function prediction/annotation (AFP) methods. While existing approaches focus on protein sequences, networks, and structural data, textual information related to proteins has been overlooked, even though roughly 82% of SwissProt proteins already have expert-annotated literature information. To use this literature information efficiently and effectively, we present GORetriever, a two-stage deep information retrieval-based method for AFP. Given a target protein, the first stage retrieves candidate Gene Ontology (GO) terms from annotated proteins with similar descriptions. The second stage reranks these GO terms by semantic matching between the GO definitions and the textual information (literature and protein description) of the target protein. Extensive experiments on benchmark datasets demonstrate the effectiveness of GORetriever in improving AFP performance. Note that GORetriever is the key component of GOCurator, which achieved first place in the latest Critical Assessment of protein Function Annotation (CAFA5, held in 2023-2024, with over 1600 participating teams).
GORetriever is publicly available at https://github.com/ZhuLab-Fudan/GORetriever.
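The two-stage retrieve-then-rerank idea described in the abstract can be illustrated with a minimal sketch. This is not the GORetriever implementation: the abstract describes deep information retrieval models, whereas the sketch below swaps in a plain bag-of-words cosine similarity so that it runs with the Python standard library alone. All protein records, GO identifiers (`GO:kinase`, `GO:atp`, `GO:dna`), definitions, function names, and the `top_k` parameter are hypothetical toy values chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Bag-of-words cosine similarity (a stand-in for a deep text matcher)."""
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_candidates(target_desc, annotated, top_k=2):
    """Stage 1: pool GO terms from the top_k annotated proteins whose
    descriptions are most similar to the target description."""
    ranked = sorted(
        annotated,
        key=lambda p: cosine(target_desc, p["description"]),
        reverse=True,
    )
    candidates = set()
    for protein in ranked[:top_k]:
        candidates.update(protein["go_terms"])
    return candidates


def rerank(target_text, candidates, go_definitions):
    """Stage 2: score each candidate GO term by matching its definition
    against the target's text, then sort by descending score."""
    scored = [(go, cosine(target_text, go_definitions[go])) for go in candidates]
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Hypothetical toy data; these are not real GO identifiers or definitions.
ANNOTATED = [
    {"description": "serine threonine protein kinase",
     "go_terms": ["GO:kinase", "GO:atp"]},
    {"description": "zinc finger transcription factor",
     "go_terms": ["GO:dna"]},
    {"description": "tyrosine protein kinase receptor",
     "go_terms": ["GO:kinase"]},
]
GO_DEFINITIONS = {
    "GO:kinase": "catalysis of the transfer of a phosphate group to a "
                 "substrate kinase activity",
    "GO:atp": "binding to ATP adenosine triphosphate",
    "GO:dna": "binding to DNA",
}
TARGET_DESC = "putative protein kinase"
# Target text = protein description plus (toy) literature text.
TARGET_TEXT = (TARGET_DESC
               + " phosphorylates substrate proteins; kinase activity shown in vitro")

if __name__ == "__main__":
    cands = retrieve_candidates(TARGET_DESC, ANNOTATED, top_k=2)
    for go_id, score in rerank(TARGET_TEXT, cands, GO_DEFINITIONS):
        print(go_id, round(score, 3))
```

In this toy run, stage 1 pools candidates only from the two kinase-like proteins, so the DNA-binding term is never considered, and stage 2 places the kinase term first because its definition shares tokens with the target text. A real system would replace `cosine` with learned retrieval and reranking models.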