A Survey of Biological Function Prediction Methods with Focus on Natural Language Processing (NLP) and Large Language Models (LLM).

Author information

Varghese Dana Mary, Athulya T, Mohani Vikash K, Ahmad Shandar

Affiliation

School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi, India.

Publication information

Methods Mol Biol. 2025;2941:201-225. doi: 10.1007/978-1-0716-4623-6_13.

DOI:10.1007/978-1-0716-4623-6_13

PMID:40601260

Abstract

Predicting protein function from sequence, structure, gene expression profiles, and published literature is needed to understand all biological processes. Natural language processing (NLP) of biological text and large language model (LLM)-based encoding of sequence and structure open powerful paths to rapid function annotation and novel training models. In this survey, we review the available models for function prediction, especially the NLP- and LLM-based models. The survey highlights the major advances made and the ground that still needs to be covered to automate function prediction from two major sources, namely protein sequences and published research documents.
