Institute for Biological Interfaces 5 (Institut für Biologische Grenzflächen IBG 5), Karlsruhe Institute of Technology (KIT), 76344 Eggenstein-Leopoldshafen, Germany.
Wellcome Trust Sanger Institute, Hinxton, Saffron Walden CB10 1RQ, United Kingdom.
FEMS Microbiol Rev. 2023 Jan 16;47(1). doi: 10.1093/femsre/fuad003.
Annotating protein sequences according to their biological functions is one of the key steps in understanding microbial diversity, metabolic potential, and evolutionary history. However, even in the best-studied prokaryotic genomes, not all proteins can be characterized by classical in vivo, in vitro, and/or in silico methods. This challenge is growing rapidly with the advent of next-generation sequencing technologies and the enormous expansion of 'omics' data in public databases that they have brought about. These so-called hypothetical proteins (HPs) represent a huge knowledge gap and a hidden potential for biotechnological applications. Opportunities for leveraging the available 'Big Data' have recently proliferated with the use of artificial intelligence (AI). Here, we review the aims and methods of protein annotation and explain the different principles behind machine learning and deep learning algorithms, including recent research examples, in order to assist both biologists wishing to apply AI tools in developing comprehensive genome annotations and computer scientists who want to contribute to this leading edge of biological research.