TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology, Garching/Munich, Germany.
TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Garching/Munich, Germany.
Curr Protoc. 2021 May;1(5):e113. doi: 10.1002/cpz1.113.
Models from machine learning (ML) or artificial intelligence (AI) increasingly assist in guiding experimental design and decision making in molecular biology and medicine. Recently, Language Models (LMs) have been adapted from Natural Language Processing (NLP) to encode the implicit language written in protein sequences. Protein LMs show enormous potential in generating descriptive representations (embeddings) for proteins from just their sequences, in a fraction of the time required by previous approaches, yet with comparable or improved predictive ability. Researchers have trained a variety of protein LMs that are likely to illuminate different angles of the protein language. By leveraging the bio_embeddings pipeline and modules, simple and reproducible workflows can be laid out to generate protein embeddings and rich visualizations. Embeddings can then be used as input features in machine learning libraries to develop methods that predict particular aspects of protein function and structure. Beyond the workflows included here, embeddings have also served as proxies for traditional homology-based inference and even for aligning similar protein sequences. A wealth of possibilities remains for researchers to harness through the tools provided in the following protocols. © 2021 The Authors. Current Protocols published by Wiley Periodicals LLC.

The following protocols are included in this manuscript:
Basic Protocol 1: Generic use of the bio_embeddings pipeline to plot protein sequences and annotations
Basic Protocol 2: Generate embeddings from protein sequences using the bio_embeddings pipeline
Basic Protocol 3: Overlay sequence annotations onto a protein space visualization
Basic Protocol 4: Train a machine learning classifier on protein embeddings
Alternate Protocol 1: Generate 3D instead of 2D visualizations
Alternate Protocol 2: Visualize protein solubility instead of protein subcellular localization
Support Protocol: Join embedding generation and sequence space visualization in a pipeline
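To make the embedding-then-predict workflow concrete, the sketch below shows how per-protein embeddings might be generated with the bio_embeddings Python API and passed to a scikit-learn classifier, in the spirit of Basic Protocols 2 and 4. It is a minimal illustration under stated assumptions, not the published protocol code: it assumes the bio_embeddings package (around v0.2.x) exposes SeqVecEmbedder with embed() and reduce_per_protein(); the sequences and labels are placeholders.

```python
# Minimal sketch (not the published protocol code): embed two placeholder
# sequences with bio_embeddings and train a toy scikit-learn classifier.
import numpy as np
from bio_embeddings.embed import SeqVecEmbedder  # assumes bio_embeddings ~0.2.x
from sklearn.linear_model import LogisticRegression

sequences = [
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "MSSHEGGKKKALKQPKKQAKEMDEEEKAFKQKQ",
]
labels = [0, 1]  # placeholder annotations, e.g. two subcellular localization classes

# The embedder downloads its model weights on first use if none are supplied.
embedder = SeqVecEmbedder()

# Embed each sequence and reduce the per-residue representation to one
# fixed-length vector per protein, yielding a features matrix for sklearn.
per_protein = np.array(
    [embedder.reduce_per_protein(embedder.embed(seq)) for seq in sequences]
)

# Any scikit-learn estimator can consume the fixed-length embeddings as input features.
classifier = LogisticRegression().fit(per_protein, labels)
print(classifier.predict(per_protein))
```

In the protocols themselves, the same steps are driven by the bio_embeddings pipeline from a configuration file rather than written out by hand; the sketch only illustrates how embeddings become input features for a downstream predictor.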