

Generative design of compounds with desired potency from target protein sequences using a multimodal biochemical language model.

Authors

Chen Hengwei, Bajorath Jürgen

Affiliation

Department of Life Science Informatics and Data Science, B-IT, Lamarr Institute for Machine Learning and Artificial Intelligence, LIMES Program Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 5/6, 53115, Bonn, Germany.

Publication

J Cheminform. 2024 May 22;16(1):55. doi: 10.1186/s13321-024-00852-x.

Abstract

Deep learning models adapted from natural language processing offer new opportunities for the prediction of active compounds via machine translation of sequential molecular data representations. For example, chemical language models are often derived for compound string transformation. Moreover, given the principal versatility of language models for translating different types of textual representations, off-the-beaten-path design tasks might be explored. In this work, we have investigated the generative design of active compounds with desired potency from target sequence embeddings, representing a rather provocative prediction task. Accordingly, a dual-component conditional language model was designed for learning from multimodal data. It comprised a protein language model component for generating target sequence embeddings and a conditional transformer for predicting new active compounds with desired potency. To this end, the designated "biochemical" language model was trained to learn mappings of combined protein sequence and compound potency value embeddings to corresponding compounds, fine-tuned on individual activity classes not encountered during model derivation, and evaluated on compound test sets that were structurally distinct from the training sets. The biochemical language model correctly reproduced known compounds with different potency for all activity classes, providing proof-of-concept for the approach. Furthermore, the conditional model consistently reproduced larger numbers of known compounds, as well as more potent compounds, than an unconditional model, revealing a substantial effect of potency conditioning. The biochemical language model also generated structurally diverse candidate compounds departing from both fine-tuning and test compounds. Overall, generative compound design based on potency value-conditioned target sequence embeddings yielded promising results, rendering the approach attractive for further exploration and practical applications.
SCIENTIFIC CONTRIBUTION: The approach introduced herein combines protein language model and chemical language model components, representing an advanced architecture, and is the first methodology for predicting compounds with desired potency from conditioned protein sequence data.
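The conditioning step described in the abstract, combining a protein sequence embedding with a potency value embedding before decoding a compound, can be illustrated with a deliberately minimal, dependency-free sketch. All function names, the hashing-based "embedding", and the dimensions below are hypothetical stand-ins for illustration only, not the authors' protein language model or transformer implementation.

```python
# Illustrative sketch (assumed names/dimensions, not the paper's code):
# build a joint conditioning vector from a target sequence and a desired
# potency value; a conditional transformer decoder would consume this
# vector when generating a compound string (e.g., SMILES).

def embed_protein(sequence, dim=8):
    """Toy stand-in for a protein language model: hash residues into a
    fixed-size vector and average over sequence length."""
    vec = [0.0] * dim
    for i, aa in enumerate(sequence):
        vec[(ord(aa) + i) % dim] += 1.0
    n = max(len(sequence), 1)
    return [v / n for v in vec]

def embed_potency(pki, dim=8):
    """Toy potency embedding: broadcast the scaled pKi value."""
    return [pki / 10.0] * dim

def condition(sequence, pki):
    """Concatenate protein and potency embeddings into the single
    conditioning vector fed to the compound decoder."""
    return embed_protein(sequence) + embed_potency(pki)

cond = condition("MKTAYIAKQR", pki=7.5)
print(len(cond))  # 16
```

Varying only the `pki` argument while holding the sequence fixed changes the conditioning vector, which is the mechanism that lets a single trained decoder target different potency levels for the same protein.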


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4916/11110441/9ac0638fa9a5/13321_2024_852_Fig1_HTML.jpg
