Descriptor-Free Deep Learning QSAR Model for the Fraction Unbound in Human Plasma.

Author information

Riedl Michael, Mukherjee Sayak, Gauthier Mitch

Affiliation

Battelle, Columbus, Ohio 43201, United States.

Publication information

Mol Pharm. 2023 Oct 2;20(10):4984-4993. doi: 10.1021/acs.molpharmaceut.3c00129. Epub 2023 Sep 1.

Abstract

Chemical-specific parameters are either measured in vitro or estimated using quantitative structure-activity relationship (QSAR) models. The existing body of QSAR work relies on extracting a set of descriptors or fingerprints, subset selection, and training a machine learning model. In this work, we used a state-of-the-art natural language processing model, Bidirectional Encoder Representations from Transformers, which allowed us to circumvent the need for calculation of these chemical descriptors. In this approach, simplified molecular-input line-entry system (SMILES) strings were embedded in a high-dimensional space using a two-stage training approach. The model was first pre-trained on a masked SMILES token task and then fine-tuned on a QSAR prediction task. The pre-training task learned meaningful high-dimensional embeddings based upon the relationships between the chemical tokens in the SMILES strings derived from the "in-stock" portion of the ZINC 15 dataset, a large dataset of commercially available chemicals. The fine-tuning task then perturbed the pre-trained embeddings to facilitate prediction of a specific QSAR endpoint of interest. The power of this model stems from the ability to reuse the pre-trained model for multiple different fine-tuning tasks, reducing the computational burden of developing multiple models for different endpoints. We used our framework to develop a predictive model for fraction unbound in human plasma. This approach is flexible, requires minimum domain expertise, and can be generalized for other parameters of interest for rapid and accurate estimation of absorption, distribution, metabolism, excretion, and toxicity.
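The masked SMILES token task mentioned in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the regex tokenizer, the 15% masking rate (the usual BERT default), the `[MASK]` placeholder, and the function names `tokenize` and `mask_tokens`. The abstract does not say how the authors tokenized or masked their SMILES strings.

```python
import random
import re

# Assumed regex-style SMILES tokenizer: bracket atoms, two-letter halogens,
# two-digit ring closures, single-letter atoms, ring digits, bond/branch symbols.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-+\\/()\.@:~*$]"
)


def tokenize(smiles):
    """Split a SMILES string into chemical tokens."""
    tokens = SMILES_TOKEN.findall(smiles)
    # Reject strings containing characters the pattern does not cover.
    if "".join(tokens) != smiles:
        raise ValueError(f"untokenizable characters in {smiles!r}")
    return tokens


def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", rng=None):
    """Hide a random subset of tokens; return the masked list and the answers.

    The returned dict maps each masked position to its original token, which
    is what a masked-token model would be trained to predict.
    """
    rng = rng or random.Random()
    # Mask at least one token, but never more than the sequence holds.
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_rate)))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = tokens[pos]
        masked[pos] = mask_token
    return masked, labels


if __name__ == "__main__":
    aspirin = tokenize("CC(=O)Oc1ccccc1C(=O)O")
    masked, labels = mask_tokens(aspirin, rng=random.Random(0))
    print(masked)
    print(labels)
```

In a pre-training loop of this kind, the model would see `masked` as input and be scored on recovering the tokens in `labels`. Fine-tuning would then replace the token-prediction head with a regression head for the endpoint of interest.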
