Hybrid fragment-SMILES tokenization for ADMET prediction in drug discovery.

Affiliations

Department of Computer Science, Brock University, 1812 Sir Isaac Brock Way, St. Catharines, ON, L2S 3A1, Canada.

Digital Technologies Research Centre, National Research Council Canada, 1200 Montreal Road, Ottawa, ON, K1A 0R6, Canada.

Publication information

BMC Bioinformatics. 2024 Aug 1;25(1):255. doi: 10.1186/s12859-024-05861-z.

DOI:10.1186/s12859-024-05861-z

PMID:39090573

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11295479/

Abstract

BACKGROUND

Drug discovery and development is an extremely costly and time-consuming process of identifying new molecules that can interact with a biomarker target to interrupt the disease pathway of interest. In addition to binding the target, a drug candidate must satisfy multiple properties affecting absorption, distribution, metabolism, excretion, and toxicity (ADMET). Artificial intelligence approaches offer an opportunity to improve each step of this process. The first question to address is how a molecule can be represented informatively so that the in-silico solutions are optimized.

RESULTS

This study introduces a novel hybrid fragment-SMILES tokenization method, coupled with two pre-training strategies, for a Transformer-based model. We investigate whether hybrid tokenization improves performance on ADMET prediction tasks. Our approach builds on MTL-BERT, an encoder-only Transformer model that achieves state-of-the-art ADMET predictions, and compares standard SMILES tokenization with our hybrid method across a range of fragment library cutoffs.

CONCLUSION

The findings reveal that while an excess of fragments can impede performance, hybrid tokenization with high-frequency fragments improves results over base SMILES tokenization. This advancement underscores the potential of integrating fragment- and character-level molecular features within the training of Transformer models for ADMET property prediction.
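
The abstract does not spell out how fragments are obtained or how tokens are encoded, so the following is only a minimal sketch of the general idea: fragments that occur often enough in a corpus become single tokens, and everything else falls back to atom-level SMILES tokens. It assumes each molecule has already been split into fragment SMILES strings by some external tool. The function names, the `<...>` token format, and the `min_count` cutoff parameter are hypothetical and are not taken from the paper.

```python
import re
from collections import Counter

# Commonly used regex for atom-level SMILES tokenization: bracket atoms,
# two-letter halogens, ring-closure labels, bonds and branch symbols.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-\+\\/\(\)\.:~@\?\*\$>])"
)


def smiles_tokens(smiles):
    """Split a SMILES string into atom-level tokens."""
    return SMILES_TOKEN.findall(smiles)


def build_fragment_vocab(fragment_lists, min_count):
    """Keep fragments seen at least `min_count` times (hypothetical cutoff)."""
    counts = Counter(frag for frags in fragment_lists for frag in frags)
    return {frag for frag, count in counts.items() if count >= min_count}


def hybrid_tokenize(fragments, vocab):
    """Emit one token per in-vocabulary fragment; tokenize the rest atom by atom."""
    tokens = []
    for frag in fragments:
        if frag in vocab:
            # Assumed token format: the fragment SMILES wrapped in angle brackets.
            tokens.append(f"<{frag}>")
        else:
            tokens.extend(smiles_tokens(frag))
    return tokens


# Toy corpus: each molecule is a list of pre-computed fragment SMILES
# (illustrative values only, not data from the study).
corpus = [
    ["c1ccccc1", "C(=O)O"],
    ["c1ccccc1", "CCN"],
    ["c1ccccc1", "C(=O)O"],
]
vocab = build_fragment_vocab(corpus, min_count=2)
print(hybrid_tokenize(["c1ccccc1", "CCN"], vocab))
```

In this toy setup, raising `min_count` shrinks the fragment vocabulary toward plain SMILES tokenization, while lowering it admits rare fragments, which mirrors the trade-off the conclusion describes.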
