Large-Scale Pretraining Improves Sample Efficiency of Active Learning-Based Virtual Screening.

Affiliation

Medicinal Chemistry, Biogen, Cambridge, Massachusetts 02142, United States.

Publication information

J Chem Inf Model. 2024 Mar 25;64(6):1882-1891. doi: 10.1021/acs.jcim.3c01938. Epub 2024 Mar 5.

DOI:10.1021/acs.jcim.3c01938

PMID:38442000

Abstract

Virtual screening of large compound libraries to identify potential hit candidates is one of the earliest steps in drug discovery. As the size of commercially available compound collections grows exponentially to the scale of billions, active learning and Bayesian optimization have recently been proven as effective methods of narrowing down the search space. An essential component of those methods is a surrogate machine learning model that predicts the desired properties of compounds. An accurate model can achieve high sample efficiency by finding hits with only a fraction of the entire library being virtually screened. In this study, we examined the performance of a pretrained transformer-based language model and graph neural network in a Bayesian optimization active learning framework. The best pretrained model identifies 58.97% of the top-50,000 compounds after screening only 0.6% of an ultralarge library containing 99.5 million compounds, improving 8% over the previous state-of-the-art baseline. Through extensive benchmarks, we show that the superior performance of pretrained models persists in both structure-based and ligand-based drug discovery. Pretrained models can serve as a boost to the accuracy and sample efficiency of active learning-based virtual screening.
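The loop described in the abstract (a surrogate model predicts scores, the most promising compounds are screened, and the model is refit) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the surrogate here is a plain ridge regression standing in for the pretrained transformer or graph neural network, the acquisition rule is simple greedy selection, the library is synthetic, and every function and variable name is hypothetical. The last two lines only restate the abstract's reported percentages as compound counts.

```python
import numpy as np

def ridge_fit(X, y, lam=1e-3):
    """Closed-form ridge regression; a stand-in for the pretrained surrogate."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def active_learning_screen(X, oracle, init_size, batch_size, n_rounds, seed=0):
    """Greedy active learning: fit on screened compounds, screen the top predictions."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Start from a random batch, as no model exists yet.
    acquired = rng.choice(n, size=init_size, replace=False)
    scores = oracle(acquired)
    for _ in range(n_rounds):
        w = ridge_fit(X[acquired], scores)
        pred = X @ w
        pred[acquired] = -np.inf  # never re-select a screened compound
        # Indices of the batch_size highest predicted scores (unordered).
        batch = np.argpartition(-pred, batch_size)[:batch_size]
        acquired = np.concatenate([acquired, batch])
        scores = np.concatenate([scores, oracle(batch)])
    return acquired

def top_k_recovery(acquired, true_scores, k):
    """Fraction of the true top-k compounds found among the screened ones."""
    top_k = np.argsort(-true_scores)[:k]
    return np.intersect1d(acquired, top_k).size / k

# Synthetic library (assumption): random features with a noisy linear score.
rng = np.random.default_rng(42)
n, d = 20000, 16
X = rng.normal(size=(n, d))
true_scores = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
oracle = lambda idx: true_scores[idx]  # stands in for docking or an assay

acquired = active_learning_screen(X, oracle, init_size=200, batch_size=200, n_rounds=5)
recovery = top_k_recovery(acquired, true_scores, k=100)
random_baseline = acquired.size / n  # expected recovery under random screening

# The abstract's figures expressed as counts.
screened = round(0.006 * 99_500_000)  # compounds screened at 0.6% of the library
found = round(0.5897 * 50_000)        # top-50,000 compounds recovered
```

Sample efficiency in this setting is the gap between `recovery` and `random_baseline`: random screening of the same budget would be expected to recover only the screened fraction of the top-k compounds.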

Similar articles

Large-Scale Pretraining Improves Sample Efficiency of Active Learning-Based Virtual Screening.

J Chem Inf Model. 2024 Mar 25;64(6):1882-1891. doi: 10.1021/acs.jcim.3c01938. Epub 2024 Mar 5.

Bayesian models trained with HTS data for predicting β-haematin inhibition and in vitro antimalarial activity.

Bioorg Med Chem. 2015 Aug 15;23(16):5210-7. doi: 10.1016/j.bmc.2014.12.020. Epub 2014 Dec 20.

Concepts of Artificial Intelligence for Computer-Assisted Drug Discovery.

Chem Rev. 2019 Sep 25;119(18):10520-10594. doi: 10.1021/acs.chemrev.8b00728. Epub 2019 Jul 11.

Development of Ligand-based Big Data Deep Neural Network Models for Virtual Screening of Large Compound Libraries.

Mol Inform. 2018 Nov;37(11):e1800031. doi: 10.1002/minf.201800031. Epub 2018 Jun 8.

Perspectives on current approaches to virtual screening in drug discovery.

Expert Opin Drug Discov. 2024 Oct;19(10):1173-1183. doi: 10.1080/17460441.2024.2390511. Epub 2024 Aug 12.

Structure-based virtual screening of vast chemical space as a starting point for drug discovery.

Curr Opin Struct Biol. 2024 Aug;87:102829. doi: 10.1016/j.sbi.2024.102829. Epub 2024 Jun 6.

Combining computational methods for hit to lead optimization in Mycobacterium tuberculosis drug discovery.

Pharm Res. 2014 Feb;31(2):414-35. doi: 10.1007/s11095-013-1172-7. Epub 2013 Oct 17.

How to Prepare a Compound Collection Prior to Virtual Screening.

Methods Mol Biol. 2019;1939:119-138. doi: 10.1007/978-1-4939-9089-4_7.

Thompson Sampling: An Efficient Method for Searching Ultralarge Synthesis on Demand Databases.

J Chem Inf Model. 2024 Feb 26;64(4):1158-1171. doi: 10.1021/acs.jcim.3c01790. Epub 2024 Feb 5.

Machine Learning-Boosted Docking Enables the Efficient Structure-Based Virtual Screening of Giga-Scale Enumerated Chemical Libraries.

J Chem Inf Model. 2023 Sep 25;63(18):5773-5783. doi: 10.1021/acs.jcim.3c01239. Epub 2023 Sep 1.

Cited by

A bottom-up approach to find lead compounds in expansive chemical spaces.

Commun Chem. 2025 Aug 1;8(1):225. doi: 10.1038/s42004-025-01610-2.

Advancing active compound discovery for novel drug targets: insights from AI-driven approaches.

Acta Pharmacol Sin. 2025 Jun 17. doi: 10.1038/s41401-025-01591-x.

Molecular property prediction using pretrained-BERT and Bayesian active learning: a data-efficient approach to drug design.

J Cheminform. 2025 Apr 23;17(1):58. doi: 10.1186/s13321-025-00986-6.

Artificial intelligence in drug development.

Nat Med. 2025 Jan;31(1):45-59. doi: 10.1038/s41591-024-03434-4. Epub 2025 Jan 20.
