Medicinal Chemistry, Biogen, Cambridge, Massachusetts 02142, United States.
J Chem Inf Model. 2024 Mar 25;64(6):1882-1891. doi: 10.1021/acs.jcim.3c01938. Epub 2024 Mar 5.
Virtual screening of large compound libraries to identify potential hit candidates is one of the earliest steps in drug discovery. As commercially available compound collections grow exponentially to the scale of billions, active learning and Bayesian optimization have recently proven effective at narrowing the search space. An essential component of these methods is a surrogate machine learning model that predicts the desired properties of compounds. An accurate surrogate achieves high sample efficiency: it finds hits after virtually screening only a fraction of the library. In this study, we examined the performance of a pretrained transformer-based language model and graph neural network in a Bayesian optimization active learning framework. The best pretrained model identifies 58.97% of the top-50,000 compounds after screening only 0.6% of an ultralarge library containing 99.5 million compounds, an 8% improvement over the previous state-of-the-art baseline. Through extensive benchmarks, we show that the superior performance of pretrained models persists in both structure-based and ligand-based drug discovery. Pretrained models can therefore boost the accuracy and sample efficiency of active learning-based virtual screening.
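To make the reported figures concrete: 0.6% of 99.5 million compounds is roughly 597,000 compounds scored, and 58.97% of the top 50,000 is roughly 29,485 hits recovered. The loop behind such a result can be sketched as below. This is a minimal illustration under stated assumptions, not the study's implementation: the surrogate is a plain ridge regression standing in for the pretrained models, the acquisition rule is greedy (select the best predicted scores), lower scores are assumed to be better (as with docking scores), and the names `greedy_active_learning`, `top_k_recall`, and `oracle` are hypothetical.

```python
import numpy as np


def greedy_active_learning(features, oracle, n_init, batch_size, n_rounds, seed=0):
    """Toy active-learning loop; returns (scored indices, index -> score)."""
    rng = np.random.default_rng(seed)
    n = features.shape[0]
    design = np.c_[features, np.ones(n)]  # add a bias column

    # Round 0: score a random initial batch with the expensive oracle.
    scored = rng.choice(n, size=n_init, replace=False).tolist()
    scores = {i: oracle(i) for i in scored}

    for _ in range(n_rounds):
        idx = np.array(scored)
        X = design[idx]
        y = np.array([scores[i] for i in scored])

        # Assumed surrogate: ridge regression, refit on everything scored so far.
        w = np.linalg.solve(X.T @ X + 1e-3 * np.eye(X.shape[1]), X.T @ y)
        pred = design @ w

        # Greedy acquisition: take the lowest predicted scores that are
        # not yet scored (lower is assumed to be better).
        pred[idx] = np.inf
        batch = np.argsort(pred)[:batch_size]
        for i in batch.tolist():
            scores[i] = oracle(i)
            scored.append(i)

    return scored, scores


def top_k_recall(true_scores, scored, k):
    """Fraction of the true top-k (lowest-scoring) compounds that were scored."""
    top_k = set(np.argsort(true_scores)[:k].tolist())
    return len(top_k & set(scored)) / k


if __name__ == "__main__":
    # Synthetic "library": 5,000 compounds with 8 made-up descriptors.
    rng = np.random.default_rng(42)
    feats = rng.normal(size=(5000, 8))
    truth = feats @ rng.normal(size=8) + 0.1 * rng.normal(size=5000)

    scored, _ = greedy_active_learning(
        feats, lambda i: float(truth[i]), n_init=50, batch_size=50, n_rounds=5
    )
    print(f"screened {len(scored) / len(truth):.1%} of the library")
    print(f"top-50 recall: {top_k_recall(truth, scored, 50):.2f}")
```

The metric computed by `top_k_recall` mirrors the one quoted in the abstract: the share of the true top-k compounds found after scoring only a small fraction of the library. A more accurate surrogate raises that share for the same screening budget, which is where a pretrained model would enter.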