Masood Muhammad Arslan, Kaski Samuel, Cui Tianyu
Department of Computer Science, Aalto University, Espoo, Finland.
Department of Computer Science, University of Manchester, Manchester, UK.
J Cheminform. 2025 Apr 23;17(1):58. doi: 10.1186/s13321-025-00986-6.
In drug discovery, prioritizing compounds for experimental testing is a critical task that can be optimized through active learning by strategically selecting informative molecules. Active learning typically trains models on labeled examples alone, while unlabeled data is used only for acquisition. This fully supervised approach neglects valuable information present in unlabeled molecular data, impairing both predictive performance and the molecule selection process. We address this limitation by integrating a transformer-based BERT model, pretrained on 1.26 million compounds, into the active learning pipeline. This effectively disentangles representation learning from uncertainty estimation, leading to more reliable molecule selection. Experiments on the Tox21 and ClinTox datasets demonstrate that our approach achieves equivalent toxic compound identification with 50% fewer iterations than conventional active learning. Analysis reveals that pretrained BERT representations generate a structured embedding space that enables reliable uncertainty estimation despite limited labeled data, as confirmed through Expected Calibration Error measurements. This work establishes that combining pretrained molecular representations with active learning significantly improves both model performance and acquisition efficiency in drug discovery, providing a scalable framework for compound prioritization.

SCIENTIFIC CONTRIBUTION: We demonstrate that high-quality molecular representations fundamentally determine active learning success in drug discovery, outweighing acquisition strategy selection. We provide a framework that integrates pretrained transformer models with Bayesian active learning to separate representation learning from uncertainty estimation, a critical distinction in low-data scenarios. This approach establishes a foundation for more efficient screening workflows across diverse pharmaceutical applications.
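The abstract cites Expected Calibration Error (ECE) as evidence that the uncertainty estimates are reliable. As an illustration only, the following is a minimal sketch of a binary ECE computation. The abstract does not give the binning details, so the function name `expected_calibration_error`, the equal-width binning, and the default of 10 bins are assumptions rather than the authors' implementation.

```python
from typing import Sequence


def expected_calibration_error(
    probs: Sequence[float],
    labels: Sequence[int],
    n_bins: int = 10,  # assumed value; the abstract does not state a bin count
) -> float:
    """Binary ECE with equal-width confidence bins (an assumed binning scheme).

    probs  -- predicted probability of the positive class (e.g. "toxic")
    labels -- ground-truth labels, 0 or 1
    """
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    if len(probs) == 0:
        raise ValueError("at least one prediction is required")

    # Per bin: [sample count, summed confidence, summed correctness]
    bins = [[0, 0.0, 0.0] for _ in range(n_bins)]

    for p, y in zip(probs, labels):
        pred = 1 if p >= 0.5 else 0
        # Confidence in the predicted class, which lies in [0.5, 1.0]
        conf = p if pred == 1 else 1.0 - p
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx][0] += 1
        bins[idx][1] += conf
        bins[idx][2] += 1.0 if pred == y else 0.0

    n = len(probs)
    # Weighted mean of |accuracy - confidence| over non-empty bins;
    # (count / n) * |correct/count - conf/count| simplifies to the line below.
    return sum(abs(correct - conf) / n for count, conf, correct in bins if count)


# Predictions made with full confidence that are right only half the time
print(expected_calibration_error([1.0, 1.0, 1.0, 1.0], [1, 0, 1, 0]))  # 0.5
```

In an active learning setting, `probs` would typically be the model's predictive probabilities on a held-out set, recomputed after each acquisition round. A lower value means the stated confidence tracks the observed accuracy more closely.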