Ding Xiaoyu, Cui Rongrong, Yu Jie, Liu Tiantian, Zhu Tingfei, Wang Dingyan, Chang Jie, Fan Zisheng, Liu Xiaomeng, Chen Kaixian, Jiang Hualiang, Li Xutong, Luo Xiaomin, Zheng Mingyue
Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China.
University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China.
J Med Chem. 2021 Nov 25;64(22):16838-16853. doi: 10.1021/acs.jmedchem.1c01683. Epub 2021 Nov 15.
The success of artificial intelligence (AI) models has been limited by their need for large amounts of high-quality training data, which is the opposite of what most drug discovery pipelines can supply. Active learning (AL) is a subfield of AI concerned with algorithms that select the data they need to improve their models. Here, we propose a two-phase AL pipeline and apply it to the prediction of drug oral plasma exposure. In phase I, the AL-based model demonstrated a remarkable capability to sample informative data from a noisy data set: using only 30% of the training data, it reached an accuracy of 0.856 on an independent test set. In phase II, the AL-based model explored a large, diverse chemical space (855K samples) for experimental testing and feedback. Improved accuracy and new highly confident predictions (50K samples) were observed, suggesting that the model's applicability domain has been significantly expanded.
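The idea of an algorithm that selects its own training data can be illustrated with a minimal pool-based loop. The sketch below is not the authors' pipeline: the abstract does not state which model or acquisition function was used, so this example assumes plain uncertainty sampling, a NumPy logistic regression as a stand-in model, and synthetic data. All function names, the batch size, and the data are hypothetical; only the 30% labeling budget echoes the figure quoted for phase I.

```python
import numpy as np

def _sigmoid(z):
    # Clip to avoid overflow in exp for extreme scores.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def fit_logistic(X, y, lr=0.5, steps=300):
    """Fit a logistic regression by gradient descent (stand-in model)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = _sigmoid(Xb @ w)
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def predict_proba(w, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return _sigmoid(Xb @ w)

def active_learning_loop(X_pool, y_pool, n_init=10, batch=10,
                         budget_frac=0.3, seed=0):
    """Pool-based AL: repeatedly label the samples the model is least sure of."""
    rng = np.random.default_rng(seed)
    n = len(X_pool)
    labeled = rng.choice(n, size=n_init, replace=False).tolist()
    budget = int(budget_frac * n)  # stop once this many labels are used
    while len(labeled) < budget:
        w = fit_logistic(X_pool[labeled], y_pool[labeled])
        unlabeled = np.setdiff1d(np.arange(n), labeled)
        p = predict_proba(w, X_pool[unlabeled])
        # Predicted probability closest to 0.5 means highest uncertainty.
        order = np.argsort(np.abs(p - 0.5))
        k = min(batch, budget - len(labeled))
        labeled.extend(unlabeled[order[:k]].tolist())
    final_w = fit_logistic(X_pool[labeled], y_pool[labeled])
    return np.array(labeled), final_w

# Synthetic demonstration data (assumed, not from the paper).
rng = np.random.default_rng(42)
true_w = np.array([1.5, -2.0, 0.5, 0.0, 1.0])
X_pool = rng.normal(size=(500, 5))
y_pool = (X_pool @ true_w + 0.3 * rng.normal(size=500) > 0).astype(float)
X_test = rng.normal(size=(200, 5))
y_test = (X_test @ true_w > 0).astype(float)

selected, w = active_learning_loop(X_pool, y_pool)
accuracy = np.mean((predict_proba(w, X_test) > 0.5) == y_test)
print(f"labels used: {len(selected)} of {len(X_pool)}, test accuracy: {accuracy:.3f}")
```

In a setting like phase II, the labels for the selected samples would come from new experiments rather than from a pre-labeled pool, and the loop would resume after each round of feedback.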