Wang Dong, Jin Jieyu, Shi Guqin, Bao Jingxiao, Wang Zheng, Li Shimeng, Pan Peichen, Li Dan, Kang Yu, Hou Tingjun
Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, Zhejiang, China.
Shanghai Qilu Pharmaceutical R&D Center, 576 Libing Road, Pudong New Area District, Shanghai, 310115, China.
J Cheminform. 2025 Jan 10;17(1):3. doi: 10.1186/s13321-025-00947-z.
The Caco-2 cell model has been widely used to assess the intestinal permeability of drug candidates in vitro, owing to its morphological and functional similarity to human enterocytes. Although the Caco-2 cell assay is considered safe and cost-effective, it is time-consuming. Computational models that predict Caco-2 permeability with high accuracy are therefore crucial for improving the efficiency of oral drug development. In this study, we conducted an in-depth analysis of the characteristics of an augmented Caco-2 permeability dataset and evaluated a diverse range of machine learning algorithms in combination with different molecular representations. The results indicated that XGBoost generally provided better predictions than comparable models on the test sets. In addition, we investigated the transferability of machine learning models trained on publicly available data to internal pharmaceutical industry datasets. Our findings, based on Shanghai Qilu's in-house dataset, showed that the boosting models retained a degree of predictive efficacy when applied to industry data. Furthermore, a Y-randomization test and applicability domain analysis were employed to assess the robustness and generalizability of these models, and Matched Molecular Pair Analysis (MMPA) was used to extract chemical transformation rules. We believe that the model developed in this study could serve as a reliable tool for assessing Caco-2 permeability during early-stage drug discovery, and that the chemical transformation rules derived here could provide insights for optimizing Caco-2 permeability.
Scientific contribution
A comprehensive validation of various machine learning algorithms combined with diverse molecular representations for predicting Caco-2 permeability on a large dataset is reported. The transferability of machine learning models trained on publicly available data to internal pharmaceutical industry datasets was also investigated. Matched molecular pair analysis was carried out to provide reasonable suggestions for researchers to improve the Caco-2 permeability of compounds.
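The Y-randomization test named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the data are synthetic, the single "descriptor" and the function names (`fit_line`, `r_squared`, `y_randomization`) are hypothetical, and a one-variable least-squares fit stands in for XGBoost so the example needs only the Python standard library. The idea is that a model refitted on shuffled permeability labels should score far worse than the model fitted on the true labels; if it does not, the original fit is probably a chance correlation.

```python
import random
import statistics


def fit_line(x, y):
    """Ordinary least-squares fit y ~ a*x + b for a single descriptor."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(x, y, slope, intercept):
    """Coefficient of determination of the fitted line on (x, y)."""
    mean_y = statistics.fmean(y)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def y_randomization(x, y, n_rounds=50, seed=0):
    """Return the R^2 on true labels and the R^2 values on shuffled labels."""
    rng = random.Random(seed)
    true_score = r_squared(x, y, *fit_line(x, y))
    shuffled_scores = []
    for _ in range(n_rounds):
        y_perm = list(y)
        rng.shuffle(y_perm)  # break the descriptor-label relationship
        shuffled_scores.append(r_squared(x, y_perm, *fit_line(x, y_perm)))
    return true_score, shuffled_scores


# Synthetic stand-in data (assumption): one descriptor and a noisy linear
# "log permeability" response. No real Caco-2 measurements are used here.
data_rng = random.Random(42)
descriptor = [data_rng.uniform(-2.0, 2.0) for _ in range(200)]
log_papp = [-0.5 * d + data_rng.gauss(0.0, 0.3) for d in descriptor]

true_r2, shuffled_r2 = y_randomization(descriptor, log_papp)
print(f"R^2 on true labels:          {true_r2:.3f}")
print(f"max R^2 on shuffled labels:  {max(shuffled_r2):.3f}")
```

This sketch scores each model on its own training data for brevity. In practice the comparison would normally use cross-validation or a held-out test set, with the same learner and molecular representation as the model being validated.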