Kunming University of Science and Technology, Kunming, CN 650500, China.
J Mol Graph Model. 2024 Dec;133:108851. doi: 10.1016/j.jmgm.2024.108851. Epub 2024 Aug 30.
Human oral bioavailability is a crucial factor in drug discovery. In recent years, researchers have constructed a variety of prediction models. However, because human oral bioavailability data sets are small, making accurate predictions from limited samples has become a critical issue in the field. The deep forest model, whose number of cascade levels is determined adaptively, can perform well even on small-scale data. However, the original deep forest suffers from an unbalanced multi-grained scanning process and from premature stopping of cascade forest training. In this paper, we propose a human oral bioavailability prediction method based on an improved deep forest, called balanced multi-grained scanning mapping cascade forest (bgmc-forest). First, Mordred descriptors are used for feature extraction. Enhanced features are then obtained through the improved balanced multi-grained scanning, which solves the problem of missing features at both ends of the feature vector. Finally, the prediction results are obtained by a feature mapping cascade forest, which combines principal component analysis with the cascade forest to ensure the effectiveness of the cascade forest. The superiority of the proposed model is demonstrated through comparative experiments, and the effectiveness of the improved modules is verified through ablation experiments. Finally, the decision-making process of the model is explained using the SHapley Additive exPlanations (SHAP) interpretation algorithm.
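The imbalance in multi-grained scanning mentioned in the abstract can be illustrated with a small sketch. When a fixed-size window slides over a feature vector without padding, features near either end fall inside fewer windows than features in the middle. The abstract does not say how bgmc-forest restores balance, so the wrap-around scheme below is only an assumed illustration of one possible fix, not the authors' method; the function names `window_coverage` and `circular_windows` are hypothetical.

```python
import numpy as np


def window_coverage(n_features, window, stride=1):
    """Count how many plain sliding windows contain each feature index."""
    counts = np.zeros(n_features, dtype=int)
    for start in range(0, n_features - window + 1, stride):
        counts[start:start + window] += 1
    return counts


def circular_windows(x, window, stride=1):
    """Slide a window over x with wrap-around padding.

    ASSUMPTION: wrap-around is one simple way to balance coverage;
    it is not taken from the paper.
    """
    x = np.asarray(x)
    # Append the first (window - 1) features so windows can run past the end.
    padded = np.concatenate([x, x[:window - 1]])
    return np.stack([padded[s:s + window] for s in range(0, len(x), stride)])


# Plain scanning: the end features are covered less often than the middle ones.
print(window_coverage(6, 3))

# Wrap-around scanning with stride 1: every feature appears in the same
# number of windows.
windows = circular_windows(np.arange(6), 3)
print(np.bincount(windows.ravel(), minlength=6))
```

With stride 1, each feature appears in exactly `window` windows under the wrap-around scheme, whereas plain scanning covers the first and last features only once.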