Shen Wen-Feng, Tang He-Wei, Li Jia-Bo, Li Xiang, Chen Si
School of Medicine & School of Computer Engineering and Science, Shanghai University, Shanghai, 200444, China.
School of Pharmacy, Second Military Medical University, Shanghai, 200433, China.
J Cheminform. 2023 Jan 11;15(1):5. doi: 10.1186/s13321-022-00675-8.
Ubiquitin-specific-processing protease 7 (USP7) is a promising target protein for cancer therapy, and the identification of USP7 inhibitors has attracted considerable attention. Several studies have applied traditional virtual screening methods to discover USP7 inhibitors at lower cost and in less time. However, the accuracy of these methods remains unsatisfactory, so developing USP7 inhibitors is still difficult. In this study, multiple supervised learning classifiers were built to distinguish active USP7 inhibitors from inactive ligands. Physicochemical descriptors, MACCS keys, ECFP4 fingerprints and SMILES were first calculated to represent the compounds in our in-house dataset. Two deep learning (DL) models and nine classical machine learning (ML) models were then constructed from different combinations of these molecular representations under three activity cutoff values, giving 15 groups of experiments (75 experiments in total). Model performance in these experiments was evaluated, compared and discussed using a variety of metrics. Ensemble learning models performed best when the dataset was balanced or severely imbalanced, and SMILES-based DL performed best when the dataset was slightly imbalanced. Multimodal data fusion improved the performance of ML and DL models in some cases. In addition, SMOTE, unbiased decoy selection and SMILES enumeration each improved the performance of ML and DL models when the dataset was severely imbalanced, with SMOTE working best. Our study established highly accurate supervised learning classification models, which should accelerate the development of USP7 inhibitors. It also offers drug researchers some guidance on selecting supervised models and molecular representations and on handling imbalanced datasets.
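To illustrate the oversampling idea behind SMOTE mentioned in the abstract, the following is a minimal sketch of the basic algorithm, not the study's implementation: each synthetic minority sample is placed at a random point on the line segment between a minority sample and one of its k nearest minority-class neighbours. The function name `smote`, its parameters and the toy descriptor values are assumptions made for illustration only.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Basic SMOTE sketch: interpolate between a minority sample and one of
    its k nearest minority-class neighbours (hypothetical helper)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy descriptor vectors (hypothetical values, not from the study's dataset).
actives = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.25], [0.3, 0.3]]
new_points = smote(actives, n_synthetic=6, k=2)
```

Because the synthetic points are interpolations, applying this directly to binary fingerprints such as MACCS keys would yield fractional values, so in practice the choice of representation matters when oversampling this way.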