State Key Laboratory of Chemical Resource Engineering, Department of Pharmaceutical Engineering, Beijing University of Chemical Technology, Beijing, P. R. China.
SAR QSAR Environ Res. 2024 Jul;35(7):531-563. doi: 10.1080/1062936X.2024.2375513. Epub 2024 Jul 30.
The 3C-like proteinase (3CLpro) of novel coronaviruses is closely linked to viral replication, making it a crucial target for antiviral agents. In this study, we used two fingerprint descriptors (ECFP_4 and MACCS) to characterize the 889 compounds in our dataset. We constructed 24 classification models using machine learning algorithms, including Support Vector Machine (SVM), Random Forest (RF), extreme Gradient Boosting (XGBoost), and Deep Neural Networks (DNN). Among these models, Model 1D_2, based on DNN and ECFP_4, performed best, with a Matthews correlation coefficient (MCC) of 0.796 in 5-fold cross-validation and 0.722 on the test set. The applicability domains of the models were analysed using d calculations. The 889 compounds were clustered into 10 subsets by the K-means algorithm, and the relationships between structural fragments and inhibitory activity were analysed for the highly active compounds in each subset. In addition, based on 464 3CLpro inhibitors, 27 QSAR models were constructed using three machine learning algorithms, with a minimum root mean square error (RMSE) of 0.509 on the test set. The applicability domains of these models and the structure-activity relationships reflected by the descriptors were also analysed.
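For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how MCC (for the classification models) and RMSE (for the QSAR regression models) are conventionally computed. It is not the authors' code; the function names `mcc` and `rmse` and the toy inputs are illustrative assumptions, and returning 0.0 when the MCC denominator vanishes is a common convention rather than something stated in the abstract.

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = active, 0 = inactive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0.0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted activity values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


if __name__ == "__main__":
    # Toy values for illustration only; they are not taken from the study.
    print(mcc([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1]))
    print(rmse([6.0, 7.0], [6.5, 6.5]))
```

MCC ranges from -1 to 1 and uses all four cells of the confusion matrix, which makes it less sensitive to class imbalance than plain accuracy. RMSE is expressed in the same units as the modelled activity value.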