Banerjee Arkaprava, Roy Kunal
Drug Theoretics and Cheminformatics Laboratory, Department of Pharmaceutical Technology, Jadavpur University, Kolkata, 700 032, India.
Sci Rep. 2025 Jan 4;15(1):808. doi: 10.1038/s41598-024-85063-y.
In this study, we adopted the classification Read-Across Structure-Activity Relationship (c-RASAR) approach to develop machine-learning (ML) models from a recently reported curated dataset on the nephrotoxicity potential of orally active drugs. We first developed ML models using nine different algorithms, applied separately to topological descriptors (referred to simply as "descriptors" in the subsequent sections of the manuscript) and MACCS fingerprints (referred to as "fingerprints" in the subsequent sections of the manuscript), thus generating 18 different ML QSAR models. Using the chemical spaces defined by the modeling descriptors and fingerprints, we computed similarity- and error-based RASAR descriptors and used the most discriminating of them to develop another set of 18 ML c-RASAR models. All 36 models were subjected to 20 repetitions of fivefold cross-validation, and their predictivity was checked on the test set. A multi-criteria decision-making strategy, the Sum of Ranking Differences (SRD) approach, was adopted to identify the best-performing model based on robustness and external validation parameters. This statistical analysis suggested that the c-RASAR models performed well overall, and the best-performing model was itself a c-RASAR model (the LDA c-RASAR model derived from topological descriptors, with MCC values of 0.229 and 0.431 for the training and test sets, respectively). This model was used to screen a true external data set prepared from the known nephrotoxic compounds in DrugBankDB, where it demonstrated good predictivity.
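The two quantitative ingredients named in the abstract, the Matthews correlation coefficient (MCC) and an SRD-style ranking of models, can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The model names and metric values are hypothetical toy numbers, the helper names (`mcc`, `average_ranks`, `srd`) are invented for illustration, and the choice of the row-wise maximum as the SRD reference is an assumption. The permutation-based validation that normally accompanies SRD is omitted.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0.0 when any marginal total is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def average_ranks(values):
    """Rank values in ascending order (1 = smallest); ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def srd(scores):
    """Sum of ranking differences for each model.

    scores maps a model name to a list of metric values, with higher
    meaning better for every metric. The reference ranking is built
    from the row-wise maximum across models (an assumption of this
    sketch). A smaller SRD means the model is closer to the reference.
    """
    n_metrics = len(next(iter(scores.values())))
    reference = [max(vals[i] for vals in scores.values()) for i in range(n_metrics)]
    ref_ranks = average_ranks(reference)
    return {
        name: sum(abs(a - b) for a, b in zip(average_ranks(vals), ref_ranks))
        for name, vals in scores.items()
    }


# Hypothetical metric values (not taken from the paper), one list per model.
toy_scores = {
    "model_A": [0.70, 0.40, 0.80],
    "model_B": [0.75, 0.30, 0.72],
    "model_C": [0.60, 0.65, 0.62],
}
srd_values = srd(toy_scores)
best_model = min(srd_values, key=srd_values.get)
```

In this toy setting `best_model` is the entry whose ranking of the metrics deviates least from the reference ranking. With real data, the metrics would usually be scaled to a common range before ranking.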