Hossain Delower, Saghapour Ehsan, Chen Jake Y
Department of Computer Science, The University of Alabama at Birmingham, Birmingham, AL, United States.
System Pharmacology and AI Research Center (SPARC), The University of Alabama at Birmingham, Birmingham, AL, United States.
Front Bioinform. 2025 Jul 21;5:1603133. doi: 10.3389/fbinf.2025.1603133. eCollection 2025.
Diabetes Mellitus (DM) is a global epidemic and one of the ten leading causes of mortality (WHO, 2019); it is projected to rank seventh by 2030. The US National Diabetes Statistics Report (2021) states that 38.4 million Americans have diabetes. Dipeptidyl Peptidase-4 (DPP-4) is an FDA-approved target for the treatment of type 2 diabetes mellitus (T2DM). However, current DPP-4 inhibitors may cause adverse effects, including gastrointestinal issues, severe joint pain (FDA safety warning), nasopharyngitis, hypersensitivity, and nausea. Moreover, developing novel drugs and assessing DPP-4 inhibition experimentally are both costly and often impractical. These challenges highlight the urgent need for efficient approaches to facilitate the discovery and optimization of safer and more effective DPP-4 inhibitors.
Quantitative Structure-Activity Relationship (QSAR) modeling is a widely used computational approach for evaluating the properties of chemical substances. In this study, we employed a Neuro-symbolic (NeSy) approach, specifically the Logic Tensor Network (LTN), to develop a DPP-4 QSAR model capable of identifying potential small-molecule inhibitors and classifying their bioactivity. For comparison, we also implemented baseline models using Deep Neural Networks (DNNs) and Transformers. A total of 6,563 bioactivity records (compounds represented as SMILES strings, with IC50 values) were collected from ChEMBL, PubChem, BindingDB, and GTP. Feature sets used for model training included descriptors (CDK Extended-PaDEL), fingerprints (Morgan), chemical language model embeddings (ChemBERTa-2), LLaMa 3.2 embedding features, and physicochemical properties.
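To make the Logic Tensor Network idea concrete, the following is a minimal sketch of how an LTN-style objective can be written for a binary active/inactive task. It is not the authors' implementation. The logistic predicate, the toy three-bit feature vectors, the weights, and the choice p = 2 are all assumptions made for illustration. The sketch grounds a predicate Active(x) as a value in [0, 1], approximates "for all" with the p-mean-error aggregator commonly used in the LTN literature, and scores two axioms: every known active satisfies Active(x), and every known inactive satisfies not Active(x).

```python
import math


def forall(truths, p=2.0):
    """p-mean-error aggregator: a smooth stand-in for the universal quantifier.

    Returns 1.0 when every truth value is 1.0 and 0.0 when every value is 0.0.
    """
    if not truths:
        raise ValueError("forall needs at least one truth value")
    mean_error = sum((1.0 - t) ** p for t in truths) / len(truths)
    return 1.0 - mean_error ** (1.0 / p)


def negate(truth):
    """Standard fuzzy negation."""
    return 1.0 - truth


def predicate_active(features, weights, bias):
    """Hypothetical grounding of Active(x): a single logistic unit."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def satisfaction(actives, inactives, weights, bias, p=2.0):
    """Aggregate truth of the two axioms over the labelled compounds."""
    sat_actives = forall(
        [predicate_active(x, weights, bias) for x in actives], p
    )
    sat_inactives = forall(
        [negate(predicate_active(x, weights, bias)) for x in inactives], p
    )
    return forall([sat_actives, sat_inactives], p)


# Toy feature vectors (assumed; real inputs would be descriptors or fingerprints).
actives = [[1, 0, 1], [1, 1, 1]]
inactives = [[0, 1, 0], [0, 0, 0]]

good = satisfaction(actives, inactives, weights=[2.0, -2.0, 2.0], bias=-1.0)
bad = satisfaction(actives, inactives, weights=[-2.0, 2.0, -2.0], bias=1.0)
print(f"satisfaction with sensible weights: {good:.3f}")
print(f"satisfaction with inverted weights: {bad:.3f}")
```

In a trainable setting, the quantity 1 minus the satisfaction would serve as the loss, and the predicate would be a neural network rather than a single logistic unit.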
Among all tested configurations, the Neuro-symbolic QSAR model (NeSyDPP-4) performed best using a combination of CDK extended and Morgan fingerprints. The model achieved an accuracy of 0.9725, an F1-score of 0.9723, an ROC AUC of 0.9719, and a Matthews correlation coefficient (MCC) of 0.9446. These results outperformed the baseline DNN and Transformer models, as well as existing state-of-the-art (SOTA) methods. To further assess the robustness of the model, we conducted an external evaluation using the Drug Target Commons (DTC) dataset, where NeSyDPP-4 also performed strongly, with an accuracy of 0.9579, an F1-score of 0.9577, an ROC AUC of 0.9565, and an MCC of 0.9171.
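For readers less familiar with these metrics, the sketch below shows how accuracy, F1-score, and MCC follow from the four cells of a binary confusion matrix. The counts are invented for illustration and are not taken from the study. The F1 here is the plain positive-class F1, whereas the abstract does not state which averaging the authors used. ROC AUC is omitted because it requires ranked prediction scores rather than counts.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Accuracy, positive-class F1, and MCC from confusion-matrix counts."""
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")

    accuracy = (tp + tn) / total

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )

    # MCC is defined as 0 when any marginal total is zero.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {"accuracy": accuracy, "f1": f1, "mcc": mcc}


# Made-up counts: 100 compounds, 50 active and 50 inactive.
example = binary_metrics(tp=45, fp=5, tn=45, fn=5)
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

MCC uses all four cells, so it is lower than accuracy in this example even though the classes are balanced. The same pattern appears in the reported values, where MCC sits below accuracy and F1.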
These findings suggest that the NeSyDPP-4 model not only delivered high predictive performance but also generalized to an external dataset. This approach presents a cost-effective and reliable alternative to traditional in vivo screening, offering valuable support for the identification and classification of biologically active DPP-4 inhibitors in the treatment of type 2 diabetes mellitus (T2DM).