Hossain Delower, Saghapour Ehsan, Chen Jake Y
bioRxiv. 2025 Apr 5:2025.03.31.646336. doi: 10.1101/2025.03.31.646336.
Diabetes mellitus (DM) is a global epidemic and one of the ten leading causes of mortality (WHO, 2019); it is projected to rank seventh by 2030. The US National Diabetes Statistics Report (2021) states that 38.4 million Americans have diabetes. Dipeptidyl peptidase-4 (DPP-4) is an FDA-approved target for the treatment of type 2 diabetes mellitus (T2DM). However, current DPP-4 inhibitors are associated with adverse effects, including gastrointestinal issues, severe joint pain (FDA safety warning), nasopharyngitis, hypersensitivity, and nausea, so identifying novel inhibitors is crucial. Assessing DPP-4 inhibition directly in vivo is costly and impractical, which makes in silico IC50 prediction a viable alternative. Quantitative structure-activity relationship (QSAR) modeling is a widely used computational approach for assessing chemical substances. We employ a Logic Tensor Network (LTN), a neuro-symbolic approach, with a deep neural network (DNN) and a transformer as baselines. DPP-4-related data were sourced from PubChem, ChEMBL, BindingDB, and GTP, yielding 6,563 bioactivity records (SMILES-based compounds with IC50 values) after deduplication and thresholding. A diverse set of features, including descriptors (CDK Extended-PaDEL), fingerprints (Morgan), chemical language model embeddings (ChemBERTa2), LLaMa 3.2, and physicochemical properties, was used to train the NeSyDPP4-QSAR model. NeSyDPP4-QSAR performed best when it combined CDK Extended and Morgan fingerprints, reaching an accuracy of 0.9725, an F1-score of 0.9723, an ROC AUC of 0.9719, and an MCC of 0.9446. Performance was benchmarked against two standard baselines: a deep neural network and a transformer. To ensure a fair comparison, the DNN models used the same features, with the same dimension and network configuration, as NeSyDPP4-QSAR.
Our findings show that a neuro-symbolic strategy, which combines neural network-based learning with symbolic reasoning, holds strong potential for discovering drugs to treat diabetes mellitus and for classifying the biological activity of candidate inhibitors.
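For readers less familiar with the four reported metrics, the following minimal sketch shows how accuracy, F1-score, MCC, and ROC AUC can be computed for a binary active/inactive classifier. It is an illustration only, not the authors' evaluation code: the function names, the toy labels and scores, and the 0.5 decision cutoff are all assumptions, and the F1 shown is the plain binary F1 for the active class, which may differ from the averaging used in the study.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, where 1 = active."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def f1_score(tp, tn, fp, fn):
    """Harmonic mean of precision and recall for the active class."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; defined as 0 when a margin is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def roc_auc(y_true, scores):
    """Probability that a random active outscores a random inactive.

    Ties count as half a win. Assumes both classes are present.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (NOT from the study): true labels and model scores.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
# Assumed decision cutoff of 0.5 on the model score.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

counts = confusion_counts(y_true, y_pred)
print("accuracy:", accuracy(*counts))
print("F1:", f1_score(*counts))
print("MCC:", mcc(*counts))
print("ROC AUC:", roc_auc(y_true, scores))
```

MCC uses all four cells of the confusion matrix, so it stays informative when the active and inactive classes are imbalanced, which is one reason it is often reported alongside accuracy in QSAR classification.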