Li Yanfei, Chen Xiran, Wang Shuqin, Wei Jinmao
College of Computer Science, Nankai University, 300071, Tianjin, China.
National Heart & Lung Institute, Imperial College London, SW3 6LY, London, UK.
BMC Biol. 2025 Aug 5;23(1):243. doi: 10.1186/s12915-025-02338-0.
Learning-based methods have recently demonstrated strong potential in predicting drug-protein interactions (DPIs). However, existing approaches often fail to achieve accurate predictions on real-world imbalanced datasets while maintaining high generalizability and scalability, limiting their practical applicability.
This study proposes a highly generalized model, GLDPI, aimed at improving prediction accuracy in imbalanced scenarios by preserving the topological relationships among initial molecular representations in the embedding space. Specifically, GLDPI employs dedicated encoders to transform one-dimensional sequence information of drugs and proteins into embedding representations and efficiently calculates the likelihood of DPIs using cosine similarity. Additionally, we introduce a prior loss function based on the guilt-by-association principle to ensure that the topology of the embedding space aligns with the structure of the initial drug-protein network. This design enables GLDPI to effectively capture network relationships and key features of molecular interactions, thereby significantly enhancing predictive performance.
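The scoring and loss ideas described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the encoders or the exact form of the prior loss. Assumed here, and hypothetical: the function names (`cosine_scores`, `topology_loss`), the use of random vectors in place of encoder outputs, and a squared-error penalty that pulls embedding-space cosine similarities toward the known interaction matrix and toward similarities derived from shared interaction partners (one simple reading of guilt-by-association).

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length; eps guards against all-zero rows."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)

def cosine_scores(a, b):
    """Cosine similarity between every row of a and every row of b."""
    return l2_normalize(a) @ l2_normalize(b).T

def topology_loss(drug_emb, prot_emb, interactions, drug_sim, prot_sim):
    """Hypothetical stand-in for a topology-preserving prior loss.

    Penalizes disagreement between similarities in the embedding space
    and (a) the known interaction matrix, (b) drug-drug and
    protein-protein similarities taken from the initial network.
    """
    cross = np.mean((cosine_scores(drug_emb, prot_emb) - interactions) ** 2)
    drug_term = np.mean((cosine_scores(drug_emb, drug_emb) - drug_sim) ** 2)
    prot_term = np.mean((cosine_scores(prot_emb, prot_emb) - prot_sim) ** 2)
    return float(cross + drug_term + prot_term)

# Toy data: 4 drugs, 3 proteins, 8-dimensional embeddings.
# Random vectors stand in for the outputs of the drug and protein encoders.
rng = np.random.default_rng(0)
drug_emb = rng.normal(size=(4, 8))
prot_emb = rng.normal(size=(3, 8))

# Known interactions (1 = interacting pair, 0 = unknown or non-interacting).
interactions = np.array([[1, 0, 0],
                         [0, 1, 0],
                         [0, 0, 0],
                         [1, 0, 1]], dtype=float)

# Assumed guilt-by-association prior: two drugs are similar when they
# share targets, and two proteins are similar when they share drugs.
drug_sim = cosine_scores(interactions, interactions)
prot_sim = cosine_scores(interactions.T, interactions.T)
np.fill_diagonal(drug_sim, 1.0)  # every entity is identical to itself
np.fill_diagonal(prot_sim, 1.0)

scores = cosine_scores(drug_emb, prot_emb)  # shape (4, 3), values in [-1, 1]
loss = topology_loss(drug_emb, prot_emb, interactions, drug_sim, prot_sim)
```

Because scoring reduces to a matrix product of normalized embeddings, all drug-protein pairs can be scored in one batched operation once the embeddings are computed, which is consistent with the efficiency the abstract attributes to cosine-similarity scoring.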
Experimental results highlight GLDPI's superior performance on multiple highly imbalanced benchmark datasets, achieving over a 100% improvement in the AUPR metric compared to state-of-the-art methods. Additionally, GLDPI demonstrates exceptional generalization capabilities in cold-start experiments, excelling in predicting novel drug-protein interactions. Furthermore, the model exhibits remarkable scalability, efficiently inferring approximately drug-protein pairs in less than 10 h.
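For readers less familiar with the AUPR metric cited above, the following sketch computes average precision, a common estimator of the area under the precision-recall curve. The abstract does not say how AUPR was computed in the study, so the function name `average_precision` and the toy labels and scores are illustrative assumptions, not data from the paper.

```python
import numpy as np

def average_precision(y_true, y_score):
    """Mean of the precision values at the rank of each positive example.

    Ties in y_score are broken by input order (stable sort), so tied
    scores can make the result depend on how the inputs are ordered.
    """
    y_true = np.asarray(y_true, dtype=float)
    order = np.argsort(-np.asarray(y_score, dtype=float), kind="stable")
    y_sorted = y_true[order]
    true_positives = np.cumsum(y_sorted)
    precision_at_k = true_positives / np.arange(1, len(y_sorted) + 1)
    return float(np.sum(precision_at_k * y_sorted) / y_sorted.sum())

# Imbalanced toy set: 2 positives among 10 candidate pairs.
y_true = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]

# Positives sit at ranks 2 and 5, so AP = (1/2 + 2/5) / 2 = 0.45.
ap = average_precision(y_true, y_score)

# A ranking that places both positives first reaches the maximum of 1.0.
ap_perfect = average_precision(y_true, [0.1, 0.9, 0.1, 0.1, 0.8, 0.1, 0.1, 0.1, 0.1, 0.1])
```

On imbalanced data the expected score of a random ranking is roughly the positive rate (0.2 in this toy set), which is why AUPR is more informative than accuracy when positives are rare. An improvement of over 100% means the metric more than doubles relative to the baseline.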