Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore.
Chern Institute of Mathematics and LPMC, Nankai University, Tianjin, China.
PLoS Comput Biol. 2022 Apr 6;18(4):e1009943. doi: 10.1371/journal.pcbi.1009943. eCollection 2022 Apr.
With great advances in experimental data, computational power and learning algorithms, artificial intelligence (AI) based drug design has recently begun to gain momentum. AI-based drug design holds great promise to revolutionize the pharmaceutical industry by significantly reducing the time and cost of drug discovery. However, a major issue remains for all AI-based learning models: efficient molecular representation. Here we propose, for the first time, Dowker complex (DC) based molecular interaction representations and Riemann zeta function based molecular featurization. Molecular interactions between proteins and ligands (or other molecules) are modeled as Dowker complexes. A multiscale representation is generated through a filtration process, during which a series of DCs are generated at different scales. Combinatorial (Hodge) Laplacian matrices are constructed from these DCs, and the Riemann zeta functions computed from their spectral information are used as molecular descriptors. To validate our models, we consider protein-ligand binding affinity prediction. Our DC-based machine learning (DCML) models, in particular the DC-based gradient boosting tree (DC-GBT), are tested on the three most commonly used datasets, namely PDBbind-2007, PDBbind-2013 and PDBbind-2016, and extensively compared with existing state-of-the-art models. We find that our DC-based descriptors achieve state-of-the-art results and outperform all machine learning models that use traditional molecular descriptors. Our Dowker complex based machine learning models can also be used in other tasks in AI-based drug design and molecular data analysis.
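The pipeline described above (Dowker complex, filtration, Hodge Laplacian, spectral zeta function) can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: the function names `dowker_laplacian0`, `spectral_zeta` and `zeta_features` are hypothetical, the relation between protein and ligand atoms is a plain Euclidean distance cutoff, only the 0-dimensional Laplacian (the graph Laplacian of the complex's 1-skeleton) is built, and the zeta function is taken as the sum of \lambda^{-s} over the positive eigenvalues. The authors' actual featurization may differ in all of these respects.

```python
import numpy as np

def dowker_laplacian0(protein_xyz, ligand_xyz, cutoff):
    """0-dimensional Hodge Laplacian of a Dowker complex on protein atoms.

    Assumption: a protein atom and a ligand atom are related when their
    distance is <= cutoff. Two protein atoms span an edge when they share
    at least one ligand atom as a common witness.
    """
    diff = protein_xyz[:, None, :] - ligand_xyz[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    rel = (dist <= cutoff).astype(int)
    # Vertices of the complex: protein atoms with at least one witness.
    rel = rel[rel.any(axis=1)]
    shared = rel @ rel.T                      # number of common witnesses
    adj = (shared > 0).astype(float)
    adj *= 1.0 - np.eye(len(adj))             # drop self-loops
    return np.diag(adj.sum(axis=1)) - adj     # L0 = D - A

def spectral_zeta(laplacian, s, tol=1e-8):
    """Sum of lambda^(-s) over the positive eigenvalues (assumed form)."""
    if laplacian.size == 0:
        return 0.0
    eig = np.linalg.eigvalsh(laplacian)
    pos = eig[eig > tol]
    return float(np.sum(pos ** (-s)))

def zeta_features(protein_xyz, ligand_xyz, cutoffs, s_values):
    """Multiscale descriptor: one zeta value per (cutoff, s) pair."""
    feats = []
    for c in cutoffs:                         # filtration over scales
        lap = dowker_laplacian0(protein_xyz, ligand_xyz, c)
        feats.extend(spectral_zeta(lap, s) for s in s_values)
    return np.array(feats)

# Toy coordinates (hypothetical, in arbitrary length units).
protein = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
ligand = np.array([[1.0, 0.0, 0.0], [10.0, 1.0, 0.0]])
features = zeta_features(protein, ligand, cutoffs=[0.5, 1.5], s_values=[1, 2])
print(features)
```

A feature vector of this kind, computed per complex, could then be passed to any regressor such as a gradient boosting tree. The choice of cutoffs and exponents here is purely illustrative.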