Gorostiola González Marina, van den Broek Remco L, Braun Thomas G M, Chatzopoulou Magdalini, Jespers Willem, IJzerman Adriaan P, Heitman Laura H, van Westen Gerard J P
Division of Drug Discovery and Safety, Leiden Academic Centre for Drug Research, Leiden University, Leiden, The Netherlands.
ONCODE Institute, Leiden, The Netherlands.
J Cheminform. 2023 Aug 28;15(1):74. doi: 10.1186/s13321-023-00745-5.
Proteochemometric (PCM) modelling is a powerful computational drug discovery tool used in bioactivity prediction of potential drug candidates relying on both chemical and protein information. In PCM, the features computed to describe small molecules and proteins directly impact the quality of the predictive models. State-of-the-art protein descriptors, however, are calculated from the protein sequence and neglect the dynamic nature of proteins. This dynamic nature can be computationally simulated with molecular dynamics (MD). Here, novel 3D dynamic protein descriptors (3DDPDs) were designed to be applied in bioactivity prediction tasks with PCM models. As a test case, publicly available G protein-coupled receptor (GPCR) MD data from GPCRmd was used. GPCRs are membrane-bound proteins, which are activated by hormones and neurotransmitters, and constitute an important target family for drug discovery. GPCRs exist in different conformational states that allow the transmission of diverse signals and that can be modified by ligand interactions, among other factors. To translate the MD-encoded protein dynamics, two types of 3DDPDs were considered: one-hot encoded residue-specific (rs) and embedding-like protein-specific (ps) 3DDPDs. The descriptors were developed by calculating distributions of trajectory coordinates and partial charges, applying dimensionality reduction, and subsequently condensing them into vectors per residue or protein, respectively. 3DDPDs were benchmarked on several PCM tasks against state-of-the-art non-dynamic protein descriptors. Our rs- and ps3DDPDs outperformed non-dynamic descriptors in regression tasks using a temporal split and showed comparable performance with a random split and in all classification tasks. Combinations of non-dynamic descriptors with 3DDPDs did not result in increased performance. Finally, the power of 3DDPDs to capture dynamic fluctuations in mutant GPCRs was explored. The results presented here show the potential of including protein dynamic information in machine learning tasks, specifically bioactivity prediction, and open opportunities for applications in drug discovery, including oncology.
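The descriptor recipe summarised above (distributions of trajectory coordinates and partial charges, dimensionality reduction, condensation into per-residue or per-protein vectors) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the choice of summary statistics (mean, standard deviation, median), the use of one coordinate and one charge per residue, PCA via SVD as the reduction step, and mean/standard-deviation pooling for the protein-level vector are all assumptions made here for illustration only.

```python
import numpy as np


def residue_descriptors(traj, charges, n_components=3):
    """Hypothetical residue-level dynamic descriptor (illustrative only).

    traj:    array (n_frames, n_residues, 3), one coordinate per residue per frame
    charges: array (n_residues,), one partial charge per residue
    Returns an array (n_residues, n_components).
    """
    traj = np.asarray(traj, dtype=float)
    charges = np.asarray(charges, dtype=float)

    # Summarise each residue's coordinate distribution over the trajectory
    # (assumed statistics: mean, standard deviation, median per axis).
    mean = traj.mean(axis=0)
    std = traj.std(axis=0)
    median = np.median(traj, axis=0)
    feats = np.hstack([mean, std, median, charges[:, None]])  # (n_residues, 10)

    # Standardise columns so no single statistic dominates the reduction.
    scale = feats.std(axis=0)
    scale[scale == 0] = 1.0
    z = (feats - feats.mean(axis=0)) / scale

    # Dimensionality reduction: PCA computed through an SVD of the
    # standardised feature matrix (an assumed choice of method).
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    return z @ vt[:n_components].T


def protein_descriptor(res_desc):
    """Hypothetical protein-level vector: pool residue vectors by mean and std."""
    return np.concatenate([res_desc.mean(axis=0), res_desc.std(axis=0)])


# Synthetic stand-in for an MD trajectory: 50 frames, 20 residues.
rng = np.random.default_rng(0)
traj = rng.normal(size=(50, 20, 3))
charges = rng.normal(size=20)

rs = residue_descriptors(traj, charges)  # one vector per residue
ps = protein_descriptor(rs)              # one vector per protein
```

In this sketch the residue-level output keeps one row per residue, while the protein-level output is a single fixed-length vector regardless of protein length, which mirrors the residue-specific versus protein-specific distinction drawn in the abstract.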