Pham Thai-Hoang, Xie Lei, Zhang Ping
Department of Computer Science and Engineering, The Ohio State University, Columbus, USA.
Department of Computer Science, Hunter College, The City University of New York, New York City, USA; Neuroscience, Weill Cornell Medicine, New York City, USA.
Proc SIAM Int Conf Data Min. 2022;2022:720-728. doi: 10.1137/1.9781611977172.81.
Molecular design is a key challenge in drug discovery due to the complexity of chemical space. With the availability of molecular datasets and advances in machine learning, many deep generative models have been proposed for generating novel molecules with desired properties. However, most existing models focus only on molecular distribution learning and target-based molecular design, which limits their potential in real-world applications. In drug discovery, phenotypic molecular design has advantages over target-based molecular design, especially in first-in-class drug discovery. In this work, we propose the first deep graph generative model (FAME) targeting phenotypic molecular design, in particular gene expression-based molecular design. FAME leverages a conditional variational autoencoder framework to learn the conditional distribution for generating molecules from gene expression profiles. However, this distribution is difficult to learn due to the complexity of the molecular space and the noise in gene expression data. To tackle these issues, we first propose a gene expression denoising (GED) model that employs a contrastive objective function to reduce noise in gene expression data. FAME is then designed to treat molecules as sequences of fragments and to learn to generate these fragments in an autoregressive manner. By leveraging this fragment-based generation strategy and the denoised gene expression profiles, FAME can generate novel molecules with a high validity rate and the desired biological activity. The experimental results show that FAME outperforms existing methods, including both SMILES-based and graph-based deep generative models, for phenotypic molecular design. Furthermore, the noise-reduction mechanism for gene expression data proposed in our study can be applied to omics data modeling in general to facilitate phenotypic drug discovery.
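For orientation, the generic training objective of a conditional variational autoencoder with an autoregressive decoder can be written as below. This is a sketch of the standard formulation, not necessarily the exact loss used in FAME. The symbols are assumptions introduced here for illustration: m is a molecule, g its (denoised) gene expression profile, z the latent variable, f_1, ..., f_T the fragment sequence of m, and theta and phi the decoder and encoder parameters.

```latex
% Evidence lower bound for a conditional VAE (generic form; symbols are illustrative assumptions)
\log p_\theta(m \mid g) \;\ge\;
  \mathbb{E}_{q_\phi(z \mid m, g)}\!\left[ \log p_\theta(m \mid z, g) \right]
  \;-\; \mathrm{KL}\!\left( q_\phi(z \mid m, g) \,\|\, p_\theta(z \mid g) \right)

% Autoregressive factorization of the decoder over fragments f_1, ..., f_T
p_\theta(m \mid z, g) \;=\; \prod_{t=1}^{T} p_\theta\!\left( f_t \mid f_{<t},\, z,\, g \right)
```

Under these assumptions, the first term rewards reconstructing each fragment given the preceding fragments and the expression profile, while the KL term keeps the encoder's posterior close to the conditional prior, so that molecules can be sampled from an expression profile alone at generation time.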