1Department of Systems, Synthetic, and Quantitative Biology, Harvard Medical School, Boston, MA, USA
2Department of EECS, Massachusetts Institute of Technology, Cambridge, MA, USA
*Co-first author.
Pac Symp Biocomput. 2021;26:273-284.
Modeling the relationship between chemical structure and molecular activity is a key goal in drug development. Many benchmark tasks have been proposed for molecular property prediction, but these tasks are generally aimed at specific, isolated biomedical properties. In this work, we propose a new cross-modal small molecule retrieval task, designed to force a model to learn to associate the structure of a small molecule with the transcriptional change it induces. We formalize this task as a multi-view alignment problem, and present a coordinated deep learning approach that jointly optimizes representations of both chemical structure and perturbational gene expression profiles. We benchmark our results against oracle models and principled baselines, and find that cell line variability markedly influences performance in this domain. Our work establishes the feasibility of this new task, elucidates the limitations of current data and systems, and may serve to catalyze future research in small molecule representation learning.