Ha Son V, Jaensch Steffen, Kańduła Maciej M, Herman Dorota, Czodrowski Paul, Ceulemans Hugo
Johnson & Johnson, Beerse, Belgium.
Department of Chemistry, Johannes Gutenberg University Mainz, Mainz, Germany.
Sci Rep. 2025 Jul 2;15(1):23010. doi: 10.1038/s41598-025-05914-0.
In drug discovery, different data modalities (chemical structure, cell biology, quantum mechanics, etc.) are abundant, and integrating them can help in understanding chemistry, biology, and their interactions. Within cell biology, cell painting (CP) and transcriptomics RNA-Seq (TX) screens are powerful tools in early drug discovery, as they provide complementary views of the biological effect of compounds on a population of cells after treatment. While multimodal learning has been explored for chemical structure paired with cell painting and for different omics data, a multimodal model of cell painting and bulk transcriptomics remains unexplored. In this work, we benchmark two representation learning methods: contrastive learning and a bimodal autoencoder. We use a cross-modality learning setting in which representation learning is performed with two modalities (CP and TX), but only cell painting is available for generating embeddings of new compounds and for downstream tasks. This is because, for new compounds, we would only have CP data and not TX, owing to the high data generation cost of the RNA-Seq screen. We show that, in the absence of TX features for new compounds, using learned embeddings such as those obtained from contrastive learning enhances the performance of CP features on tasks where TX features excel but CP features do not. Additionally, we observed that the learned representations improve cluster quality when clustering CP replicates and different mechanisms of action (MoA), and improve performance on several subsets of bioactivity tasks grouped by protein target family.
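The contrastive, cross-modality setting described above can be illustrated with a minimal sketch. The abstract does not specify the loss, encoders, or hyperparameters, so everything here is an assumption: a symmetric InfoNCE (CLIP-style) objective stands in for "contrastive learning", and the names `symmetric_info_nce`, `cp_emb`, `tx_emb`, and `temperature` are hypothetical. The function takes embeddings already produced by a CP encoder and a TX encoder for the same batch of compounds; row i of each matrix is treated as a matched pair.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(cp_emb, tx_emb, temperature=0.1):
    """Symmetric InfoNCE loss (an assumed stand-in, not the paper's stated loss).

    cp_emb, tx_emb: arrays of shape (n_compounds, dim); row i of each
    is assumed to come from the same compound.
    """
    cp = l2_normalize(cp_emb)
    tx = l2_normalize(tx_emb)
    logits = cp @ tx.T / temperature          # (n, n) similarity matrix
    idx = np.arange(len(cp))
    # CP -> TX: each CP row should pick out its own TX row, and vice versa.
    loss_cp_to_tx = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_tx_to_cp = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_cp_to_tx + loss_tx_to_cp)

# Toy check on synthetic data: matched pairs should give a lower loss
# than deliberately mismatched pairs.
rng = np.random.default_rng(0)
tx = rng.normal(size=(8, 16))
aligned = symmetric_info_nce(tx + 0.01 * rng.normal(size=(8, 16)), tx)
shuffled = symmetric_info_nce(tx[::-1], tx)
```

In a setup like this, both encoders would be trained jointly on compounds that have CP and TX profiles, and only the CP encoder would be applied to new compounds, which matches the constraint that TX data is unavailable at inference time.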