Department of Chemistry, Centre for Molecular Informatics, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, U.K.
Hit Discovery, Discovery Sciences, R&D, AstraZeneca, Cambridge CB4 0WG, U.K.
J Chem Inf Model. 2021 Mar 22;61(3):1444-1456. doi: 10.1021/acs.jcim.0c00864. Epub 2021 Mar 4.
Understanding the mechanism of action (MoA) of compounds and predicting potential drug targets play an important role in small-molecule drug discovery. The aim of this work was to compare chemical and cell morphology information for bioactivity prediction. The comparison used bioactivity data from the ExCAPE database, image data (in the form of CellProfiler features) from the Cell Painting data set (the largest publicly available data set of cell images, with ∼30,000 compound perturbations), and extended connectivity fingerprints (ECFPs), modeled with the multitask Bayesian matrix factorization (BMF) approach Macau. We found that BMF Macau and random forest (RF) performed similarly overall when ECFPs were used as compound descriptors. However, BMF Macau outperformed RF for 159 of 224 targets (71%) when image data were used as compound information. Using BMF Macau, 100 (about 45%) and 90 (about 40%) of the 224 targets were predicted with high performance (AUC > 0.8) with ECFP data and image data as side information, respectively. Some targets, such as β-catenin, were better predicted with image data as side information, whereas others, such as proteins belonging to the G-protein-coupled receptor 1 family, were better predicted with fingerprint-based side information; these differences could be rationalized from the underlying data distributions in each descriptor domain. In conclusion, both cell morphology changes and chemical structure contain information about compound bioactivity; this information is partially complementary and can hence contribute to MoA analysis.
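The per-target comparison summarized in the abstract (how many targets exceed AUC 0.8 with each descriptor type, and for how many one descriptor beats the other) can be illustrated with a minimal sketch. This is not the authors' pipeline: the target names, labels, and prediction scores below are hypothetical toy values, and the helper names `roc_auc` and `summarize` are assumptions introduced only for illustration. The AUC is computed with the standard pairwise (Mann-Whitney) definition.

```python
from typing import Dict, Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the fraction of (active, inactive) pairs ranked correctly.

    Ties between an active and an inactive score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one active and one inactive compound")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def summarize(auc_a: Dict[str, float], auc_b: Dict[str, float],
              threshold: float = 0.8) -> Dict[str, int]:
    """Count well-predicted targets per descriptor and head-to-head wins."""
    targets = sorted(set(auc_a) & set(auc_b))
    return {
        "n_targets": len(targets),
        "high_a": sum(auc_a[t] > threshold for t in targets),
        "high_b": sum(auc_b[t] > threshold for t in targets),
        "a_better": sum(auc_a[t] > auc_b[t] for t in targets),
        "b_better": sum(auc_b[t] > auc_a[t] for t in targets),
    }


# Hypothetical toy data: activity labels for four compounds per target,
# plus predicted scores from an ECFP-based and an image-based model.
labels = {"target_A": [1, 1, 0, 0], "target_B": [1, 0, 1, 0]}
ecfp_scores = {"target_A": [0.9, 0.8, 0.3, 0.1], "target_B": [0.9, 0.3, 0.2, 0.1]}
image_scores = {"target_A": [0.9, 0.2, 0.3, 0.1], "target_B": [0.7, 0.2, 0.6, 0.1]}

ecfp_auc = {t: roc_auc(labels[t], ecfp_scores[t]) for t in labels}
image_auc = {t: roc_auc(labels[t], image_scores[t]) for t in labels}
summary = summarize(ecfp_auc, image_auc)
print(ecfp_auc, image_auc, summary)
```

In a real analysis the AUC dictionaries would hold one entry per target from held-out predictions; the same summary then shows which targets are well predicted by only one descriptor type, which is where the two information sources are complementary.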