Department of Computer Science and Technology, University of Cambridge, Cambridge, UK.
Molecular AI, BioPharmaceuticals R&D, AstraZeneca, Gothenburg, Sweden.
Nat Commun. 2024 Feb 26;15(1):1517. doi: 10.1038/s41467-024-45566-8.
We investigate the potential of graph neural networks for transfer learning and for improving molecular property prediction on sparse, expensive-to-acquire high-fidelity data by leveraging low-fidelity measurements as an inexpensive proxy for a targeted property of interest. This problem arises in discovery processes that rely on screening funnels to trade off overall cost against throughput and accuracy. Typically, the individual stages in these processes are loosely connected, and each one generates data at a different scale and fidelity. We consider this setup holistically and demonstrate empirically that existing transfer learning techniques for graph neural networks are generally unable to harness the information from multi-fidelity cascades. Here, we propose several effective transfer learning strategies and study them in transductive and inductive settings. Our analysis involves a collection of more than 28 million unique experimental protein-ligand interactions across 37 targets from drug discovery by high-throughput screening, and 12 quantum properties from the QMugs dataset. The results indicate that transfer learning can improve performance on sparse tasks by up to eight times while using an order of magnitude less high-fidelity training data. Moreover, the proposed methods consistently outperform existing transfer learning strategies for graph-structured data on drug discovery and quantum mechanics datasets.