Department of Pharmaceutical Sciences, Division of Pharmaceutical Chemistry, Faculty of Life Sciences, University of Vienna, Josef-Holaubek-Platz 2, 1090 Vienna, Austria.
Christian Doppler Laboratory for Molecular Informatics in the Biosciences, Department for Pharmaceutical Sciences, University of Vienna, 1090 Vienna, Austria.
J Chem Inf Model. 2024 May 27;64(10):4031-4046. doi: 10.1021/acs.jcim.4c00160. Epub 2024 May 13.
Machine learning methods are now widely used in drug discovery, but a chronic lack of data continues to hamper their development, validation, and application. Several modern strategies aim to mitigate data scarcity by learning from data on related tasks. These knowledge-sharing approaches include transfer learning, multitask learning, and meta-learning. A key open question is how much their performance benefits from the relatedness of the available source (training) tasks; in other words, how difficult ("hard") a test task is for a model, given the available source tasks. This study introduces a new method for quantifying and predicting the hardness of a bioactivity prediction task based on its relation to the available training tasks. The approach generates protein and chemical representations and calculates distances between the bioactivity prediction task and the available training tasks. Using meta-learning on the FS-Mol data set as an example, we demonstrate that the proposed task hardness metric is inversely correlated with performance (Pearson's correlation coefficient = -0.72). The metric will be useful for estimating the task-specific gain in performance that can be achieved through meta-learning.
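The distance-based idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the representations, the distance function, or how distances are aggregated. The sketch assumes that each task is already summarized as a fixed-length vector, uses Euclidean distance, and takes the mean distance to the k nearest training tasks as the hardness score. The names `task_hardness` and `pearson` are hypothetical, and all numbers are toy values chosen for illustration only.

```python
import numpy as np


def task_hardness(test_vec, train_vecs, k=3):
    """Hypothetical hardness score: mean Euclidean distance from a test task
    to its k nearest training tasks (assumed aggregation, not from the paper)."""
    test_vec = np.asarray(test_vec, dtype=float)
    train_vecs = np.asarray(train_vecs, dtype=float)
    # Distance from the test task to every training task.
    dists = np.linalg.norm(train_vecs - test_vec, axis=1)
    # Guard against k exceeding the number of training tasks.
    k = min(k, len(dists))
    return float(np.sort(dists)[:k].mean())


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    return float(np.corrcoef(x, y)[0, 1])


# Toy task vectors (assumption: stand-ins for combined protein/chemical
# representations of each task).
train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
tests = np.array([[0.1, 0.1], [2.0, 2.0], [4.0, 0.0]])

hardness = [task_hardness(t, train, k=2) for t in tests]

# Made-up per-task performance values, used only to show the correlation step.
performance = [0.9, 0.7, 0.5]
r = pearson(hardness, performance)
```

In this toy setup, test tasks that lie farther from the training tasks receive higher hardness scores, and the correlation with the made-up performance values is negative, which mirrors the direction of the relationship reported for FS-Mol but says nothing about its magnitude.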