Nevolianis Thomas, Rittig Jan G, Mitsos Alexander, Leonhard Kai
Institute of Technical Thermodynamics, RWTH Aachen University, 52062, Aachen, Germany.
Process Systems Engineering, RWTH Aachen University, 52074, Aachen, Germany.
J Cheminform. 2025 Aug 8;17(1):123. doi: 10.1186/s13321-025-01057-6.
Accurate prediction of toluene/water partition coefficients of neutral species is crucial in drug discovery and separation processes; however, data-driven modeling of these coefficients remains challenging because experimental data are scarce. To address this scarcity, we apply multi-fidelity learning approaches that leverage a low-fidelity quantum chemical dataset of approximately 9000 entries generated by COSMO-RS and a high-fidelity experimental dataset of about 250 entries collected from the literature. We explore transfer learning, feature-augmented learning, and multi-target learning in combination with graph neural networks, and validate them on two external datasets: one with molecules similar to the training data (EXT-Zamora) and one with more challenging molecules (EXT-SAMPL9). Our results show that multi-target learning significantly improves predictive accuracy, achieving a root-mean-square error (RMSE) of 0.44 units on EXT-Zamora, compared with 0.63 units for single-task models. On EXT-SAMPL9, multi-target learning achieves an RMSE of 1.02 units, indicating reasonable performance even for more complex molecular structures. These findings highlight the potential of multi-fidelity learning approaches that leverage quantum chemical data to improve toluene/water partition coefficient predictions and to address the challenges posed by limited experimental data. We expect the methods to be applicable beyond toluene/water partition coefficients.
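The abstract does not describe how the multi-target objective is implemented. As a minimal sketch, assuming a common setup in which every molecule has a low-fidelity (COSMO-RS) label but only a few have a high-fidelity (experimental) label, the loss can be averaged over whichever labels are present. The function names `rmse` and `masked_multi_target_mse` are hypothetical, the use of `None` to mark a missing label is an assumption, and the numbers are toy values rather than data from the study.

```python
import math
from typing import Optional, Sequence


def rmse(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Root-mean-square error between two equally long sequences."""
    if len(predicted) != len(observed) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def masked_multi_target_mse(
    pred_low: Sequence[float],
    pred_high: Sequence[float],
    label_low: Sequence[Optional[float]],
    label_high: Sequence[Optional[float]],
) -> float:
    """Mean squared error over all available labels of both targets.

    Assumption (not from the source): a missing label is encoded as None
    and is skipped, so molecules without an experimental value still
    contribute through their low-fidelity label.
    """
    total = 0.0
    count = 0
    for preds, labels in ((pred_low, label_low), (pred_high, label_high)):
        for p, y in zip(preds, labels):
            if y is None:  # label not available for this molecule
                continue
            total += (p - y) ** 2
            count += 1
    if count == 0:
        raise ValueError("no labels available")
    return total / count


# Toy values for three molecules; the first has no experimental label.
loss = masked_multi_target_mse(
    pred_low=[1.0, 2.0, 3.0],
    pred_high=[0.5, 1.0, 2.0],
    label_low=[1.0, 2.5, 3.0],
    label_high=[None, 1.0, 3.0],
)
error = rmse([1.0, 2.0], [1.0, 4.0])
```

In a graph neural network, the two predictions would typically come from two output heads on a shared molecular representation; that detail is likewise an assumption here and is not specified in the abstract.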