Fooladi Hosein, Vu Thi Ngoc Lan, Mathea Miriam, Kirchmair Johannes
Department of Pharmaceutical Sciences, Division of Pharmaceutical Chemistry, Faculty of Life Sciences, University of Vienna, Josef-Holaubek-Platz 2, 1090 Vienna, Austria.
Christian Doppler Laboratory for Molecular Informatics in the Biosciences, Department of Pharmaceutical Sciences, University of Vienna, 1090 Vienna, Austria.
J Chem Inf Model. 2025 Sep 15. doi: 10.1021/acs.jcim.5c00475.
Today, machine learning models are employed extensively to predict the physicochemical and biological properties of molecules. Their performance is typically evaluated on in-distribution (ID) data, i.e., data originating from the same distribution as the training data. However, real-world applications of such models often involve molecules that are more distant from the training data, necessitating the assessment of their performance on out-of-distribution (OOD) data. In this work, we investigate and evaluate the performance of 14 machine learning models, including classical approaches such as random forests as well as graph neural network (GNN) methods such as message-passing neural networks, across eight data sets using ten splitting strategies for OOD data generation. First, we investigate what constitutes OOD data in the molecular domain for bioactivity and ADMET prediction tasks. Contrary to the common view, we show that both classical machine learning and GNN models perform well (not substantially worse than under random splitting) on data split by Bemis-Murcko scaffolds. Splitting based on chemical similarity clustering (UMAP-based clustering of ECFP4 fingerprints) poses the most challenging task for both types of models. Second, we investigate the extent to which ID and OOD performance exhibit a positive linear relationship. If a strong positive correlation holds, the model performing best on ID data can be selected with the expectation that it will also perform best on OOD data. We show that the strength of this linear relationship depends strongly on how the OOD data are generated, i.e., on which splitting strategy is used. While the correlation between ID and OOD performance for scaffold splitting is strong (Pearson's r ∼ 0.9), this correlation decreases significantly for all cluster-based splitting strategies (Pearson's r ∼ 0.4).
Therefore, the relationship can be more nuanced, and a strong positive correlation is not guaranteed for all OOD scenarios. These findings suggest that OOD performance evaluation and model selection should be carefully aligned with the intended application domain.