Analysis of uncertainty of neural fingerprint-based models.

Author information

Feldmann Christian W, Sieg Jochen, Mathea Miriam

Affiliation

BASF SE, Ludwigshafen, Germany.

Publication information

Faraday Discuss. 2025 Jan 14;256(0):551-567. doi: 10.1039/d4fd00095a.

DOI:10.1039/d4fd00095a

PMID:39320108

Abstract

Machine learning has gained popularity for predicting molecular properties based on molecular structure. This study explores the uncertainty estimates of neural fingerprint-based models by comparing pure graph neural networks (GNN) to classical machine learning algorithms combined with neural fingerprints. We investigate the advantage of extracting the neural fingerprint from the GNN and integrating it into a method known for producing better-calibrated probability estimates. Comparisons are made using three classical machine learning methods and the Chemprop model, considering different molecular representations and calibration techniques. We utilize 19 datasets from Toxcast, reflecting real-world scenarios with balanced accuracies ranging from 0.6 to 0.8. Results demonstrate that neural fingerprints combined with classical machine learning methods exhibit a slight decrease in prediction performance compared to the native Chemprop model. However, these models provide significantly improved uncertainty estimates. Notably, uncertainty estimates of neural fingerprint-based methods remain relatively robust for molecules dissimilar to the training set. This suggests that methods like random forest with neural fingerprints can deliver strong prediction performance and reliable uncertainty estimates. When considering both performance and uncertainty, the calibrated Chemprop model and the combination of neural fingerprints with random forest or support vector classifier (SVC) yield comparable results. Surprisingly, the SVC method shows promising performance when combined with neural or count fingerprints. These findings are particularly relevant in real-world industrial projects where accurate predictions and reliable uncertainty estimates are crucial.
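The idea of feeding a learned fingerprint into a classical classifier and then calibrating its probabilities can be illustrated with a minimal sketch. This is not the authors' pipeline: the features here are synthetic stand-ins for neural fingerprints (in the study they are extracted from a Chemprop GNN), and the choice of Platt scaling (`method="sigmoid"`), the Brier score as a calibration measure, and the helper name `evaluate_calibrated_forest` are all assumptions made for illustration, since the abstract does not name the calibration techniques or metrics used.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, brier_score_loss
from sklearn.model_selection import train_test_split


def evaluate_calibrated_forest(features, labels, seed=0):
    """Fit a calibrated random forest and report performance and calibration.

    Hypothetical helper: `features` stands in for a neural fingerprint matrix
    (one row per molecule), `labels` for binary assay outcomes.
    """
    x_train, x_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.25, stratify=labels, random_state=seed
    )
    forest = RandomForestClassifier(n_estimators=100, random_state=seed)
    # Assumption: Platt scaling with 5-fold cross-validation on the training set.
    model = CalibratedClassifierCV(forest, method="sigmoid", cv=5)
    model.fit(x_train, y_train)

    prob_active = model.predict_proba(x_test)[:, 1]
    predicted = (prob_active >= 0.5).astype(int)
    return {
        # Prediction performance, robust to class imbalance.
        "balanced_accuracy": balanced_accuracy_score(y_test, predicted),
        # Mean squared error of the probabilities; lower means better calibrated.
        "brier_score": brier_score_loss(y_test, prob_active),
    }


# Synthetic, imbalanced data used only to make the sketch runnable.
features, labels = make_classification(
    n_samples=400, n_features=32, n_informative=8,
    weights=[0.7, 0.3], random_state=0,
)
metrics = evaluate_calibrated_forest(features, labels)
print(metrics)
```

Reporting a performance metric and a calibration metric side by side mirrors the abstract's point that the two should be judged together; with real data, the synthetic matrix would be replaced by fingerprints exported from the trained GNN.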
