Ghosh Kunal, Todorović Milica, Vehtari Aki, Rinke Patrick
Department of Applied Physics, Aalto University, P.O. Box 11000, FI-00076 Aalto, Finland.
Department of Computer Science, Aalto University, P.O. Box 15400, FI-00076 Aalto, Finland.
J Chem Phys. 2025 Jan 7;162(1). doi: 10.1063/5.0229834.
Active learning (AL) has shown promise as a particularly data-efficient machine learning approach. Yet its performance depends on the application, and it is not clear when AL practitioners can expect computational savings. Here, we carry out a systematic AL performance assessment for three diverse molecular datasets and two common scientific tasks: compiling compact, informative datasets and targeted molecular searches. We implemented AL with Gaussian processes (GPs) and used the many-body tensor as the molecular representation. For the first task, we tested different data acquisition strategies, batch sizes, and GP noise settings. AL was insensitive to the acquisition batch size, and the best AL performance came from the acquisition strategy that combines uncertainty reduction with clustering to promote diversity. However, for optimal GP noise settings, AL did not outperform randomized selection of data points. Conversely, for targeted searches, AL outperformed random sampling and achieved data savings of up to 64%. Our analysis provides insight into this task-specific performance difference in terms of target distributions and data collection strategies. We established that the performance of AL depends on how the target molecules are distributed relative to the total dataset, with the largest computational savings achieved when the overlap between the two distributions is minimal.
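To make the first task concrete, the following is a minimal sketch of a batch AL loop driven by GP predictive variance. It is not the authors' implementation. The one-dimensional toy pool stands in for many-body tensor descriptors, and the RBF kernel, the fixed length scale, and the names `rbf_kernel`, `gp_predict`, and `active_learning` are all assumptions made for illustration. The sketch picks the highest-variance candidates in each round and omits the clustering step that the abstract reports as the best-performing strategy.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between the rows of a and the rows of b."""
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-0.5 * np.maximum(sq_dist, 0.0) / length_scale**2)


def gp_predict(x_train, y_train, x_query, noise=1e-2, length_scale=1.0):
    """Zero-mean GP posterior mean and variance at x_query."""
    k_tt = rbf_kernel(x_train, x_train, length_scale)
    k_tt += noise * np.eye(len(x_train))  # GP noise setting (assumed value)
    k_tq = rbf_kernel(x_train, x_query, length_scale)
    chol = np.linalg.cholesky(k_tt)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    mean = k_tq.T @ alpha
    v = np.linalg.solve(chol, k_tq)
    var = 1.0 - np.sum(v**2, axis=0)  # the RBF prior variance is 1
    return mean, np.maximum(var, 0.0)


def active_learning(x_pool, y_pool, n_init=5, batch_size=5, n_rounds=6,
                    noise=1e-2, seed=0):
    """Return pool indices labelled by uncertainty-based batch acquisition."""
    rng = np.random.default_rng(seed)
    labelled = rng.choice(len(x_pool), size=n_init, replace=False).tolist()
    for _ in range(n_rounds):
        unlabelled = np.setdiff1d(np.arange(len(x_pool)), labelled)
        _, var = gp_predict(
            x_pool[labelled], y_pool[labelled], x_pool[unlabelled], noise
        )
        # Acquire the batch_size candidates with the largest predictive variance.
        batch = unlabelled[np.argsort(var)[-batch_size:]]
        labelled.extend(batch.tolist())
    return labelled


# Toy pool (hypothetical data): 200 points of a smooth 1D function.
x_pool = np.linspace(-3.0, 3.0, 200)[:, None]
y_pool = np.sin(2.0 * x_pool[:, 0])

labelled = active_learning(x_pool, y_pool)
mean, var = gp_predict(x_pool[labelled], y_pool[labelled], x_pool)
print("labelled points:", len(labelled))
print("pool MAE:", float(np.mean(np.abs(mean - y_pool))))
```

A random-sampling baseline only needs the variance-based batch replaced by `rng.choice(unlabelled, size=batch_size, replace=False)`. Comparing the two error curves at several values of `noise` mirrors, in miniature, the comparison the abstract describes.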