Abranches Dinis O, Maginn Edward J, Colón Yamil J
Department of Chemical and Biomolecular Engineering, University of Notre Dame, Notre Dame, IN 46556.
Proc Natl Acad Sci U S A. 2024 Jul 30;121(31):e2404676121. doi: 10.1073/pnas.2404676121. Epub 2024 Jul 23.
This work establishes a different paradigm for digital molecular spaces and their efficient navigation by exploiting sigma profiles. To that end, Gaussian processes (GPs), a type of stochastic machine learning model, are shown to correlate and predict physicochemical properties from sigma profiles remarkably well, outperforming previously published state-of-the-art neural networks. The amount of chemical information encoded in sigma profiles eases the learning burden of machine learning models, permitting GPs to be trained on small datasets. Because of their negligible computational cost and ease of implementation, GPs are ideal models to combine with optimization tools such as gradient search or Bayesian optimization (BO). Gradient search is used to efficiently navigate the sigma profile digital space, quickly converging to local extrema of target physicochemical properties. Although gradient search requires GP models pretrained on existing datasets, BO removes this limitation and can find global extrema within a limited number of iterations. A remarkable example is the use of BO to optimize boiling temperature. Holding no knowledge of chemistry except for the sigma profile and boiling temperature of carbon monoxide (the worst possible initial guess), BO finds the global maximum of the available boiling temperature dataset (over 1,000 molecules encompassing more than 40 families of organic and inorganic compounds) in just 15 iterations (i.e., 15 property measurements). This result cements sigma profiles as a powerful digital chemical space for molecular optimization and discovery, particularly when little to no experimental data are initially available.
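The BO procedure described in the abstract (start from a single measured molecule, fit a GP, pick the next molecule to measure, repeat) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a squared-exponential kernel with fixed hyperparameters and an expected-improvement acquisition function, neither of which is specified in the abstract. The three-component random vectors stand in for sigma profiles, the quadratic toy property stands in for boiling temperature, and all function names (rbf, gp_fit, gp_predict, expected_improvement, bayes_opt) are hypothetical.

```python
import math
import random

def rbf(a, b, length):
    """Squared-exponential kernel between two descriptor vectors."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-0.5 * d2 / length ** 2)

def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def solve_lower(L, b):
    """Solve L y = b by forward substitution."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y

def solve_upper_t(L, y):
    """Solve L^T x = y by back substitution."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_fit(X, y, length=0.5, jitter=1e-4):
    """Fit a GP with a constant mean (the sample mean) and fixed hyperparameters (assumed)."""
    mu0 = sum(y) / len(y)
    K = [[rbf(a, b, length) + (jitter if i == j else 0.0)
          for j, b in enumerate(X)] for i, a in enumerate(X)]
    L = cholesky(K)
    alpha = solve_upper_t(L, solve_lower(L, [v - mu0 for v in y]))
    return L, alpha, mu0

def gp_predict(X, model, x_new, length=0.5):
    """Posterior mean and standard deviation at a new descriptor vector."""
    L, alpha, mu0 = model
    k = [rbf(a, x_new, length) for a in X]
    v = solve_lower(L, k)
    mean = mu0 + sum(ki * ai for ki, ai in zip(k, alpha))
    var = max(1.0 - sum(vi * vi for vi in v), 1e-12)  # clamp rounding error
    return mean, math.sqrt(var)

def expected_improvement(mean, sd, best):
    """Expected improvement over the best measured value (maximization)."""
    z = (mean - best) / sd
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (mean - best) * cdf + sd * pdf

def bayes_opt(pool, measure, start, n_iter):
    """BO over a finite candidate pool; measure(i) returns the property of candidate i."""
    measured, values = [start], [measure(start)]
    for _ in range(n_iter):
        remaining = [i for i in range(len(pool)) if i not in measured]
        if not remaining:
            break
        X = [pool[i] for i in measured]
        model = gp_fit(X, values)
        best = max(values)
        nxt = max(remaining, key=lambda i: expected_improvement(
            *gp_predict(X, model, pool[i]), best))
        measured.append(nxt)
        values.append(measure(nxt))
    return measured, values

# Toy data (assumed): random 3-component descriptors and a synthetic property.
rng = random.Random(0)
pool = [[rng.random() for _ in range(3)] for _ in range(30)]
truth = [-sum((x - 0.7) ** 2 for x in p) for p in pool]
# Start from the worst candidate, mirroring the carbon monoxide starting point.
start = min(range(len(pool)), key=truth.__getitem__)
measured, values = bayes_opt(pool, truth.__getitem__, start, n_iter=10)
print("best found:", max(values), "global maximum:", max(truth))
```

Each iteration costs one "measurement" (one call to measure), so n_iter plays the role of the 15 property measurements mentioned in the abstract. In practice the kernel hyperparameters would be fitted to the data rather than fixed, and the property values would typically be standardized before fitting.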