Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States.
J Chem Inf Model. 2023 Aug 14;63(15):4574-4588. doi: 10.1021/acs.jcim.3c00546. Epub 2023 Jul 24.
Knowledge of critical properties, such as critical temperature, pressure, and density, as well as the acentric factor, is essential for calculating thermophysical properties of chemical compounds. Experiments to determine critical properties and acentric factors are expensive and time-intensive; therefore, we developed a machine learning (ML) model that predicts these molecular properties from the SMILES representation of a chemical species. We explored the directed message passing neural network (D-MPNN) and the graph attention network as ML architecture choices. Additionally, we investigated featurization with additional atomic and molecular features, multitask training, and pretraining on estimated data to optimize model performance. Our final model uses a D-MPNN layer to learn the molecular representation, supplemented by Abraham parameters. A multitask training scheme was used to train a single model to predict all the critical properties and acentric factors along with boiling point, melting point, enthalpy of vaporization, and enthalpy of fusion. The model was evaluated on both random and scaffold splits, where it shows state-of-the-art accuracy. The extensive data set of critical properties and acentric factors contains 1144 chemical compounds and is made available in the public domain together with the source code, which can be used for further exploration.
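To illustrate why critical properties and the acentric factor matter for thermophysical calculations, the following minimal sketch shows how a critical temperature, critical pressure, and acentric factor feed into the Peng-Robinson cubic equation of state. This example is not part of the published work. The function names are hypothetical, and the numerical inputs are approximate literature values for propane rather than values from the paper's data set or model.

```python
import math

R = 8.314462618  # universal gas constant, J/(mol K)


def acentric_factor(p_sat_at_tr07, p_c):
    """Pitzer definition: omega = -log10(Psat / Pc) - 1, with Psat taken at T/Tc = 0.7."""
    return -math.log10(p_sat_at_tr07 / p_c) - 1.0


def peng_robinson_parameters(temperature, t_c, p_c, omega):
    """Return (a * alpha, b) of the Peng-Robinson equation of state in SI units."""
    kappa = 0.37464 + 1.54226 * omega - 0.26992 * omega ** 2
    alpha = (1.0 + kappa * (1.0 - math.sqrt(temperature / t_c))) ** 2
    a = 0.45724 * R ** 2 * t_c ** 2 / p_c
    b = 0.07780 * R * t_c / p_c
    return a * alpha, b


def peng_robinson_pressure(temperature, molar_volume, t_c, p_c, omega):
    """Pressure in Pa from temperature (K) and molar volume (m^3/mol)."""
    a_alpha, b = peng_robinson_parameters(temperature, t_c, p_c, omega)
    repulsive = R * temperature / (molar_volume - b)
    attractive = a_alpha / (molar_volume ** 2 + 2.0 * b * molar_volume - b ** 2)
    return repulsive - attractive


if __name__ == "__main__":
    # Assumed inputs: approximate literature values for propane.
    t_c, p_c, omega = 369.8, 4.25e6, 0.152
    pressure = peng_robinson_pressure(300.0, 2.0e-3, t_c, p_c, omega)
    print(f"Peng-Robinson pressure: {pressure:.0f} Pa")
```

In a workflow built around a property-prediction model, predicted values of the critical temperature, critical pressure, and acentric factor would take the place of the assumed inputs above.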