Laboratory of Physical Chemistry, ETH Zürich, Vladimir-Prelog-Weg 2, 8093 Zürich, Switzerland.
J Chem Inf Model. 2017 Apr 24;57(4):726-741. doi: 10.1021/acs.jcim.6b00778. Epub 2017 Apr 12.
While the use of machine-learning (ML) techniques is well established in cheminformatics for the prediction of physicochemical properties and binding affinities, the training of ML models on data from molecular dynamics (MD) simulations remains largely unexplored. Here, we present a fingerprint termed MDFP, which is constructed from the distributions of properties such as potential-energy components, radius of gyration, and solvent-accessible surface area extracted from MD simulations. The corresponding fingerprint elements are the first two statistical moments of each distribution and its median. Because the fingerprint includes not only the average but also the spread of each distribution, it encodes some degree of entropic information. Short MD simulations of the molecules in water (and in vacuum) are used to generate MDFP. These fingerprints are further combined with simple counts based on the 2D structure of the molecules to give MDFP+. The resulting information-rich MDFP+ is used to train ML models for the prediction of solvation free energies in five different solvents (water, octanol, chloroform, hexadecane, and cyclohexane) as well as partition coefficients in octanol/water, hexadecane/water, and cyclohexane/water. The approach is easy to implement and computationally relatively inexpensive. Nevertheless, its performance is comparable to that of more rigorous MD-based free-energy methods such as free-energy perturbation (FEP), as well as end-state methods such as linear interaction energy (LIE), the conductor-like screening model for realistic solvation (COSMO-RS), and the SMx family of solvation models.
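The construction of the fingerprint described in the abstract can be illustrated with a minimal sketch. It assumes that each property has already been extracted from an MD trajectory as a per-frame series of numbers, and it takes the standard deviation as the measure of spread; the function names, the property names, the counts, and all numerical values are hypothetical and serve only to show the layout of the vector, not the authors' implementation.

```python
from statistics import mean, median, pstdev


def mdfp_elements(series):
    """Return [mean, standard deviation, median] for one property time series."""
    values = list(series)
    if not values:
        raise ValueError("empty property series")
    # Assumption: the population standard deviation stands in for the
    # second moment (spread) of the distribution.
    return [mean(values), pstdev(values), median(values)]


def build_mdfp(properties, counts=()):
    """Concatenate per-property statistics into one flat vector.

    Passing 2D-structure counts prepends them, giving an MDFP+-style vector.
    """
    fingerprint = [float(c) for c in counts]
    # Iterate in sorted order so vectors are comparable across molecules.
    for name in sorted(properties):
        fingerprint.extend(mdfp_elements(properties[name]))
    return fingerprint


# Hypothetical per-frame values, invented for illustration only.
trajectory = {
    "radius_of_gyration": [2.0, 2.2, 2.4, 2.2, 2.2],
    "sasa": [300.0, 310.0, 305.0, 295.0, 290.0],
}
# Hypothetical 2D counts (for example heavy atoms and rotatable bonds).
fp = build_mdfp(trajectory, counts=[12, 3])
print(len(fp), fp)
```

Each property contributes three elements, so the length of the vector is the number of counts plus three times the number of properties. Vectors built this way for a set of molecules could then serve as the input features of any standard regression model.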