Department of Chemistry and Chemical Biology, Harvard University, 12 Oxford Street, Cambridge, Massachusetts 02138, United States.
J Chem Inf Model. 2017 Apr 24;57(4):657-668. doi: 10.1021/acs.jcim.6b00332. Epub 2017 Apr 10.
We propose a multiple descriptor multiple kernel (MultiDK) method for efficient molecular discovery using machine learning. We show that the MultiDK method improves both the speed and accuracy of molecular property prediction. We apply the method to the discovery of electrolyte molecules for aqueous redox flow batteries. Using multiple-type descriptors, as opposed to single-type descriptors, we obtain more relevant features for machine learning. Following the principle of "wisdom of the crowds", the combination of multiple-type descriptors significantly boosts prediction performance. Moreover, by employing multiple kernels (more than one kernel function for a set of the input descriptors), MultiDK exploits nonlinear relations between molecular structure and properties better than a linear regression approach. The multiple kernels consist of a Tanimoto similarity kernel and a linear kernel for a set of binary descriptors and a set of nonbinary descriptors, respectively. Using MultiDK, we achieve an average performance of r = 0.92 with a test set of molecules for solubility prediction. We also extend MultiDK to predict pH-dependent solubility and apply it to a set of quinone molecules with different ionizable functional groups to assess their performance as flow battery electrolytes.
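The pairing of kernels described above (a Tanimoto kernel on binary descriptors, a linear kernel on nonbinary descriptors) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The abstract does not say how the two kernels are combined or which regressor is used, so the unweighted sum and the kernel ridge regression below are assumptions chosen for simplicity. The function names (`tanimoto_kernel`, `multi_kernel`, `fit_kernel_ridge`, `predict`) and the toy arrays are hypothetical placeholders, not data from the paper.

```python
import numpy as np

def tanimoto_kernel(A, B):
    """Tanimoto similarity between rows of binary matrices A (n, d) and B (m, d)."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    inter = A @ B.T                          # number of shared on-bits
    union = A.sum(axis=1)[:, None] + B.sum(axis=1)[None, :] - inter
    safe_union = np.where(union > 0, union, 1.0)
    # Assumed convention: two all-zero fingerprints have similarity 1.
    return np.where(union > 0, inter / safe_union, 1.0)

def linear_kernel(A, B):
    """Plain inner-product kernel for nonbinary (continuous) descriptors."""
    return np.asarray(A, dtype=float) @ np.asarray(B, dtype=float).T

def multi_kernel(bin_a, num_a, bin_b, num_b):
    """Assumed combination: unweighted sum of the two kernels."""
    return tanimoto_kernel(bin_a, bin_b) + linear_kernel(num_a, num_b)

def fit_kernel_ridge(K_train, y, lam=1e-2):
    """Solve (K + lam * I) alpha = y for the dual coefficients."""
    n = K_train.shape[0]
    return np.linalg.solve(K_train + lam * np.eye(n), np.asarray(y, dtype=float))

def predict(K_test_train, alpha):
    """Predict from the kernel between test rows and training rows."""
    return K_test_train @ alpha

# Toy placeholder data (arbitrary values, not from the paper).
X_bin = np.array([[1, 1, 0, 0], [1, 0, 1, 0], [0, 1, 1, 1]])   # binary fingerprints
X_num = np.array([[0.5, 1.2], [0.1, 0.3], [1.0, 0.7]])         # continuous descriptors
y = np.array([-1.0, 0.2, 0.8])                                 # e.g. a property value

K = multi_kernel(X_bin, X_num, X_bin, X_num)
alpha = fit_kernel_ridge(K, y)
y_hat = predict(K, alpha)
```

Summing two positive semidefinite kernels yields another valid kernel, which is why the sum is a convenient default here; a weighted sum or a different regressor would fit the same outline.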