Data Science and Modelling, Pharmaceutical Sciences, R & D, AstraZeneca, Gothenburg, Sweden.
Phys Chem Chem Phys. 2022 May 4;24(17):10599-10610. doi: 10.1039/d2cp01165d.
We present an open-source framework that enables the efficient and robust calculation of quantum mechanical features for atoms and molecules. For a benchmark set of 49 experimental molecular polarizabilities, the predictive power of the presented method is competitive with second-order perturbation theory in a converged atomic-orbital basis set, at a fraction of the computational cost. The calculation of isotropic molecular polarizabilities is robust for a data set of more than 80 000 molecules. We further present a generally applicable van der Waals radius model that is rooted in atomic static polarizabilities. Efficiency tests show that such radii can be calculated even for small- to medium-sized proteins; the largest system tested (the SARS-CoV-2 spike protein) has 42 539 atoms. Following the work of Domingo-Alemenara [Domingo-Alemenara, 2019, 5811], we present computational predictions of retention times for different chromatographic methods and describe how physicochemical features improve the predictive power of machine-learning models that otherwise rely only on two-dimensional features such as molecular fingerprints. Additionally, we developed an internal benchmark set of experimental supercritical fluid chromatography retention times. For those methods, improvements of up to 10.6% are obtained when combining molecular fingerprints with physicochemical descriptors. Shapley additive explanation values further show that the physical nature of the applied features can be retained within the final machine-learning models. We generally recommend the framework as a robust, low-cost, and physically motivated featurizer for upcoming state-of-the-art machine-learning studies.
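The idea of combining molecular fingerprints with physicochemical descriptors can be illustrated with a minimal sketch. It is not the pipeline used in the paper: the data are synthetic, all names (`fingerprints`, `descriptors`, `retention`, `fit_ridge`) are hypothetical, and a closed-form ridge regression stands in for whatever machine-learning model the authors applied. The sketch only shows the mechanics of appending continuous descriptor columns to a binary fingerprint matrix and comparing held-out errors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT from the paper): 200 "molecules" with
# 64-bit binary fingerprints and 3 continuous physicochemical descriptors.
n_molecules, n_bits, n_desc = 200, 64, 3
fingerprints = rng.integers(0, 2, size=(n_molecules, n_bits)).astype(float)
descriptors = rng.normal(size=(n_molecules, n_desc))

# Synthetic "retention time" that depends on both feature types (assumption).
retention = (
    0.1 * fingerprints @ rng.normal(size=n_bits)
    + descriptors @ np.array([1.0, -0.5, 0.25])
    + rng.normal(scale=0.05, size=n_molecules)
)

# Combined feature matrix: descriptor columns appended to the fingerprint bits.
combined = np.hstack([fingerprints, descriptors])


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression; the intercept is handled by centering."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    weights = np.linalg.solve(
        Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ (y - y_mean)
    )
    intercept = y_mean - x_mean @ weights
    return weights, intercept


def held_out_mae(X, y, n_train=150):
    """Fit on the first n_train rows, report mean absolute error on the rest."""
    weights, intercept = fit_ridge(X[:n_train], y[:n_train])
    predictions = X[n_train:] @ weights + intercept
    return float(np.mean(np.abs(predictions - y[n_train:])))


mae_fingerprints = held_out_mae(fingerprints, retention)
mae_combined = held_out_mae(combined, retention)
print(f"MAE, fingerprints only: {mae_fingerprints:.3f}")
print(f"MAE, fingerprints + descriptors: {mae_combined:.3f}")
```

Because the synthetic target is built to depend on the descriptors, the combined model is expected to do better here by construction; the example says nothing about the size of the effect on real chromatographic data.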