Chandy Sruthy K, Raghavachari Krishnan
Department of Chemistry, Indiana University, Bloomington, Indiana 47405, United States.
J Chem Theory Comput. 2023 Oct 10;19(19):6632-6642. doi: 10.1021/acs.jctc.3c00563. Epub 2023 Sep 13.
We developed a random forest machine learning (ML) model for predicting the ¹H and ¹³C NMR chemical shifts of nucleic acids. The model is trained entirely to reproduce computed chemical shifts obtained previously for 10 nucleic acids with a Molecules-in-Molecules (MIM) fragment-based density functional theory (DFT) protocol that includes microsolvation effects. It combines structural descriptors with electronic descriptors from an inexpensive low-level semiempirical calculation (GFN2-xTB) and is trained on a relatively small number of DFT chemical shifts (2080 ¹H chemical shifts and 1780 ¹³C chemical shifts across the 10 nucleic acids). The ML model is then used to predict chemical shifts for 8 new nucleic acids, ranging in size from 600 to 900 atoms, and the predictions are compared directly to experimental data. Although no experimental data were used in training, the model performs very well on the test set (mean absolute deviation of 0.34 ppm for ¹H chemical shifts and 2.52 ppm for ¹³C chemical shifts), even though the test set includes some nonstandard structures. A simple analysis suggests that both structural and electronic descriptors are critical for achieving reliable predictions. This is the first attempt to combine ML with fragment-based DFT calculations to predict experimental chemical shifts accurately, making the MIM-ML model a valuable tool for NMR predictions of nucleic acids.
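The workflow described in the abstract (fit a random forest on combined structural and electronic descriptors, then score predictions by mean absolute deviation) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the data are synthetic, the descriptor columns and their counts are hypothetical stand-ins, the hyperparameters are arbitrary, and the use of scikit-learn is a choice made here rather than something the abstract specifies. It is not the authors' implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-ins (NOT data from the paper): one row per nucleus.
n_train, n_test = 200, 50
n_total = n_train + n_test
structural = rng.normal(size=(n_total, 3))  # hypothetical geometric descriptors
electronic = rng.normal(size=(n_total, 2))  # hypothetical semiempirical descriptors

# Combine both descriptor groups into a single feature matrix.
X = np.hstack([structural, electronic])

# Toy "reference shift" (ppm) that depends on both descriptor groups plus noise.
y = (5.0 + 0.8 * structural[:, 0] - 0.5 * electronic[:, 1]
     + rng.normal(scale=0.05, size=n_total))

X_train, X_test = X[:n_train], X[n_train:]
y_train, y_test = y[:n_train], y[n_train:]

# Hyperparameters are arbitrary illustrative choices.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_test)


def mean_absolute_deviation(predicted, reference):
    """Mean of |predicted - reference|, in the units of the inputs (ppm here)."""
    diff = np.asarray(predicted, dtype=float) - np.asarray(reference, dtype=float)
    return float(np.mean(np.abs(diff)))


mad = mean_absolute_deviation(pred, y_test)
print(f"MAD on held-out toy set: {mad:.3f} ppm")
```

In the study itself, the training targets are MIM-DFT computed shifts and the test-set reference values are experimental shifts; in this sketch both roles are played by the same synthetic function, so the printed number says nothing about real accuracy.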