Harvard University, Department of Chemistry and Chemical Biology, Cambridge, MA, 02138, USA.
Massachusetts Institute of Technology, Department of Materials Science and Engineering, Cambridge, MA, 02139, USA.
Sci Data. 2022 Apr 21;9(1):185. doi: 10.1038/s41597-022-01288-4.
Machine learning (ML) outperforms traditional approaches in many molecular design tasks. ML models usually predict molecular properties from a 2D chemical graph or a single 3D structure, but neither of these representations accounts for the ensemble of 3D conformers that are accessible to a molecule. Property prediction could be improved by using conformer ensembles as input, but there is no large-scale dataset that contains graphs annotated with accurate conformers and experimental data. Here we use advanced sampling and semi-empirical density functional theory (DFT) to generate 37 million molecular conformations for over 450,000 molecules. The Geometric Ensemble Of Molecules (GEOM) dataset contains conformers for 133,000 species from QM9, and 317,000 species with experimental data related to biophysics, physiology, and physical chemistry. Ensembles of 1,511 species with BACE-1 inhibition data are also labeled with high-quality DFT free energies in an implicit water solvent, and 534 ensembles are further optimized with DFT. GEOM will assist in the development of models that predict properties from conformer ensembles, and generative models that sample 3D conformations.
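As an illustration of what using a conformer ensemble as input can mean, the sketch below Boltzmann-weights a set of conformers by relative energy and averages a per-conformer property. It is a minimal example under stated assumptions, not the authors' pipeline: the function names, the kcal/mol energy units, the 298.15 K temperature, and the numbers in the usage lines are all hypothetical, and the abstract does not specify how GEOM stores or weights its conformers.

```python
import math

# Gas constant in kcal/(mol*K); energies are assumed to be in kcal/mol.
R_KCAL_PER_MOL_K = 0.0019872041


def boltzmann_weights(energies_kcal, temperature_k=298.15):
    """Return normalized Boltzmann weights for a list of conformer energies."""
    kt = R_KCAL_PER_MOL_K * temperature_k
    # Shift by the minimum energy so the exponentials stay numerically stable.
    e_min = min(energies_kcal)
    factors = [math.exp(-(e - e_min) / kt) for e in energies_kcal]
    total = sum(factors)
    return [f / total for f in factors]


def ensemble_average(values, weights):
    """Weighted average of a per-conformer property."""
    return sum(v * w for v, w in zip(values, weights))


# Hypothetical three-conformer ensemble (illustrative numbers only).
energies = [0.0, 1.0, 2.5]   # relative energies, kcal/mol
dipoles = [1.2, 2.0, 0.7]    # some per-conformer property
weights = boltzmann_weights(energies)
average_dipole = ensemble_average(dipoles, weights)
```

The lowest-energy conformer dominates the average, but higher-energy conformers still contribute. That contribution is the information a single 3D structure leaves out.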