Department of Chemistry and Applied Biosciences, RETHINK, ETH Zurich, 8093, Zurich, Switzerland.
Department of Medicinal Chemistry, Boehringer Ingelheim Pharma GmbH & Co. KG, Birkendorfer Straße 65, 88397, Biberach an der Riss, Germany.
Sci Data. 2022 Jun 7;9(1):273. doi: 10.1038/s41597-022-01390-7.
Machine learning approaches in drug discovery, as well as in other areas of the chemical sciences, benefit from curated datasets of physical molecular properties. However, data collections that pair large bioactive molecules with first-principles quantum chemical information are currently lacking. The open-access QMugs (Quantum-Mechanical Properties of Drug-like Molecules) dataset fills this void. The QMugs collection comprises quantum mechanical properties of more than 665k biologically and pharmacologically relevant molecules extracted from the ChEMBL database, totaling ~2M conformers. QMugs contains optimized molecular geometries and thermodynamic data obtained via the semi-empirical method GFN2-xTB. Atomic and molecular properties are provided at both the GFN2-xTB and the density functional theory (DFT, ωB97X-D/def2-SVP) levels of theory. QMugs features molecules of significantly larger size than previously reported collections and includes their respective quantum mechanical wave functions, including DFT density and orbital matrices. This dataset is intended to facilitate the development of models that learn from molecular data at different levels of theory while also providing insight into the corresponding relationships between molecular structure and biological activity.