Zeng Jinzhe, Giese Timothy J, Götz Andreas W, York Darrin M
Laboratory for Biomolecular Simulation Research, Institute for Quantitative Biomedicine, and Department of Chemistry and Chemical Biology, Rutgers University, Piscataway, NJ, 08854-8087, USA.
San Diego Supercomputer Center, University of California San Diego, La Jolla, CA, 92093, USA.
Sci Data. 2025 Apr 25;12(1):693. doi: 10.1038/s41597-025-04972-3.
The development of universal machine learning potentials (MLPs) for small organic and drug-like molecules requires large, accurate datasets that span diverse chemical spaces. In this study, we introduce the QDπ dataset, which incorporates data drawn from several existing datasets. To construct it, we use a query-by-committee active learning strategy to extract data from large source datasets, so as to maximize diversity and avoid redundancy in the data used for neural network training. The QDπ dataset requires only 1.6 million structures to express the chemical diversity of 13 elements from the various source datasets at the ωB97M-D3(BJ)/def2-TZVPPD level of theory. It enables the creation of flexible target loss functions for neural network training relevant to drug discovery, and includes information-dense datasets of relative conformational energies and barriers, intermolecular interactions, tautomers, and relative protonation energies of drug-like compounds and biomolecular fragments. We hope that the high chemical information density and diversity of the QDπ dataset will provide a valuable resource for the development of new universal MLPs for drug discovery.
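The query-by-committee selection named in the abstract can be illustrated with a minimal sketch. The abstract does not state which quantity the committee compares, what threshold is used, or how many structures are added per round, so everything below is an assumption for illustration: the function names `committee_disagreement` and `select_candidates`, the use of the standard deviation of predicted energies as the disagreement measure, the `threshold` and `max_new` parameters, and the toy numbers. The idea is that structures on which the committee members already agree are treated as redundant, while structures with large disagreement are selected for labeling.

```python
import numpy as np


def committee_disagreement(predictions: np.ndarray) -> np.ndarray:
    """Per-structure disagreement of a model committee.

    `predictions` has shape (n_models, n_structures); each row holds one
    committee member's predicted value (e.g. an energy) for every candidate.
    Assumption: disagreement is measured as the population standard deviation.
    """
    return predictions.std(axis=0)


def select_candidates(predictions, threshold, max_new=None):
    """Return indices of candidates whose disagreement exceeds `threshold`.

    Indices are ordered from most to least uncertain. `max_new` (hypothetical)
    optionally caps how many structures are selected in one round.
    """
    disagreement = committee_disagreement(np.asarray(predictions, dtype=float))
    picked = np.flatnonzero(disagreement > threshold)
    # Sort the picked indices by descending disagreement.
    picked = picked[np.argsort(-disagreement[picked], kind="stable")]
    if max_new is not None:
        picked = picked[:max_new]
    return picked


# Toy data: 3 committee members, 4 candidate structures (made-up values).
toy_predictions = np.array([
    [1.00, 2.00, 3.00, 4.00],
    [1.01, 2.50, 3.00, 3.00],
    [0.99, 1.50, 3.00, 5.00],
])

selected = select_candidates(toy_predictions, threshold=0.1)
top_one = select_candidates(toy_predictions, threshold=0.1, max_new=1)
```

In this toy case the committee nearly agrees on the first and third structures, so only the second and fourth are selected. In a full active learning loop, the selected structures would be labeled at the reference level of theory, added to the training set, and the committee retrained before the next round.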