The QDπ dataset, training data for drug-like molecules and biopolymer fragments and their interactions.

Authors

Zeng Jinzhe, Giese Timothy J, Götz Andreas W, York Darrin M

Affiliations

Laboratory for Biomolecular Simulation Research, Institute for Quantitative Biomedicine, and Department of Chemistry and Chemical Biology, Rutgers University, Piscataway, NJ, 08854-8087, USA.

San Diego Supercomputer Center, University of California San Diego, La Jolla, CA, 92093, USA.

Publication information

Sci Data. 2025 Apr 25;12(1):693. doi: 10.1038/s41597-025-04972-3.

Abstract

The development of universal machine learning potentials (MLPs) for small organic and drug-like molecules requires large, accurate datasets that span diverse chemical spaces. In this study, we introduce the QDπ dataset, which incorporates data taken from several existing datasets. To construct the QDπ dataset, we use a query-by-committee active learning strategy to extract data from large source datasets in a way that maximizes diversity and avoids redundancy, as relevant for neural network training. The QDπ dataset requires only 1.6 million structures to express the chemical diversity of 13 elements from the various source datasets at the ωB97M-D3(BJ)/def2-TZVPPD level of theory. The QDπ dataset enables the creation of flexible target loss functions for neural network training relevant to drug discovery, including information-dense datasets of relative conformational energies and barriers, intermolecular interactions, tautomers, and relative protonation energies of drug-like compounds and biomolecular fragments. We hope that the high chemical information density and diversity contained in the QDπ dataset will provide a valuable resource for the development of new universal MLPs for drug discovery.
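
The abstract names a query-by-committee strategy but does not spell out the selection criterion. As a rough illustration only, the sketch below shows the general idea: several models (the "committee") predict a value for each candidate, and candidates on which the models disagree strongly are kept as informative, while candidates on which they agree are treated as redundant. The function name `select_by_committee`, the toy scalar "models", the use of the population standard deviation as the disagreement measure, and the threshold value are all assumptions made for this example; they are not taken from the paper.

```python
from statistics import pstdev
from typing import Callable, List, Sequence


def select_by_committee(
    candidates: Sequence[float],
    committee: Sequence[Callable[[float], float]],
    threshold: float,
) -> List[float]:
    """Keep candidates whose committee predictions disagree by more than `threshold`.

    Disagreement is measured here as the population standard deviation of the
    committee's predictions (an assumption for this sketch).
    """
    selected = []
    for x in candidates:
        predictions = [model(x) for model in committee]
        disagreement = pstdev(predictions)
        if disagreement > threshold:
            selected.append(x)
    return selected


# Hypothetical toy committee: three "models" that agree near x = 0
# and diverge as |x| grows. Real committees would be trained networks.
committee = [
    lambda x: 1.0 * x,
    lambda x: 1.1 * x,
    lambda x: 0.9 * x,
]

candidates = [0.0, 0.5, 5.0, 10.0]
picked = select_by_committee(candidates, committee, threshold=0.2)
print(picked)
```

In this toy setup the committee agrees closely on the small inputs, so only the larger inputs pass the threshold. In an active learning loop, the selected candidates would be labeled and added to the training set, the committee retrained, and the selection repeated.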

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3169/12032357/2a7aba993523/41597_2025_4972_Fig1_HTML.jpg
