Zhu Yifei, Li Mengge, Xu Chao, Lan Zhenggang
SCNU Environmental Research Institute, Guangdong Provincial Key Laboratory of Chemical Pollution and Environmental Safety, MOE Key Laboratory of Environmental Theoretical Chemistry, South China Normal University, Guangzhou, 510006, P. R. China.
School of Environment, South China Normal University, Guangzhou, 510006, P. R. China.
Sci Data. 2024 Aug 29;11(1):948. doi: 10.1038/s41597-024-03788-x.
Driven by rapid advances in deep learning, the demand for large, high-quality datasets in chemical research has grown significantly. We developed a quantum-chemistry database of 443,106 small organic molecules containing up to 10 heavy atoms (C, N, O, and F). Ground-state geometry optimizations and frequency calculations for all compounds were performed at the B3LYP/6-31G* level with the BJD3 dispersion correction, while excited-state single-point calculations were carried out at the ωB97X-D/6-31G* level. In total, twenty-seven molecular properties, covering geometric, thermodynamic, electronic, and energetic quantities, were collected from these calculations. We also established a comprehensive protocol for constructing high-volume quantum-chemistry datasets. The resulting QCDGE (Quantum Chemistry Dataset with Ground- and Excited-State Properties) dataset is large, chemically diverse, and, most importantly, includes excited-state information. The dataset and its construction protocol are expected to have a significant impact on machine learning applications across different fields of chemistry, especially excited-state research.
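The composition limit stated in the abstract (up to 10 heavy atoms drawn from C, N, O, and F) can be illustrated with a minimal Python sketch. This is not the authors' construction protocol: the function names `parse_formula` and `in_scope`, the use of plain molecular-formula strings as input, and the assumption that hydrogen is the only non-heavy element are all hypothetical choices made for illustration.

```python
import re

# Assumption: allowed heavy elements and size limit as stated in the abstract.
ALLOWED_HEAVY = {"C", "N", "O", "F"}
MAX_HEAVY_ATOMS = 10

# One element symbol (e.g. "C", "Cl") followed by an optional count.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(formula):
    """Parse a flat molecular formula such as 'C6H6O' into element counts.

    Hypothetical helper: parentheses, charges, and isotopes are not handled.
    """
    counts = {}
    pos = 0
    for match in _TOKEN.finditer(formula):
        if match.start() != pos:
            raise ValueError(f"cannot parse formula: {formula!r}")
        pos = match.end()
        element, number = match.group(1), match.group(2)
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    if pos != len(formula) or not counts:
        raise ValueError(f"cannot parse formula: {formula!r}")
    return counts


def in_scope(formula):
    """Return True if the formula satisfies the heavy-atom limit sketched here."""
    counts = parse_formula(formula)
    # Treat every element except hydrogen as a heavy atom.
    heavy = {el: n for el, n in counts.items() if el != "H"}
    n_heavy = sum(heavy.values())
    return set(heavy) <= ALLOWED_HEAVY and 1 <= n_heavy <= MAX_HEAVY_ATOMS


if __name__ == "__main__":
    for f in ["C6H6O", "C10H8", "C11H10", "CH3Cl"]:
        print(f, in_scope(f))
```

Under these assumptions, phenol (`C6H6O`) and naphthalene (`C10H8`) pass the check, while `C11H10` exceeds the heavy-atom limit and `CH3Cl` contains an element outside the allowed set.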