Liang Jiechun, Ling Jack, Xu Limin, Zhu Xi
School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), 2001 Longxiang Road, Longgang District, Shenzhen, Guangdong, 518172, China.
BASIS International School Shenzhen, No. 198, Yanshan Road, Nanshan District, Shenzhen, Guangdong, China.
Sci Data. 2025 Jun 4;12(1):939. doi: 10.1038/s41597-025-05289-x.
Raman spectroscopy and infrared (IR) spectroscopy are two important tools for determining the structure and bonding properties of molecules. With the development of deep learning methods in materials science, there is a growing demand for larger and more diverse quantum chemistry data, including spectral information. However, many spectra are still missing from current datasets. To address this problem, we used Gaussian 09 to construct a dataset of Raman and IR spectra. To date, a total of 220,000 molecules have been extracted from ChEMBL; the number of molecules is increasing, and new data are uploaded regularly. The dataset comprises optimized geometries, vibrational frequencies, IR and Raman intensities, and energies, expanding both the breadth and depth of existing quantum chemistry collections. By providing high-fidelity, multidimensional feature sets, this resource enables the training and benchmarking of next-generation models for tasks including inferring substructures from spectroscopic fingerprints, assembling molecular structures from spectra, and predicting Raman or IR spectra for novel molecules.
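As an illustration of how the vibrational frequencies and IR or Raman intensities listed in the abstract might be turned into a continuous spectrum for model training, the following is a minimal sketch. It is not the authors' pipeline: the Lorentzian line shape, the 10 cm^-1 full width at half maximum, the function name `lorentzian_broaden`, and the numerical values are all assumptions chosen for illustration, and the dataset's actual file format is not described here.

```python
import math

def lorentzian_broaden(frequencies, intensities, grid, fwhm=10.0):
    """Sum one area-normalized Lorentzian per vibrational mode on a wavenumber grid.

    frequencies: mode positions in cm^-1
    intensities: one IR or Raman intensity per mode
    grid:        wavenumbers (cm^-1) at which to evaluate the spectrum
    fwhm:        assumed full width at half maximum in cm^-1
    """
    gamma = fwhm / 2.0  # half width at half maximum
    spectrum = []
    for x in grid:
        total = 0.0
        for freq, inten in zip(frequencies, intensities):
            total += inten * (gamma / math.pi) / ((x - freq) ** 2 + gamma ** 2)
        spectrum.append(total)
    return spectrum

# Made-up stick spectrum, for illustration only (not taken from the dataset)
freqs = [1000.0, 1600.0, 2900.0]
ir_intensities = [50.0, 120.0, 30.0]

# Evaluate on a 1 cm^-1 grid from 500 to 3500 cm^-1
grid = [float(x) for x in range(500, 3501)]
spectrum = lorentzian_broaden(freqs, ir_intensities, grid)

# Location of the strongest band on the grid
peak_position = grid[spectrum.index(max(spectrum))]
```

A Gaussian or Voigt profile could be swapped in for the Lorentzian, and the broadening width is a free parameter that would normally be tuned to the resolution of the spectra being modeled.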