IR-NMR multimodal computational spectra dataset for 177K patent-extracted organic molecules.

Authors

Zipoli Federico, Alberts Marvin, Laino Teodoro

Affiliations

IBM Research Europe, Säumerstrasse 4, 8803 Rüschlikon, Switzerland.

NCCR Catalysis, Zurich, Switzerland.

Publication information

Sci Data. 2025 Aug 7;12(1):1375. doi: 10.1038/s41597-025-05729-8.

DOI:10.1038/s41597-025-05729-8

PMID:40775416

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12331906/

Abstract

The construction of predictive models in molecular science increasingly relies on large, high-quality datasets. Synthetic data generation is becoming a foundational strategy for advancing model accuracy and enabling fast discovery workflows. To support the development of structure elucidation and spectral property prediction models, we present a comprehensive synthetic dataset of infrared (IR) and nuclear magnetic resonance (NMR) spectra for a diverse ensemble of organic molecules. The data were generated using a hybrid computational approach that integrates molecular dynamics (MD) simulations, density functional theory (DFT) calculations, and machine learning (ML) models. The dataset primarily consists of IR spectra for 177,461 molecules, derived from long-timescale MD simulations with ML-accelerated dipole moment predictions. In addition, it includes a smaller subset of ¹H-NMR and ¹³C-NMR chemical shifts for 1,255 molecules. This unique combination of spectral data offers a valuable resource for benchmarking and validating computational methodologies, developing and enhancing artificial intelligence (AI) models for molecular property prediction, and facilitating the interpretation of experimental spectroscopic results. The dataset is publicly available through Zenodo, encouraging its broad utilization within the scientific community.
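The abstract does not give the details of how IR spectra were obtained from the MD dipole moments. As general background only, a common textbook route is to take the power spectrum of the dipole fluctuations along the trajectory, which by the Wiener-Khinchin theorem corresponds to the Fourier transform of the dipole autocorrelation function. The following is a minimal sketch under that assumption; the function name `ir_spectrum_from_dipoles`, the Hann window, and the synthetic test signal are illustrative choices, not the authors' pipeline, and frequency-dependent prefactors and quantum corrections are omitted.

```python
import numpy as np

C_CM_PER_S = 2.99792458e10  # speed of light in cm/s


def ir_spectrum_from_dipoles(dipoles, dt_fs):
    """Estimate an unnormalized IR line shape from a dipole trajectory.

    dipoles: array of shape (n_steps, 3), one dipole vector per stored MD step.
    dt_fs:   time between stored steps, in femtoseconds.
    Returns (wavenumbers in cm^-1, intensities in arbitrary units).
    Hypothetical helper for illustration; not taken from the paper.
    """
    dipoles = np.asarray(dipoles, dtype=float)
    n_steps = dipoles.shape[0]
    # Remove the mean so the static dipole does not dominate at zero frequency.
    fluct = dipoles - dipoles.mean(axis=0)
    # Taper both ends of the trajectory to reduce spectral leakage.
    window = np.hanning(n_steps)[:, None]
    # Power spectrum of each Cartesian component, summed over x, y, z.
    power = np.abs(np.fft.rfft(fluct * window, axis=0)) ** 2
    intensity = power.sum(axis=1)
    # Convert the frequency axis from Hz to wavenumbers (cm^-1).
    freq_hz = np.fft.rfftfreq(n_steps, d=dt_fs * 1e-15)
    wavenumbers = freq_hz / C_CM_PER_S
    return wavenumbers, intensity


if __name__ == "__main__":
    # Synthetic check: a dipole oscillating at 1700 cm^-1 along x.
    n, dt_fs = 20000, 0.5
    t = np.arange(n) * dt_fs * 1e-15
    mu = np.zeros((n, 3))
    mu[:, 0] = np.cos(2 * np.pi * 1700.0 * C_CM_PER_S * t)
    wn, inten = ir_spectrum_from_dipoles(mu, dt_fs)
    print("peak position (cm^-1):", wn[np.argmax(inten)])
```

The spectral resolution of such an estimate is set by the trajectory length, which is one reason long-timescale simulations matter for IR spectra.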
