Tan Xiaofeng
X Scientific, 1 Bramble Way, Acton, MA, 01720, USA.
J Cheminform. 2025 Jul 12;17(1):103. doi: 10.1186/s13321-025-01016-1.
For over half a century, computer-aided structure elucidation (CASE) systems for organic compounds have relied on complex expert systems with explicitly programmed algorithms. These systems are often computationally inefficient for complex compounds because of the vast chemical structure space that must be explored and filtered. In this study, we present a proof-of-concept transformer-based generative chemical language artificial intelligence (AI) model, an end-to-end architecture designed to replace the logic and workflow of the classic CASE framework for ultra-fast and accurate spectroscopy-based structure elucidation. Our model employs an encoder-decoder architecture and self-attention mechanisms, similar to those in large language models, to directly generate the most probable chemical structures that match the input spectroscopic data. Trained on ~102k IR, UV, and ¹H NMR spectra, it elucidates the structures of molecules with up to 29 atoms in just a few seconds on a modern CPU, achieving a top-15 accuracy of 83%. This approach demonstrates the potential of transformer-based generative AI to accelerate traditional scientific problem-solving processes. Because the model can be retrained quickly as new data become available, it also offers a path to rapid advances in structure elucidation.
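The top-15 accuracy quoted above is the fraction of test spectra for which the correct structure appears among the model's 15 highest-ranked candidates. The following is a minimal sketch of that metric, not the paper's evaluation code. It assumes that candidates and reference structures are compared as canonical strings (for example, canonical SMILES); the function name `top_k_accuracy` and the toy data are hypothetical.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_candidates: Sequence[Sequence[str]],
    targets: Sequence[str],
    k: int = 15,
) -> float:
    """Return the fraction of samples whose target is among the first k candidates.

    Assumption: each structure is a canonical string, so string equality
    stands in for structural identity.
    """
    if len(ranked_candidates) != len(targets):
        raise ValueError("need exactly one candidate list per target")
    if not targets:
        raise ValueError("at least one sample is required")
    hits = sum(
        1
        for candidates, target in zip(ranked_candidates, targets)
        if target in candidates[:k]
    )
    return hits / len(targets)


# Toy data (hypothetical): three samples, candidates ordered best-first.
predictions = [["CCO", "CO"], ["CC", "CCC"], ["C"]]
references = ["CO", "CCCC", "C"]

top1 = top_k_accuracy(predictions, references, k=1)
top2 = top_k_accuracy(predictions, references, k=2)
```

In this toy case only the third sample is correct at rank 1, while the first sample is also recovered at rank 2, so the score rises with k. This is why a top-15 figure is more permissive than a top-1 figure.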