Schilling-Wilhelmi Mara, Ríos-García Martiño, Shabih Sherjeel, Gil María Victoria, Miret Santiago, Koch Christoph T, Márquez José A, Jablonka Kevin Maik
Laboratory of Organic and Macromolecular Chemistry (IOMC), Friedrich Schiller University Jena, Humboldtstrasse 10, 07743 Jena, Germany.
Institute of Carbon Science and Technology (INCAR), CSIC, Francisco Pintado Fe 26, 33011 Oviedo, Spain.
Chem Soc Rev. 2025 Feb 3;54(3):1125-1150. doi: 10.1039/d4cs00913d.
The vast majority of chemical knowledge exists in unstructured natural language, yet structured data is crucial for innovative and systematic materials design. Traditionally, the field has relied on manual curation and partial automation for data extraction for specific use cases. The advent of large language models (LLMs) represents a significant shift, potentially enabling non-experts to extract structured, actionable data from unstructured text efficiently. While applying LLMs to chemical and materials science data extraction presents unique challenges, domain knowledge offers opportunities to guide and validate LLM outputs. This tutorial review provides a comprehensive overview of LLM-based structured data extraction in chemistry, synthesizing current knowledge and outlining future directions. We address the lack of standardized guidelines and present frameworks for leveraging the synergy between LLMs and chemical expertise. This work serves as a foundational resource for researchers aiming to harness LLMs for data-driven chemical research. The insights presented here could significantly enhance how researchers across chemical disciplines access and utilize scientific information, potentially accelerating the development of novel compounds and materials for critical societal needs.