Suppr
超能文献

利用对话式语言模型和提示工程从研究论文中提取准确的材料数据。

Extracting accurate materials data from research papers with conversational language models and prompt engineering.

作者信息

Polak Maciej P, Morgan Dane

机构信息

Department of Materials Science and Engineering, University of Wisconsin-Madison, Madison, WI, 53706-1595, USA.

出版信息

Nat Commun. 2024 Feb 21;15(1):1569. doi: 10.1038/s41467-024-45914-8.

DOI:10.1038/s41467-024-45914-8

PMID:38383556

原文链接:https://pmc.ncbi.nlm.nih.gov/articles/PMC10882009/

Abstract

There has been a growing effort to replace manual extraction of data from research papers with automated data extraction based on natural language processing, language models, and recently, large language models (LLMs). Although these methods enable efficient extraction of data from large sets of research papers, they require a significant amount of up-front effort, expertise, and coding. In this work, we propose the ChatExtract method that can fully automate very accurate data extraction with minimal initial effort and background, using an advanced conversational LLM. ChatExtract consists of a set of engineered prompts applied to a conversational LLM that both identify sentences with data, extract that data, and assure the data's correctness through a series of follow-up questions. These follow-up questions largely overcome known issues with LLMs providing factually inaccurate responses. ChatExtract can be applied with any conversational LLMs and yields very high quality data extraction. In tests on materials data, we find precision and recall both close to 90% from the best conversational LLMs, like GPT-4. We demonstrate that the exceptional performance is enabled by the information retention in a conversational model combined with purposeful redundancy and introducing uncertainty through follow-up prompts. These results suggest that approaches similar to ChatExtract, due to their simplicity, transferability, and accuracy are likely to become powerful tools for data extraction in the near future. Finally, databases for critical cooling rates of metallic glasses and yield strengths of high entropy alloys are developed using ChatExtract.

摘要

人们越来越努力用基于自然语言处理、语言模型，以及最近的大语言模型（LLM）的自动数据提取来取代从研究论文中手动提取数据的方式。尽管这些方法能够从大量研究论文中高效提取数据，但它们需要大量的前期工作、专业知识和编码。在这项工作中，我们提出了ChatExtract方法，该方法可以使用先进的对话式大语言模型，以最少的初始工作量和背景知识，完全自动化地进行非常准确的数据提取。ChatExtract由一组应用于对话式大语言模型的设计好的提示组成，这些提示既能识别包含数据的句子，提取数据，又能通过一系列后续问题确保数据的正确性。这些后续问题在很大程度上克服了大语言模型提供事实不准确回答的已知问题。ChatExtract可以与任何对话式大语言模型一起应用，并能产生非常高质量的数据提取结果。在材料数据测试中，我们发现像GPT - 4这样最好的对话式大语言模型的精确率和召回率都接近90%。我们证明，对话模型中的信息保留、有目的的冗余以及通过后续提示引入不确定性，使得该方法具有卓越的性能。这些结果表明，由于其简单性、可转移性和准确性，类似于ChatExtract的方法在不久的将来可能会成为强大的数据提取工具。最后，使用ChatExtract开发了金属玻璃临界冷却速率和高熵合金屈服强度的数据库。

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bddc/10882009/0f88c35fa029/41467_2024_45914_Fig1_HTML.jpg

相似文献

Extracting accurate materials data from research papers with conversational language models and prompt engineering.

Nat Commun. 2024 Feb 21;15(1):1569. doi: 10.1038/s41467-024-45914-8.

A Systematic Review of ChatGPT and Other Conversational Large Language Models in Healthcare.

medRxiv. 2024 Apr 27:2024.04.26.24306390. doi: 10.1101/2024.04.26.24306390.

An Empirical Evaluation of Prompting Strategies for Large Language Models in Zero-Shot Clinical Natural Language Processing: Algorithm Development and Validation Study.

JMIR Med Inform. 2024 Apr 8;12:e55318. doi: 10.2196/55318.

Quality of Answers of Generative Large Language Models Versus Peer Users for Interpreting Laboratory Test Results for Lay Patients: Evaluation Study.

J Med Internet Res. 2024 Apr 17;26:e56655. doi: 10.2196/56655.

Quality of Answers of Generative Large Language Models vs Peer Patients for Interpreting Lab Test Results for Lay Patients: Evaluation Study.

ArXiv. 2024 Jan 23:arXiv:2402.01693v1.

Large Language Models Can Enable Inductive Thematic Analysis of a Social Media Corpus in a Single Prompt: Human Validation Study.

JMIR Infodemiology. 2024 Aug 29;4:e59641. doi: 10.2196/59641.

Prompt Engineering as an Important Emerging Skill for Medical Professionals: Tutorial.

J Med Internet Res. 2023 Oct 4;25:e50638. doi: 10.2196/50638.

Improving large language models for clinical named entity recognition via prompt engineering.

J Am Med Inform Assoc. 2024 Sep 1;31(9):1812-1820. doi: 10.1093/jamia/ocad259.

Using Large Language Models to Annotate Complex Cases of Social Determinants of Health in Longitudinal Clinical Records.

medRxiv. 2024 Apr 27:2024.04.25.24306380. doi: 10.1101/2024.04.25.24306380.

Learning to Make Rare and Complex Diagnoses With Generative AI Assistance: Qualitative Study of Popular Large Language Models.

JMIR Med Educ. 2024 Feb 13;10:e51391. doi: 10.2196/51391.

引用本文的文献

Steering towards safe self-driving laboratories.

Nat Rev Chem. 2025 Aug 18. doi: 10.1038/s41570-025-00747-x.

Large language model driven transferable key information extraction mechanism for nonstandardized tables.

Sci Rep. 2025 Aug 14;15(1):29802. doi: 10.1038/s41598-025-15627-z.

Autogenerating a Domain-Specific Question-Answering Data Set from a Thermoelectric Materials Database to Enable High-Performing BERT Models.

J Chem Inf Model. 2025 Aug 25;65(16):8579-8592. doi: 10.1021/acs.jcim.5c00840. Epub 2025 Aug 7.

Mapping the harvest area of a comprehensive set of crop types in China from 1990 to 2020 at a 1-km resolution.

Sci Data. 2025 Aug 6;12(1):1371. doi: 10.1038/s41597-025-05723-0.

Using Large Languge Models for Processing Sensor Data.

Sensors (Basel). 2025 Jul 13;25(14):4380. doi: 10.3390/s25144380.

Mechanical performance dataset for alloy with applications at low temperatures.

Sci Data. 2025 Jul 15;12(1):1235. doi: 10.1038/s41597-025-05512-9.

AutoPM3: enhancing variant interpretation via LLM-driven PM3 evidence extraction from scientific literature.

Bioinformatics. 2025 Jul 1;41(7). doi: 10.1093/bioinformatics/btaf382.

Generative artificial intelligence, integrative bioinformatics, and single-cell analysis reveal Alzheimer's genetic and immune landscape.

Mol Ther Nucleic Acids. 2025 Apr 24;36(2):102546. doi: 10.1016/j.omtn.2025.102546. eCollection 2025 Jun 10.

NMRExtractor: leveraging large language models to construct an experimental NMR database from open-source scientific publications.

Chem Sci. 2025 May 28. doi: 10.1039/d4sc08802f.

The Use of Large Language Models to Accelerate Literature Review Towards Digital Health Equity and Inclusiveness.

AMIA Annu Symp Proc. 2025 May 22;2024:493-502. eCollection 2024.

本文引用的文献

A rule-free workflow for the automated generation of databases from scientific literature.

NPJ Comput Mater. 2023;9(1):222. doi: 10.1038/s41524-023-01171-9. Epub 2023 Dec 13.

A thermoelectric materials database auto-generated from the scientific literature using ChemDataExtractor.

Sci Data. 2022 Oct 22;9(1):648. doi: 10.1038/s41597-022-01752-1.

Machine-Learning Rationalization and Prediction of Solid-State Synthesis Conditions.

Chem Mater. 2022 Aug 23;34(16):7323-7336. doi: 10.1021/acs.chemmater.2c01293. Epub 2022 Aug 5.

Perovskite- and Dye-Sensitized Solar-Cell Device Databases Auto-generated Using ChemDataExtractor.

Sci Data. 2022 Jun 17;9(1):329. doi: 10.1038/s41597-022-01355-w.

Dataset of solution-based inorganic materials synthesis procedures extracted from the scientific literature.

Sci Data. 2022 May 25;9(1):231. doi: 10.1038/s41597-022-01317-2.

Reconstructing Chromatic-Dispersion Relations and Predicting Refractive Indices Using Text Mining and Machine Learning.

J Chem Inf Model. 2022 Jun 13;62(11):2670-2684. doi: 10.1021/acs.jcim.2c00253. Epub 2022 May 19.

A database of refractive indices and dielectric constants auto-generated using ChemDataExtractor.

Sci Data. 2022 May 3;9(1):192. doi: 10.1038/s41597-022-01295-5.

Auto-generated database of semiconductor band gaps using ChemDataExtractor.

Sci Data. 2022 May 3;9(1):193. doi: 10.1038/s41597-022-01294-6.

ChemDataExtractor 2.0: Autopopulated Ontologies for Materials Science.

J Chem Inf Model. 2021 Sep 27;61(9):4280-4289. doi: 10.1021/acs.jcim.1c00446. Epub 2021 Sep 16.

Opportunities and challenges of text mining in aterials research.

iScience. 2021 Feb 6;24(3):102155. doi: 10.1016/j.isci.2021.102155. eCollection 2021 Mar 19.

文献AI研究员

20分钟写一篇综述，助力文献阅读效率提升50倍。

立即体验

用中文搜PubMed

大模型驱动的PubMed中文搜索引擎

马上搜索

文档翻译

学术文献翻译模型，支持多种主流文档格式。

立即体验

Suppr超能文献

利用对话式语言模型和提示工程从研究论文中提取准确的材料数据。

Extracting accurate materials data from research papers with conversational language models and prompt engineering.

作者信息

机构信息

出版信息

相似文献

引用本文的文献

本文引用的文献

文献AI研究员

用中文搜PubMed

文档翻译