Suppr超能文献

利用对话式语言模型和提示工程从研究论文中提取准确的材料数据。

Extracting accurate materials data from research papers with conversational language models and prompt engineering.

作者信息

Polak Maciej P, Morgan Dane

机构信息

Department of Materials Science and Engineering, University of Wisconsin-Madison, Madison, WI, 53706-1595, USA.

出版信息

Nat Commun. 2024 Feb 21;15(1):1569. doi: 10.1038/s41467-024-45914-8.

Abstract

There has been a growing effort to replace manual extraction of data from research papers with automated data extraction based on natural language processing, language models, and recently, large language models (LLMs). Although these methods enable efficient extraction of data from large sets of research papers, they require a significant amount of up-front effort, expertise, and coding. In this work, we propose the ChatExtract method that can fully automate very accurate data extraction with minimal initial effort and background, using an advanced conversational LLM. ChatExtract consists of a set of engineered prompts applied to a conversational LLM that both identify sentences with data, extract that data, and assure the data's correctness through a series of follow-up questions. These follow-up questions largely overcome known issues with LLMs providing factually inaccurate responses. ChatExtract can be applied with any conversational LLMs and yields very high quality data extraction. In tests on materials data, we find precision and recall both close to 90% from the best conversational LLMs, like GPT-4. We demonstrate that the exceptional performance is enabled by the information retention in a conversational model combined with purposeful redundancy and introducing uncertainty through follow-up prompts. These results suggest that approaches similar to ChatExtract, due to their simplicity, transferability, and accuracy are likely to become powerful tools for data extraction in the near future. Finally, databases for critical cooling rates of metallic glasses and yield strengths of high entropy alloys are developed using ChatExtract.

摘要

人们越来越努力用基于自然语言处理、语言模型,以及最近的大语言模型(LLM)的自动数据提取来取代从研究论文中手动提取数据的方式。尽管这些方法能够从大量研究论文中高效提取数据,但它们需要大量的前期工作、专业知识和编码。在这项工作中,我们提出了ChatExtract方法,该方法可以使用先进的对话式大语言模型,以最少的初始工作量和背景知识,完全自动化地进行非常准确的数据提取。ChatExtract由一组应用于对话式大语言模型的设计好的提示组成,这些提示既能识别包含数据的句子,提取数据,又能通过一系列后续问题确保数据的正确性。这些后续问题在很大程度上克服了大语言模型提供事实不准确回答的已知问题。ChatExtract可以与任何对话式大语言模型一起应用,并能产生非常高质量的数据提取结果。在材料数据测试中,我们发现像GPT - 4这样最好的对话式大语言模型的精确率和召回率都接近90%。我们证明,对话模型中的信息保留、有目的的冗余以及通过后续提示引入不确定性,使得该方法具有卓越的性能。这些结果表明,由于其简单性、可转移性和准确性,类似于ChatExtract的方法在不久的将来可能会成为强大的数据提取工具。最后,使用ChatExtract开发了金属玻璃临界冷却速率和高熵合金屈服强度的数据库。

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bddc/10882009/0f88c35fa029/41467_2024_45914_Fig1_HTML.jpg

文献AI研究员

20分钟写一篇综述,助力文献阅读效率提升50倍。

立即体验

用中文搜PubMed

大模型驱动的PubMed中文搜索引擎

马上搜索

文档翻译

学术文献翻译模型,支持多种主流文档格式。

立即体验