MaTableGPT: GPT-Based Table Data Extractor from Materials Science Literature.

Author information

Yi Gyeong Hoon, Choi Jiwoo, Song Hyeongyun, Miano Olivia, Choi Jaewoong, Bang Kihoon, Lee Byungju, Sohn Seok Su, Buttler David, Hiszpanski Anna, Han Sang Soo, Kim Donghun

Affiliations

Computational Science Research Center, Korea Institute of Science and Technology, Seoul, 02792, Republic of Korea.

Department of Materials Science and Engineering, Korea University, Seoul, 02841, Republic of Korea.

Publication information

Adv Sci (Weinh). 2025 Apr;12(16):e2408221. doi: 10.1002/advs.202408221. Epub 2025 Jan 24.

Abstract

Efficiently extracting data from tables in the scientific literature is pivotal for building large-scale databases. However, the tables reported in materials science papers take highly diverse forms, which makes rule-based extraction ineffective. To overcome this challenge, the study presents MaTableGPT, a GPT-based extractor of table data from the materials science literature. MaTableGPT relies on two key strategies: table data representation and table splitting, which improve GPT comprehension, and follow-up questions, which filter out hallucinated information. When applied to a large body of water splitting catalysis literature, MaTableGPT achieves an extraction accuracy (total F1 score) of up to 96.8%. Through comprehensive evaluations of GPT usage cost, labeling cost, and extraction accuracy for the zero-shot, few-shot, and fine-tuning learning methods, the study presents a Pareto-front mapping in which few-shot learning is found to be the most balanced solution, owing to both its high extraction accuracy (total F1 score >95%) and low cost (GPT usage cost of 5.97 US dollars and labeling cost of 10 I/O paired examples). Statistical analyses of the database generated by MaTableGPT reveal the distribution of overpotential and elemental utilization across the catalysts reported in the water splitting literature.
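
The Pareto-front comparison mentioned in the abstract can be illustrated with a minimal sketch. It assumes a simplified setting with a single cost axis and a single accuracy axis, whereas the study weighs GPT usage cost, labeling cost, and extraction accuracy together. The `Candidate` type, the `pareto_front` helper, and all numbers below are hypothetical placeholders chosen for illustration; they are not values or code from the paper.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """A hypothetical extraction setup scored on one cost and one accuracy axis."""
    name: str
    cost: float  # lower is better (arbitrary units)
    f1: float    # higher is better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Return the candidates that no other candidate dominates.

    A candidate is dominated when another one is at least as good on both
    axes and strictly better on at least one of them.
    """
    front = []
    for c in candidates:
        dominated = any(
            o.cost <= c.cost
            and o.f1 >= c.f1
            and (o.cost < c.cost or o.f1 > c.f1)
            for o in candidates
        )
        if not dominated:
            front.append(c)
    # Sort by cost so the front reads from cheapest to most expensive.
    return sorted(front, key=lambda c: c.cost)


# Illustrative placeholder values only, NOT results reported in the study.
candidates = [
    Candidate("A", cost=1.0, f1=0.80),
    Candidate("B", cost=5.0, f1=0.95),
    Candidate("C", cost=6.0, f1=0.90),   # costs more than B and scores lower
    Candidate("D", cost=50.0, f1=0.97),
]

front = pareto_front(candidates)
print([c.name for c in front])
```

In this toy setting, a "balanced" choice is a point on the front that gives up little accuracy for a large cost saving, which is the kind of trade-off the abstract attributes to few-shot learning.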
