Annotated textual dataset PV600 of perovskite bandgaps for information extraction from literature.

Authors

Sipilä Matilda, Mehryary Farrokh, Pyysalo Sampo, Ginter Filip, Todorović Milica

Affiliations

University of Turku, Department of Mechanical and Materials Engineering, Turku, 20014, Finland.

University of Turku, TurkuNLP, Department of Computing, Turku, 20014, Finland.

Publication information

Sci Data. 2025 Aug 11;12(1):1401. doi: 10.1038/s41597-025-05637-x.

Abstract

Scientific literature contains a wide variety of experimental and theoretical data which, if extracted, could offer new opportunities for data-driven discovery in materials research. Natural language processing (NLP) tools enable information extraction (IE), the retrieval of structured information from unstructured text. The performance of IE tools needs to be systematically evaluated on manually annotated test datasets, but few annotated materials science datasets are publicly available, and none cover perovskites, which are promising materials for photovoltaics. We present a perovskite literature dataset of 600 text segments extracted from an open access manuscript corpus. The PV600 dataset focuses on five inorganic and hybrid perovskites and contains 227 manually annotated bandgap values identified in 188 segments. We also recorded the type of each bandgap value: experimental, computational, taken from the literature, or from an unknown source. To demonstrate the intended use of the dataset, we applied it to evaluate the IE performance of a question answering (QA) method, a rule-based method, and generative large language models (LLMs). We also demonstrate a further application: testing LLM-based preselection of segments for IE.
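
As a rough illustration of what evaluating a rule-based IE method against annotated bandgap values can look like, the sketch below extracts numbers followed by "eV" from a text segment and scores them against gold annotations. This is not the authors' method: the regular expression, the function names `extract_bandgaps` and `precision_recall`, and the example sentence are all assumptions made for illustration, and the sentence is invented rather than taken from PV600.

```python
import re
from typing import Iterable, List, Tuple

# Hypothetical rule: a decimal number directly followed by the unit "eV".
# Real bandgap mentions are more varied (ranges, "meV", tables, etc.).
BANDGAP_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*eV\b")


def extract_bandgaps(segment: str) -> List[float]:
    """Return every numeric value in the segment that is followed by 'eV'."""
    return [float(value) for value in BANDGAP_PATTERN.findall(segment)]


def precision_recall(predicted: Iterable[float],
                     gold: Iterable[float]) -> Tuple[float, float]:
    """Score one segment by comparing predicted and annotated values as sets."""
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    return precision, recall


# Invented example segment and annotation (not from the PV600 dataset).
segment = "The MAPbI3 film shows a bandgap of 1.55 eV, while CsPbBr3 has 2.3 eV."
gold = [1.55, 2.3]

predicted = extract_bandgaps(segment)
print(predicted, precision_recall(predicted, gold))
```

A set-based comparison per segment is the simplest possible scoring choice; a real evaluation would also need to link each value to its material and bandgap type.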

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/67f2/12339702/80b08793aa55/41597_2025_5637_Fig1_HTML.jpg
