Annotated textual dataset PV600 of perovskite bandgaps for information extraction from literature.

Authors

Sipilä Matilda, Mehryary Farrokh, Pyysalo Sampo, Ginter Filip, Todorović Milica

Affiliations

University of Turku, Department of Mechanical and Materials Engineering, Turku, 20014, Finland.

University of Turku, TurkuNLP, Department of Computing, Turku, 20014, Finland.

Publication information

Sci Data. 2025 Aug 11;12(1):1401. doi: 10.1038/s41597-025-05637-x.

DOI:10.1038/s41597-025-05637-x
PMID:40790129
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12339702/
Abstract

Scientific literature provides a variety of experimental and theoretical data which, if extracted, could offer new opportunities for data-driven discovery in materials research. Natural language processing (NLP) tools enable information extraction (IE) of structured information from unstructured text. The performance of IE tools needs to be systematically evaluated on manually annotated test datasets, but there are few publicly available annotated materials science datasets and none on perovskites, promising materials for photovoltaics. We present a perovskite literature dataset with 600 text segments extracted from an open access manuscript corpus. The PV600 dataset focuses on five inorganic and hybrid perovskites and contains 227 manually annotated bandgap values identified from 188 segments. Moreover, we recorded the bandgap type, whether it was experimental, computational, from the literature, or from unknown source. To demonstrate the intended use of the dataset, we applied it to evaluate the IE performance of a question answering (QA) method, a rule-based method, and generative language models (LLMs). We exhibit a further application in testing segment preselection with LLMs in IE.
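
To make the evaluation use case concrete, the following is a minimal sketch of how a simple extractor could be scored against manually annotated bandgap values. It is not the rule-based method, the QA method, or any LLM pipeline from the paper: the regular expressions, the function names `extract_bandgaps` and `precision_recall_f1`, the example segments, and the gold annotations are all invented for illustration, and the actual PV600 file format is not assumed here.

```python
import re

# Hypothetical rules (assumptions, not the paper's method):
# a segment must mention "bandgap" / "band gap" / "band-gap",
# and every number followed by "eV" in it is taken as a candidate value.
BANDGAP_MENTION = re.compile(r"band\s*-?\s*gap", re.IGNORECASE)
EV_VALUE = re.compile(r"(\d+(?:\.\d+)?)\s*eV\b")


def extract_bandgaps(segment):
    """Return the set of eV values found in a segment that mentions a bandgap."""
    if not BANDGAP_MENTION.search(segment):
        return set()
    return {float(m.group(1)) for m in EV_VALUE.finditer(segment)}


def precision_recall_f1(predicted, gold):
    """Score sets of (segment_id, value) pairs against the gold annotations."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Invented toy segments standing in for annotated text segments.
segments = {
    0: "The band gap of MAPbI3 was measured to be 1.55 eV.",
    1: "CsPbBr3 films showed a bandgap of 2.3 eV, compared with 1.73 eV for CsPbI3.",
    2: "Photoelectrons with a kinetic energy of 5.0 eV were detected.",
    3: "The optical bandgap is close to 1.6 eV under an excitation of 3.1 eV.",
}

# Invented gold annotations: (segment_id, bandgap value in eV).
gold = {(0, 1.55), (1, 2.3), (1, 1.73), (3, 1.6)}

predicted = {
    (seg_id, value)
    for seg_id, text in segments.items()
    for value in extract_bandgaps(text)
}

precision, recall, f1 = precision_recall_f1(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy run the excitation energy in segment 3 is wrongly picked up as a bandgap, which lowers precision while recall stays perfect. Telling such values apart is the kind of error a manually annotated test set is meant to expose.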

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/67f2/12339702/80b08793aa55/41597_2025_5637_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/67f2/12339702/a30eec6b7cd4/41597_2025_5637_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/67f2/12339702/4d1a745cb159/41597_2025_5637_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/67f2/12339702/bdf1ccacb9e9/41597_2025_5637_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/67f2/12339702/999e15666fb3/41597_2025_5637_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/67f2/12339702/1982a735e752/41597_2025_5637_Fig6_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/67f2/12339702/d4e262bf782a/41597_2025_5637_Fig7_HTML.jpg
