Cavendish Laboratory, University of Cambridge, J. J. Thomson Avenue, Cambridge, CB3 0HE, UK.
ISIS Neutron and Muon Source, STFC Rutherford Appleton Laboratory, Harwell Science and Innovation Campus, Didcot, Oxfordshire, OX11 0QX, UK.
Sci Data. 2022 Oct 22;9(1):648. doi: 10.1038/s41597-022-01752-1.
An auto-generated thermoelectric-materials database is presented, containing 22,805 data records extracted from the scientific literature and spanning 10,641 unique chemical names. Each record pairs a chemical entity with one of five key thermoelectric properties (thermoelectric figure of merit, ZT; thermal conductivity, κ; Seebeck coefficient, S; electrical conductivity, σ; or power factor, PF), each linked to its corresponding recorded temperature, T. The database was generated using the automatic sentence-parsing capabilities of the chemistry-aware natural language processing toolkit ChemDataExtractor 2.0, adapted for the thermoelectric-materials domain and applied after a rule-based sentence-simplification step. Data were mined from the text of 60,843 scientific papers sourced from three scientific publishers: Elsevier, the Royal Society of Chemistry, and Springer. To the best of our knowledge, this is the first automatically generated database of thermoelectric materials and their properties from existing literature. Evaluation gave a precision of 82.25%, and the database has been made publicly available to facilitate the application of data science in the thermoelectric-materials domain, for analysis, design, and prediction.
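As a rough illustration of how the five properties named in the abstract relate, the sketch below models one record as a compound, a property, a value, and a temperature, and applies the standard textbook relations PF = S²σ and ZT = S²σT/κ. The class name, field names, and numerical values are hypothetical, chosen for illustration only; they are not the schema or contents of the published database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThermoelectricRecord:
    """One extracted data point. Field names are hypothetical, not the database schema."""
    compound: str         # chemical name as extracted from the text
    prop: str             # one of "ZT", "kappa", "S", "sigma", "PF"
    value: float          # property value in SI units (ZT is dimensionless)
    temperature_k: float  # temperature the value was reported at, in kelvin


def power_factor(seebeck: float, sigma: float) -> float:
    """PF = S^2 * sigma, with S in V/K and sigma in S/m; result in W m^-1 K^-2."""
    return seebeck ** 2 * sigma


def figure_of_merit(seebeck: float, sigma: float, kappa: float,
                    temperature_k: float) -> float:
    """ZT = S^2 * sigma * T / kappa, with kappa in W m^-1 K^-1 (dimensionless result)."""
    return power_factor(seebeck, sigma) * temperature_k / kappa


if __name__ == "__main__":
    # Illustrative values only, not taken from the database.
    s, sigma, kappa, t = 200e-6, 1.0e5, 1.5, 300.0
    record = ThermoelectricRecord(
        compound="ExampleCompound",
        prop="ZT",
        value=figure_of_merit(s, sigma, kappa, t),
        temperature_k=t,
    )
    print(record)
```

Because S, σ, and κ all vary with temperature, values combined this way should be reported at the same T, which is why each record carries its temperature alongside the property value.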