Cavendish Laboratory, University of Cambridge, J. J. Thomson Avenue, Cambridge, CB3 0HE, UK.
ISIS Neutron and Muon Source, Rutherford Appleton Laboratory, Harwell Science and Innovation Campus, Didcot, Oxfordshire, OX11 0QX, UK.
Sci Data. 2022 May 3;9(1):193. doi: 10.1038/s41597-022-01294-6.
Large-scale databases of semiconductor band gap information curated from the scientific literature are highly useful for computational databases and for general semiconductor materials research. This work presents an auto-generated database of 100,236 semiconductor band gap records, together with their associated temperature information, extracted from 128,776 journal articles. The database was produced using ChemDataExtractor version 2.0, a 'chemistry-aware' software toolkit that uses Natural Language Processing (NLP) and machine-learning methods to extract chemical data from scientific documents. The modified Snowball algorithm of ChemDataExtractor has been extended to incorporate nested models, optimized by hyperparameter analysis, and used together with the default NLP parsers to optimize the quality of the database. Evaluation of the database shows a weighted precision of 84% and a weighted recall of 65%. To the best of our knowledge, this is the largest open-source non-computational band gap database to date. Database records are available in CSV, JSON, and MongoDB formats, which are machine readable and can assist data mining and semiconductor materials discovery.
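As a rough illustration of how machine-readable records of this kind might be used, the Python sketch below parses a small CSV sample, filters it by band gap range, and combines the reported precision and recall into an F-score. The column names (`compound`, `band_gap_eV`, `temperature_K`) and the three sample rows are assumptions made for this example; they are not taken from the published files, whose schema may differ. The abstract does not report an F-score, so the value computed here is simply the harmonic mean of the two reported figures.

```python
import csv
import io

# Illustrative placeholder rows with hypothetical column names.
# These are NOT records copied from the published database.
SAMPLE_CSV = """compound,band_gap_eV,temperature_K
GaAs,1.42,300
Si,1.12,300
ZnO,3.37,
"""


def load_records(text):
    """Parse CSV text into dicts; an empty temperature field becomes None."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        temperature = row["temperature_K"]
        records.append({
            "compound": row["compound"],
            "band_gap_eV": float(row["band_gap_eV"]),
            "temperature_K": float(temperature) if temperature else None,
        })
    return records


def filter_by_gap(records, low, high):
    """Keep records whose band gap lies in the closed interval [low, high] eV."""
    return [r for r in records if low <= r["band_gap_eV"] <= high]


def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    records = load_records(SAMPLE_CSV)
    for record in filter_by_gap(records, 1.0, 2.0):
        print(record["compound"], record["band_gap_eV"], record["temperature_K"])
    # Reported weighted precision 84% and weighted recall 65%.
    print(round(f_score(0.84, 0.65), 3))
```

Treating a missing temperature as `None` rather than a default value keeps records without a reported temperature distinguishable from those measured at a stated temperature.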