Cavendish Laboratory, University of Cambridge, J. J. Thomson Avenue, Cambridge CB3 0HE, U.K.
ISIS Neutron and Muon Source, Rutherford Appleton Laboratory, Harwell Science and Innovation Campus, Didcot, Oxfordshire OX11 0QX, U.K.
J Chem Inf Model. 2023 Apr 10;63(7):1961-1981. doi: 10.1021/acs.jcim.2c01259. Epub 2023 Mar 20.
Text mining in the optical-materials domain is becoming increasingly important as the number of scientific publications in this area grows rapidly. Language models such as Bidirectional Encoder Representations from Transformers (BERT) have substantially advanced the state of the art across natural-language-processing (NLP) tasks. In this paper, we present two "materials-aware" text-based language models for optical research, OpticalBERT and OpticalPureBERT, which are trained on a large corpus of scientific literature in the optical-materials domain. These two models outperform BERT and previous state-of-the-art models on a variety of text-mining tasks concerning optical materials. We also release the first "materials-aware" table-based language model, OpticalTable-SQA, a querying facility that answers questions about optical materials using tabular information from this scientific domain. OpticalTable-SQA was obtained by fine-tuning the Tapas-SQA model on OpticalTableQA, a manually annotated data set curated specifically for this work. While preserving its sequential question-answering performance on general tables, OpticalTable-SQA significantly outperforms Tapas-SQA on optical-materials-related tables. All models and data sets are available to the optical-materials-science community.
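To make the idea of a manually annotated, sequential table question-answering data set more concrete, the following is a minimal Python sketch of what one annotated record might look like and how its labels could be sanity-checked before fine-tuning. Everything here is an assumption for illustration: the abstract does not describe the OpticalTableQA schema, the class and function names (Turn, TableConversation, check_annotations) are hypothetical, the field names answer_coordinates and answer_text merely mirror a convention commonly used for SQA-style table data, and the table values are invented placeholders rather than measured properties.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Turn:
    """One question in a sequence, labelled with the table cells that answer it."""
    question: str
    # (row, column) pairs, zero-based, counted over data rows (header excluded).
    answer_coordinates: List[Tuple[int, int]]
    # Cell text expected at each coordinate, in the same order.
    answer_text: List[str]


@dataclass
class TableConversation:
    """A table plus an ordered list of questions asked about it."""
    header: List[str]
    rows: List[List[str]]
    turns: List[Turn]


def check_annotations(conv: TableConversation) -> List[str]:
    """Return human-readable problems; an empty list means the labels are consistent."""
    problems = []
    for i, turn in enumerate(conv.turns):
        if len(turn.answer_coordinates) != len(turn.answer_text):
            problems.append(f"turn {i}: coordinate/text count mismatch")
            continue
        for (r, c), text in zip(turn.answer_coordinates, turn.answer_text):
            if not (0 <= r < len(conv.rows) and 0 <= c < len(conv.header)):
                problems.append(f"turn {i}: coordinate ({r}, {c}) is outside the table")
            elif conv.rows[r][c] != text:
                problems.append(
                    f"turn {i}: cell ({r}, {c}) holds {conv.rows[r][c]!r}, not {text!r}"
                )
    return problems


# Invented placeholder table; names and numbers are NOT real measurements.
example = TableConversation(
    header=["Material", "Refractive index", "Wavelength (nm)"],
    rows=[
        ["Material A", "1.50", "589"],
        ["Material B", "2.40", "589"],
    ],
    turns=[
        # The second question depends on the first, as in sequential QA.
        Turn("Which material has the highest refractive index?", [(1, 0)], ["Material B"]),
        Turn("At what wavelength was it measured?", [(1, 2)], ["589"]),
    ],
)

if __name__ == "__main__":
    print(check_annotations(example) or "annotations are consistent")
```

The second question only makes sense in the context of the first, which is the property that distinguishes sequential question answering from single-turn table lookup. A check of this kind catches labels that point outside the table or at a cell whose text disagrees with the recorded answer.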