Zhai Zenan, Druckenbrodt Christian, Thorne Camilo, Akhondi Saber A, Nguyen Dat Quoc, Cohn Trevor, Verspoor Karin
School of Computing and Information Systems, The University of Melbourne, Melbourne, Australia.
Elsevier-Data Science, Life Science, Amsterdam, The Netherlands.
J Cheminform. 2021 Dec 11;13(1):97. doi: 10.1186/s13321-021-00568-2.
Chemical patents are a commonly used channel for disclosing novel compounds and reactions, and hence represent important resources for chemical and pharmaceutical research. Key chemical data in patents is often presented in tables. Both the number and the size of tables in patent documents can be very large. In addition, tables in patents can present various types of information, including spectroscopic and physical data, or the pharmacological use and effects of chemicals. Since images of Markush structures and merged cells are commonly used in these tables, their structure also shows substantial variation. This heterogeneity in the content and structure of tables in chemical patents makes relevant information difficult to find. We therefore propose a new text mining task of automatically categorising tables in chemical patents based on their contents. Categorising tables by the nature of their content can help to identify tables containing key information, improving the accessibility of information in patents that is highly relevant for new inventions. To develop and evaluate methods for the table classification task, we built a new dataset, called CHEMTABLES, which consists of 788 chemical patent tables labelled with their content type. We introduce this dataset in detail. We further establish strong baselines for the table classification task in chemical patents by applying state-of-the-art neural network models developed for natural language processing, including TabNet, ResNet and Table-BERT, to CHEMTABLES. The best performing model, Table-BERT, achieves a micro-averaged F1 score of 88.66 on the table classification task. The CHEMTABLES dataset is publicly available at https://doi.org/10.17632/g7tjh7tbrj.3, subject to the CC BY NC 3.0 license. Code and models evaluated in this work are available in a GitHub repository: https://github.com/zenanz/ChemTables.
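As an illustration of the evaluation metric named in the abstract, the following is a minimal sketch of micro-averaged F1 for a single-label classification task. It is not the authors' evaluation code: the function name micro_f1 and the example label strings are hypothetical, and the sketch assumes each table receives exactly one gold label and one predicted label.

```python
from collections import Counter


def micro_f1(gold, pred):
    """Micro-averaged F1: pool TP/FP/FN over all classes, then compute F1 once."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1      # correct prediction for class g
        else:
            fp[p] += 1      # predicted class p wrongly
            fn[g] += 1      # missed an instance of class g
    tp_sum = sum(tp.values())
    fp_sum = sum(fp.values())
    fn_sum = sum(fn.values())
    denom = 2 * tp_sum + fp_sum + fn_sum
    return 2 * tp_sum / denom if denom else 0.0


# Hypothetical labels for four tables (not the actual CHEMTABLES label set).
gold = ["spectroscopic", "physical", "pharmacological", "physical"]
pred = ["spectroscopic", "physical", "physical", "physical"]
print(round(100 * micro_f1(gold, pred), 2))  # 75.0
```

Under the single-label assumption, every misclassification counts once as a false positive and once as a false negative, so micro-averaged F1 coincides with plain accuracy (3 of 4 correct in the example).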