Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands.
Elsevier B.V., Radarweg 29, Amsterdam, The Netherlands.
Database (Oxford). 2019 Jan 1;2019:baz001. doi: 10.1093/database/baz001.
In commercial research and development projects, new chemical compounds are often first disclosed publicly in patents. Only a small proportion of these compounds are later published in journals, usually a few years after the patent. Patent authorities make the patents available but do not provide systematic, continuous chemical annotations. Content databases such as Elsevier's Reaxys provide such annotations, mostly through manual excerption, which is time-consuming and costly. Automatic text-mining approaches help overcome some of the limitations of the manual process. Several text-mining approaches exist for extracting chemical entities from patents. Most of them were developed on sub-sections of patent documents and focus on mentions of compounds. Less attention has been given to the relevancy of a compound in a patent. Relevancy depends on the patent's context: a relevant compound plays a major role within the patent. Identifying relevant compounds reduces the size of the extracted data and improves the usefulness of patent resources (e.g. it supports identifying the main compounds). Annotators of databases like Reaxys annotate only relevant compounds. In this study, we design an automated system that extracts chemical entities from patents and classifies their relevance. The gold-standard set contained 18 789 chemical entity annotations. Of these, 10% were relevant compounds, 88% were irrelevant and 2% were equivocal. Our compound recognition system was based on proprietary tools. The performance (F-score) of the system on compound recognition was 84% on the development set and 86% on the test set. The relevancy classification system had an F-score of 86% on the development set and 82% on the test set. Our system can thus extract chemical compounds from patents and classify their relevance with F-scores above 80% on the test set. This enables the extension of the Reaxys database by means of automation.
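For readers unfamiliar with the metric, the F-score figures reported above can be related to precision and recall with a short sketch. It assumes the balanced F1 variant (the abstract says only "F-score"), and the counts in the usage lines are hypothetical illustrations, not results from the study. The function names `precision_recall` and `f_score` are introduced here for illustration only.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision and recall from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta=1 (assumed here) is the harmonic mean of precision and recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts for illustration only (not taken from the paper).
tp, fp, fn = 80, 20, 10
p, r = precision_recall(tp, fp, fn)
print(f"precision={p:.3f} recall={r:.3f} F1={f_score(p, r):.3f}")
```

Because relevant compounds make up only about 10% of the annotations, plain accuracy would be dominated by the irrelevant class, which is one common reason to report an F-score instead.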