Centre Biological Engineering, University of Minho, Braga 4710-057, Portugal; Silicolife Lda, Braga 4715-387, Portugal.
Centre Biological Engineering, University of Minho, Braga 4710-057, Portugal.
Comput Methods Programs Biomed. 2018 Jun;159:125-134. doi: 10.1016/j.cmpb.2018.03.012. Epub 2018 Mar 14.
The volume of biomedical literature has grown steadily in recent years. Patent documents have followed the same trend and are important sources of biomedical knowledge, technical details and curated data, which are compiled during the granting process. The field of biomedical text mining (BioTM) develops solutions to the problems posed by the unstructured nature of natural language, which makes searching for information a challenging task. Several BioTM techniques can be applied to patents. Among them, Information Retrieval (IR) covers the processes by which relevant data are obtained from collections of documents. The main goal of this work was to build a pipeline that addresses IR tasks over patent repositories, making these documents amenable to BioTM tasks.
The pipeline was developed within @Note2, an open-source computational framework for BioTM, by adding several modules to the core libraries, including patent metadata and full-text retrieval, PDF-to-text conversion and optical character recognition. User interfaces for the main operations were also developed and made available in a new @Note2 plug-in.
The integration of these tools into @Note2 opens opportunities to run BioTM tools over patent texts, including Information Extraction tasks such as Named Entity Recognition and Relation Extraction. We demonstrated the pipeline's main functions in a case study using an available benchmark dataset from the BioCreative challenges. We also illustrate the use of the plug-in with a user query related to the production of vanillin.
This work makes all the relevant content from patents available to the scientific community, drastically reducing the time required to retrieve it, and provides graphical interfaces that ease the use of these tools.
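The stages named in the abstract (metadata retrieval, full-text retrieval, PDF-to-text conversion and optical character recognition) could be chained roughly as in the Python sketch below. This is an illustration only, not the @Note2 implementation: `PatentRecord`, `process_patent`, the four injected callables and the `min_chars` threshold are all hypothetical names. Using OCR as a fallback when the PDF yields little or no text is likewise an assumption, since the abstract does not state how the two steps are combined.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PatentRecord:
    """Hypothetical container for what the pipeline gathers about one patent."""
    patent_id: str
    title: str = ""
    text: str = ""
    text_source: str = "none"  # "pdf", "ocr", or "none"


def process_patent(
    patent_id: str,
    fetch_metadata: Callable[[str], dict],
    fetch_pdf: Callable[[str], Optional[bytes]],
    pdf_to_text: Callable[[bytes], str],
    ocr: Callable[[bytes], str],
    min_chars: int = 20,
) -> PatentRecord:
    """Run the four stages for one patent.

    Every callable is a stand-in supplied by the caller; none of them
    corresponds to a real @Note2 or patent-office API.
    """
    # Stage 1: metadata retrieval.
    meta = fetch_metadata(patent_id)
    record = PatentRecord(patent_id, title=meta.get("title", ""))

    # Stage 2: full-text (PDF) retrieval; the document may be unavailable.
    pdf = fetch_pdf(patent_id)
    if pdf is None:
        return record

    # Stage 3: PDF-to-text conversion.
    text = pdf_to_text(pdf).strip()
    if len(text) >= min_chars:
        record.text, record.text_source = text, "pdf"
    else:
        # Stage 4 (assumed fallback): OCR for scanned PDFs without a text layer.
        record.text, record.text_source = ocr(pdf).strip(), "ocr"
    return record
```

Passing the stages in as callables keeps the sketch independent of any particular patent repository, PDF library or OCR engine, so each can be replaced or stubbed out in isolation.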